├── 0_Running_TensorFlow_In_SageMaker.ipynb
├── 0_Running_TensorFlow_In_SageMaker_tf2.ipynb
├── 1_Monitoring_your_TensorFlow_scripts.ipynb
├── 2_Using_Pipemode_input_for_big_datasets.ipynb
├── 3_Distributed_training_with_Horovod.ipynb
├── 4_Deploying_your_TensorFlow_model.ipynb
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── images
│   ├── TFRecord.png
│   └── tensorboard.png
├── local_mode_setup.sh
└── training_script
    ├── cifar10_keras.py
    ├── cifar10_keras_dist_solution.py
    ├── cifar10_keras_pipe_solution.py
    ├── cifar10_keras_sm_solution.py
    ├── cifar10_keras_sm_tf2.py
    ├── cifar10_keras_sm_tf2_solution.py
    ├── cifar10_keras_tensorboard_solution.py
    └── cifar10_keras_tf2.py


/0_Running_TensorFlow_In_SageMaker.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# [Module 1] Train a Keras Sequential Model\n",
  8 |     "\n",
  9 |     "This notebook walks through training a Keras Sequential model on SageMaker, step by step. The model is a simple deep CNN (Convolutional Neural Network), identical to the one in [the Keras examples](https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py).\n",
 10 |     "- For reference, this model reaches a validation accuracy of about 75% after 25 epochs and about 79% after 50 epochs.\n",
 11 |     "- To save time, this workshop trains for only 5 epochs. (The Horovod-based distributed training runs for 10 epochs.)"
 12 |    ]
 13 |   },
 14 |   {
 15 |    "cell_type": "markdown",
 16 |    "metadata": {},
 17 |    "source": [
 18 |     "## The dataset\n",
 19 |     "\n",
 20 |     "The [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) is one of the best-known datasets in machine learning.\n",
 21 |     "It consists of 60,000 images of 32x32 pixels in 10 different classes (6,000 images per class).\n",
 22 |     "The figure below shows 10 randomly sampled images from each class.\n",
 23 |     "\n",
 24 |     "![cifar10](https://maet3608.github.io/nuts-ml/_images/cifar10.png)\n",
 25 |     "\n",
 26 |     "In this exercise you will train a deep CNN to perform image classification. In the following notebooks\n",
 27 |     "you will compare the results of File Mode, Pipe Mode, and Horovod-based distributed training."
 28 |    ]
 29 |   },
 30 |   {
 31 |    "cell_type": "markdown",
 32 |    "metadata": {},
 33 |    "source": [
 34 |     "## Getting the data\n",
 35 |     "Use the AWS CLI (Command Line Interface) command below to copy the TFRecord dataset stored in S3 (Amazon Simple Storage Service) to your local notebook instance.\n",
 36 |     "The S3 path is `s3://floor28/data/cifar10`.\n",
 37 |     "\n",
 38 |     "### What is TFRecord?\n",
 39 |     "- It is the binary format that Google officially recommends when modeling with a TensorFlow backend.\n",
 40 |     "- It holds input data serialized as TensorFlow protocol buffers.\n",
 41 |     "- It is useful for streaming large datasets quickly with multithreading. (Because all the data is stored in a single block of memory, loading takes much less time than when the inputs are stored as individual files.)\n",
 42 |     "- It is a collection of Example objects (an array of Examples).\n",
 43 |     "- The figure below shows an example TFRecord made up of $n$ samples with $m$-dimensional features.\n",
 44 |     "![TFRecord](./images/TFRecord.png)"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "code",
 49 |    "execution_count": null,
 50 |    "metadata": {},
 51 |    "outputs": [],
 52 |    "source": [
 53 |     "!aws s3 cp --recursive s3://floor28/data/cifar10 ./data"
 54 |    ]
 55 |   },
 56 |   {
 57 |    "cell_type": "markdown",
 58 |    "metadata": {},
 59 |    "source": [
 60 |     "## Run the training locally"
 61 |    ]
 62 |   },
 63 |   {
 64 |    "cell_type": "markdown",
 65 |    "metadata": {},
 66 |    "source": [
 67 |     "The script takes the arguments needed to train the model. They are listed below.\n",
 68 |     "\n",
 69 |     "1. `model_dir` - the path where logs and checkpoints are saved\n",
 70 |     "2. `train, validation, eval` - the paths where the TFRecord datasets are stored\n",
 71 |     "3. `epochs` - the number of epochs\n",
 72 |     "\n",
 73 |     "Run the command below to train for just 1 epoch in the local notebook instance environment, **<font color='red'>without calling any SageMaker API</font>**. For reference, this takes 2 minutes 20 seconds to 2 minutes 40 seconds on a MacBook Pro (15-inch, 2018) with a 2.6GHz Core i7 and 16GB of memory."
 74 |    ]
 75 |   },
 76 |   {
 77 |    "cell_type": "code",
 78 |    "execution_count": null,
 79 |    "metadata": {},
 80 |    "outputs": [],
 81 |    "source": [
 82 |     "%%time\n",
 83 |     "!mkdir -p logs\n",
 84 |     "!python training_script/cifar10_keras.py --model_dir ./logs \\\n",
 85 |     "                                         --train data/train \\\n",
 86 |     "                                         --validation data/validation \\\n",
 87 |     "                                         --eval data/eval \\\n",
 88 |     "                                         --epochs 1\n",
 89 |     "!rm -rf logs"
 90 |    ]
 91 |   },
 92 |   {
 93 |    "cell_type": "markdown",
 94 |    "metadata": {},
 95 |    "source": [
 96 |     "**<font color='blue'>This script is running in a notebook on SageMaker, but it runs the same way on your local computer as long as Python and Jupyter Notebook are properly installed.</font>**"
 97 |    ]
 98 |   },
 99 |   {
100 |    "cell_type": "markdown",
101 |    "metadata": {
102 |     "toc-hr-collapsed": false
103 |    },
104 |    "source": [
105 |     "## Use TensorFlow Script Mode\n",
106 |     "\n",
107 |     "The Amazon SageMaker Python SDK supports **script mode** for TensorFlow 1.11 and later. Script mode has the following advantages over the older legacy mode.\n",
108 |     "\n",
109 |     "* A script mode training script looks much more like a training script you would normally write for TensorFlow, so you can run existing TensorFlow training scripts with minimal changes. Adapting a TensorFlow training script is therefore easier than in legacy mode.\n",
110 |     "    - Legacy mode requires the following functions, which are based on the TensorFlow Estimator API.\n",
111 |     "        - One of the following functions must be included.\n",
112 |     "            - `model_fn`: defines the model to train.\n",
113 |     "            - `keras_model_fn`: defines the tf.keras model to train.\n",
114 |     "            - `estimator_fn`: defines the tf.estimator.Estimator to train.\n",
115 |     "        - `train_input_fn`: loads and preprocesses the training data.\n",
116 |     "        - `eval_input_fn`: loads and preprocesses the evaluation data.\n",
117 |     "        - (Optional) `serving_input_fn`: defines the features passed to the model during prediction. This function is used only during training, but it is required when deploying the model to a SageMaker endpoint.\n",
118 |     "    - An `if __name__ == '__main__':` block cannot be defined, which makes debugging difficult.\n",
119 |     "    \n",
120 |     "* Script mode supports Python 2.7 and Python 3.6.\n",
121 |     "\n",
122 |     "* Script mode also **supports Horovod-based distributed training**.\n",
123 |     "\n",
124 |     "For details on writing training scripts for TensorFlow script mode, and on using its estimators and models, see\n",
125 |     "https://sagemaker.readthedocs.io/en/stable/using_tf.html"
126 |    ]
127 |   },
128 |   {
129 |    "cell_type": "markdown",
130 |    "metadata": {},
131 |    "source": [
132 |     "### Preparing your script for training in SageMaker\n",
133 |     "\n",
134 |     "A training script for SageMaker script mode is very similar to one you would run outside SageMaker.\n",
135 |     "SageMaker runs the training script with a single argument, `model_dir`, which is an S3 path used for logs and model artifacts.\n",
136 |     "\n",
137 |     "On a SageMaker training instance, the data stored in S3 is downloaded into the training container and used for training. The S3 data paths are mapped to paths inside the container through container environment variables.\n",
138 |     "\n",
139 |     "You can access useful properties of the training environment through various environment variables.\n",
140 |     "For this script, three data channels named `train, validation, eval` are sent to the script.\n",
141 |     "\n",
142 |     "**Make a copy of `training_script/cifar10_keras.py` and save it as `training_script/cifar10_keras_sm.py`.**\n",
143 |     "\n",
144 |     "Once you have the copy, work through the following tasks yourself, one step at a time.\n",
145 |     "\n",
146 |     "----\n",
147 |     "### TODO 1.\n",
148 |     "In `cifar10_keras_sm.py`, modify the train, validation, and eval arguments so that they take their default values from the SageMaker environment variables SM_CHANNEL_TRAIN, SM_CHANNEL_VALIDATION, and SM_CHANNEL_EVAL.\n",
149 |     "\n",
150 |     "Modify the following arguments inside the `if __name__ == '__main__':` block of `cifar10_keras_sm.py`.\n",
151 |     "\n",
152 |     "```python\n",
153 |     "parser.add_argument(\n",
154 |     "        '--train',\n",
155 |     "        type=str,\n",
156 |     "        required=False,\n",
157 |     "        default=os.environ.get('SM_CHANNEL_TRAIN'), # <-- modified\n",
158 |     "        help='The directory where the CIFAR-10 input data is stored.')\n",
159 |     "parser.add_argument(\n",
160 |     "        '--validation',\n",
161 |     "        type=str,\n",
162 |     "        required=False,\n",
163 |     "        default=os.environ.get('SM_CHANNEL_VALIDATION'), # <-- modified\n",
164 |     "        help='The directory where the CIFAR-10 input data is stored.')\n",
165 |     "parser.add_argument(\n",
166 |     "        '--eval',\n",
167 |     "        type=str,\n",
168 |     "        required=False,\n",
169 |     "        default=os.environ.get('SM_CHANNEL_EVAL'), # <-- modified\n",
170 |     "        help='The directory where the CIFAR-10 input data is stored.')\n",
171 |     "```\n",
172 |     "\n",
173 |     "\n",
174 |     "The table below shows the S3 path and the container path that correspond to each environment variable.\n",
175 |     "\n",
176 |     "|  S3 path  |  Environment variable  |  Container path  |\n",
177 |     "| :---- | :---- | :----| \n",
178 |     "|  s3://bucket_name/prefix/train  |  `SM_CHANNEL_TRAIN`  | `/opt/ml/input/data/train`  |\n",
179 |     "|  s3://bucket_name/prefix/validation  |  `SM_CHANNEL_VALIDATION`  | `/opt/ml/input/data/validation`  |\n",
180 |     "|  s3://bucket_name/prefix/eval  |  `SM_CHANNEL_EVAL`  | `/opt/ml/input/data/eval`  |\n",
181 |     "|  s3://bucket_name/prefix/model.tar.gz  |  `SM_MODEL_DIR`  |  `/opt/ml/model`  |\n",
182 |     "|  s3://bucket_name/prefix/output.tar.gz  |  `SM_OUTPUT_DATA_DIR`  |  `/opt/ml/output/data`  |\n",
183 |     "\n",
184 |     "For example, `/opt/ml/input/data/train` is the directory inside the container where the training data is downloaded.\n",
185 |     "\n",
186 |     "For details, see the SageMaker Python SDK documentation below.<br>\n",
187 |     "(https://sagemaker.readthedocs.io/en/stable/using_tf.html#preparing-a-script-mode-training-script)\n",
188 |     "\n",
189 |     "\n",
190 |     "SageMaker does not pass the train, validation, and eval paths directly as arguments. Instead, the script reads them from environment variables, so these arguments are marked as not required.\n",
191 |     "\n",
192 |     "SageMaker passes useful environment variables to your training script. Some examples:\n",
193 |     "* `SM_MODEL_DIR`: a string representing the local path where the training job can write model artifacts. After training, the model artifacts in this path are uploaded to S3 for model hosting. Note that this differs from the model_dir argument passed to the training script, which is an S3 location. SM_MODEL_DIR is always set to `/opt/ml/model`.\n",
194 |     "* `SM_NUM_GPUS`: an integer representing the number of GPUs available on the host.\n",
195 |     "* `SM_OUTPUT_DATA_DIR`: a string representing the path of the directory where output artifacts are written. Output artifacts may include checkpoints, graphs, and other files to save, but not model artifacts. These output artifacts are compressed and uploaded to S3 under the same prefix as the model artifacts.\n",
196 |     "\n",
197 |     "To reduce network latency, this sample code saves the model checkpoints locally. They can be uploaded to S3 once training finishes.\n",
198 |     "\n",
199 |     "----\n",
200 |     "### TODO 2.\n",
201 |     "\n",
202 |     "Add the following argument inside the `if __name__ == '__main__':` block of `cifar10_keras_sm.py`.\n",
203 |     "\n",
204 |     "```python\n",
205 |     "parser.add_argument(\n",
206 |     "        '--model_output_dir',\n",
207 |     "        type=str,\n",
208 |     "        default=os.environ.get('SM_MODEL_DIR'))\n",
209 |     "```\n",
210 |     "\n",
211 |     "----\n",
212 |     "### TODO 3.\n",
213 |     "Change the save path of the `ModelCheckpoint` callback to the new path, as shown below.\n",
214 |     "\n",
215 |     "From:\n",
216 |     "```python\n",
217 |     "callbacks.append(ModelCheckpoint(args.model_dir + '/checkpoint-{epoch}.h5'))\n",
218 |     "```\n",
219 |     "To:\n",
220 |     "```python\n",
221 |     "callbacks.append(ModelCheckpoint(args.model_output_dir + '/checkpoint-{epoch}.h5'))\n",
222 |     "```\n",
223 |     "\n",
224 |     "----\n",
225 |     "### TODO 4.\n",
226 |     "Change the argument of the `save_model` function as shown below.\n",
227 |     "\n",
228 |     "From:  \n",
229 |     "```python\n",
230 |     "return save_model(model, args.model_dir)\n",
231 |     "```\n",
232 |     "To:  \n",
233 |     "```python\n",
234 |     "return save_model(model, args.model_output_dir)\n",
235 |     "```\n",
236 |     "\n",
237 |     "<font color='blue'>**If you get stuck in this exercise, refer to the solution file `training_script/cifar10_keras_sm_solution.py`.**</font>"
238 |    ]
239 |   },
240 |   {
241 |    "cell_type": "markdown",
242 |    "metadata": {},
243 |    "source": [
244 |     "### Test your script locally (just like on your laptop)\n",
245 |     "\n",
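246 |     "To test, run the new script with the same command as above and check that it runs as expected. <br>\n",
247 |     "When you call the SageMaker TensorFlow API, the environment variables are passed automatically, but when testing in a local Jupyter notebook you must set them manually. (See the example code below.)\n",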
248 |     "\n",
249 |     "```python\n",
250 |     "%env SM_MODEL_DIR=./logs\n",
251 |     "```"
252 |    ]
253 |   },
254 |   {
255 |    "cell_type": "code",
256 |    "execution_count": null,
257 |    "metadata": {},
258 |    "outputs": [],
259 |    "source": [
260 |     "%%time\n",
261 |     "!mkdir -p logs   \n",
262 |     "\n",
263 |     "# Number of GPUs on this machine\n",
264 |     "%env SM_NUM_GPUS=1\n",
265 |     "# Where to save the model\n",
266 |     "%env SM_MODEL_DIR=./logs\n",
267 |     "# Where the training data is\n",
268 |     "%env SM_CHANNEL_TRAIN=data/train\n",
269 |     "# Where the validation data is\n",
270 |     "%env SM_CHANNEL_VALIDATION=data/validation\n",
271 |     "# Where the evaluation data is\n",
272 |     "%env SM_CHANNEL_EVAL=data/eval\n",
273 |     "\n",
274 |     "!python training_script/cifar10_keras_sm.py --model_dir ./logs --epochs 1\n",
275 |     "!rm -rf logs"
276 |    ]
277 |   },
278 |   {
279 |    "cell_type": "markdown",
280 |    "metadata": {},
281 |    "source": [
282 |     "### Use SageMaker local for local testing\n",
283 |     "\n",
284 |     "Before starting full-scale training, debug first in local mode. Local mode pulls the container to the local instance and starts training right away, without creating a training instance, so you can validate your code much faster.\n",
285 |     "\n",
286 |     "With local mode in the Amazon SageMaker Python SDK, you can emulate CPU (single- and multi-instance) and GPU (single-instance) SageMaker training jobs by changing a single argument of the TensorFlow or MXNet estimator. It uses Docker Compose and NVIDIA Docker to do this.\n",
287 |     "When you call `estimator.fit()` to start the training job, the Amazon SageMaker TensorFlow container is downloaded from Amazon ECR to the local notebook instance.\n",
288 |     "\n",
289 |     "Training in local mode also makes it easy to monitor metrics such as GPU utilization, so you can check that your code makes good use of the hardware you are running on."
290 |    ]
291 |   },
292 |   {
293 |    "cell_type": "code",
294 |    "execution_count": null,
295 |    "metadata": {},
296 |    "outputs": [],
297 |    "source": [
298 |     "import os\n",
299 |     "import sagemaker\n",
300 |     "from sagemaker import get_execution_role\n",
301 |     "\n",
302 |     "sagemaker_session = sagemaker.Session()\n",
303 |     "\n",
304 |     "role = get_execution_role()"
305 |    ]
306 |   },
307 |   {
308 |    "cell_type": "markdown",
309 |    "metadata": {},
310 |    "source": [
311 |     "Use the `TensorFlow` class in `sagemaker.tensorflow` to create a TensorFlow Estimator instance from the SageMaker Python SDK.\n",
312 |     "Its arguments let you set hyperparameters and various other settings.\n",
313 |     "\n",
314 |     "For details, see the [documentation](https://sagemaker.readthedocs.io/en/stable/using_tf.html#training-with-tensorflow-estimator)."
315 |    ]
316 |   },
317 |   {
318 |    "cell_type": "code",
319 |    "execution_count": null,
320 |    "metadata": {},
321 |    "outputs": [],
322 |    "source": [
323 |     "from sagemaker.tensorflow import TensorFlow\n",
324 |     "estimator = TensorFlow(base_job_name='cifar10',\n",
325 |     "                       entry_point='cifar10_keras_sm.py',\n",
326 |     "                       source_dir='training_script',\n",
327 |     "                       role=role,\n",
328 |     "                       framework_version='1.14.0',\n",
329 |     "                       py_version='py3',\n",
330 |     "                       script_mode=True,\n",
331 |     "                       hyperparameters={'epochs' : 1},\n",
332 |     "                       train_instance_count=1, \n",
333 |     "                       train_instance_type='local')"
334 |    ]
335 |   },
336 |   {
337 |    "cell_type": "markdown",
338 |    "metadata": {},
339 |    "source": [
340 |     "Specify the three channels and the data paths to train on. **Because this runs in local mode, specify paths on the notebook instance instead of S3 paths.**"
341 |    ]
342 |   },
343 |   {
344 |    "cell_type": "code",
345 |    "execution_count": null,
346 |    "metadata": {},
347 |    "outputs": [],
348 |    "source": [
349 |     "%%time\n",
350 |     "estimator.fit({'train': 'file://data/train',\n",
351 |     "               'validation': 'file://data/validation',\n",
352 |     "               'eval': 'file://data/eval'})"
353 |    ]
354 |   },
355 |   {
356 |    "cell_type": "markdown",
357 |    "metadata": {},
358 |    "source": [
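359 |     "The first time the Estimator runs, it has to download the container image from the Amazon ECR repository, but after that training starts immediately. There is no need to wait for a separate training cluster to be provisioned. On subsequent runs, which you may need while iterating and testing, changes to your MXNet or TensorFlow script start running right away."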
360 |    ]
361 |   },
362 |   {
363 |    "cell_type": "markdown",
364 |    "metadata": {},
365 |    "source": [
366 |     "### Using SageMaker for faster training time\n",
367 |     "\n",
368 |     "This time, instead of local mode, create a GPU training instance for SageMaker training to shorten the training time.<br>\n",
369 |     "There are two differences from local mode: (1) `train_instance_type` must be set to the specific instance type you want instead of 'local', and (2) the training data must be uploaded to Amazon S3 and the training paths set to S3 paths.\n",
370 |     "\n",
371 |     "The SageMaker SDK provides a simple function for uploading to S3 (`Session.upload_data()`). Its return value is the S3 path where the data is stored.\n",
372 |     "If you need finer-grained control, use boto3 instead of the SageMaker SDK.\n",
373 |     "\n",
374 |     "*[Note]: Amazon EFS and Amazon FSx for Lustre are also supported for high-performance workloads. For details, see the AWS blog post below.<br>\n",
375 |     "https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems/*"
376 |    ]
377 |   },
378 |   {
379 |    "cell_type": "code",
380 |    "execution_count": null,
381 |    "metadata": {},
382 |    "outputs": [],
383 |    "source": [
384 |     "dataset_location = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-cifar10')\n",
385 |     "display(dataset_location)"
386 |    ]
387 |   },
388 |   {
389 |    "cell_type": "markdown",
390 |    "metadata": {},
391 |    "source": [
392 |     "Once the data is uploaded to S3, create a new Estimator. <br>\n",
393 |     "Copy the code below as-is, then change `train_instance_type='local'` to `train_instance_type='ml.p2.xlarge'` and\n",
394 |     "`hyperparameters={'epochs': 1}` to `hyperparameters={'epochs': 5}`.\n",
395 |     "\n",
396 |     "```python\n",
397 |     "from sagemaker.tensorflow import TensorFlow\n",
398 |     "estimator = TensorFlow(base_job_name='cifar10',\n",
399 |     "                       entry_point='cifar10_keras_sm.py',\n",
400 |     "                       source_dir='training_script',\n",
401 |     "                       role=role,\n",
402 |     "                       framework_version='1.14.0',\n",
403 |     "                       py_version='py3',\n",
404 |     "                       script_mode=True,                       \n",
405 |     "                       hyperparameters={'epochs': 1},\n",
406 |     "                       train_instance_count=1, \n",
407 |     "                       train_instance_type='local')\n",
408 |     "```\n",
409 |     "\n",
410 |     "*[Note] \n",
411 |     "Since August 2019, SageMaker training can also use EC2 Spot Instances, which can greatly reduce cost. For details, see the AWS blog post below.<br>\n",
412 |     "https://aws.amazon.com/ko/blogs/korea/managed-spot-training-save-up-to-90-on-your-amazon-sagemaker-training-jobs/*\n",
413 |     "\n",
414 |     "To train with Managed Spot Training, add the following lines right after the `train_instance_type` line of the Estimator.\n",
415 |     "```python\n",
416 |     "train_max_run = 3600,\n",
417 |     "train_use_spot_instances = True,\n",
418 |     "train_max_wait = 3600,\n",
419 |     "```"
420 |    ]
421 |   },
422 |   {
423 |    "cell_type": "code",
424 |    "execution_count": null,
425 |    "metadata": {},
426 |    "outputs": [],
427 |    "source": [
428 |     "from sagemaker.tensorflow import TensorFlow\n",
429 |     "estimator = TensorFlow(base_job_name='cifar10',\n",
430 |     "                       entry_point='cifar10_keras_sm.py',\n",
431 |     "                       source_dir='training_script',\n",
432 |     "                       role=role,\n",
433 |     "                       framework_version='1.14.0',\n",
434 |     "                       py_version='py3',\n",
435 |     "                       script_mode=True,                       \n",
436 |     "                       hyperparameters={'epochs': 5},\n",
437 |     "                       train_instance_count=1, \n",
438 |     "                       train_instance_type='ml.p2.xlarge')"
439 |    ]
440 |   },
441 |   {
442 |    "cell_type": "markdown",
443 |    "metadata": {},
444 |    "source": [
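445 |     "Run the training. This time, specify the S3 data location for each channel (`train, validation, eval`).<br>\n",
446 |     "After training completes, check the Billable seconds as well. Billable seconds is the time you are actually charged for when running the training.\n",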
447 |     "```\n",
448 |     "Billable seconds: <time>\n",
449 |     "```\n",
450 |     "\n",
451 |     "For reference, training for 5 epochs on an `ml.p2.xlarge` instance takes 6 to 7 minutes in total, of which the actual training takes 3 to 4 minutes."
452 |    ]
453 |   },
454 |   {
455 |    "cell_type": "code",
456 |    "execution_count": null,
457 |    "metadata": {},
458 |    "outputs": [],
459 |    "source": [
460 |     "%%time\n",
461 |     "estimator.fit({'train':'{}/train'.format(dataset_location),\n",
462 |     "              'validation':'{}/validation'.format(dataset_location),\n",
463 |     "              'eval':'{}/eval'.format(dataset_location)})"
464 |    ]
465 |   },
466 |   {
467 |    "cell_type": "markdown",
468 |    "metadata": {},
469 |    "source": [
470 |     "## Start a new SageMaker experiment\n",
471 |     "\n",
472 |     "Amazon SageMaker Experiments lets data scientists organize, track, compare, and evaluate machine learning experiments.\n",
473 |     "Machine learning is an iterative process. Data scientists need to experiment with combinations of data, algorithms, and parameters while observing incremental changes in model accuracy. This iterative process produces a large number of training runs and model versions, which makes it hard to keep track of the best-performing models and their input configurations. It also makes it harder to compare current experiments with past ones to find opportunities for further incremental improvement.\n",
474 |     "\n",
475 |     "**Amazon SageMaker Experiments automatically tracks the inputs, parameters, configurations, and results of each iteration as a trial.<br>\n",
476 |     "Data scientists can assign, group, and organize trials into experiments.**\n",
477 |     "Amazon SageMaker Experiments is integrated with Amazon SageMaker Studio, which lets you browse current and past experiments visually. Amazon SageMaker Studio also lets you compare trials on key evaluation metrics and identify the best-performing models.  \n"
478 |    ]
479 |   },
480 |   {
481 |    "cell_type": "markdown",
482 |    "metadata": {},
483 |    "source": [
484 |     "First, install sagemaker-experiments."
485 |    ]
486 |   },
487 |   {
488 |    "cell_type": "code",
489 |    "execution_count": null,
490 |    "metadata": {},
491 |    "outputs": [],
492 |    "source": [
493 |     "!pip install sagemaker-experiments"
494 |    ]
495 |   },
496 |   {
497 |    "cell_type": "markdown",
498 |    "metadata": {},
499 |    "source": [
500 |     "Now create an experiment."
501 |    ]
502 |   },
503 |   {
504 |    "cell_type": "code",
505 |    "execution_count": null,
506 |    "metadata": {},
507 |    "outputs": [],
508 |    "source": [
509 |     "from smexperiments.experiment import Experiment\n",
510 |     "from smexperiments.trial import Trial\n",
511 |     "import time\n",
512 |     "\n",
513 |     "# Create an experiment\n",
514 |     "cifar10_experiment = Experiment.create(\n",
515 |     "    experiment_name=\"TensorFlow-cifar10-experiment\",\n",
516 |     "    description=\"Classification of cifar10 images\")"
517 |    ]
518 |   },
519 |   {
520 |    "cell_type": "markdown",
521 |    "metadata": {},
522 |    "source": [
523 |     "Next, create a trial. This trial will run for 5 epochs on a GPU instance."
524 |    ]
525 |   },
526 |   {
527 |    "cell_type": "code",
528 |    "execution_count": null,
529 |    "metadata": {},
530 |    "outputs": [],
531 |    "source": [
532 |     "# Create a trial\n",
533 |     "trial_name = f\"cifar10-training-job-{int(time.time())}\"\n",
534 |     "trial = Trial.create(\n",
535 |     "    trial_name=trial_name, \n",
536 |     "    experiment_name=cifar10_experiment.experiment_name\n",
537 |     ")"
538 |    ]
539 |   },
540 |   {
541 |    "cell_type": "markdown",
542 |    "metadata": {},
543 |    "source": [
544 |     "Create a new estimator."
545 |    ]
546 |   },
547 |   {
548 |    "cell_type": "code",
549 |    "execution_count": null,
550 |    "metadata": {},
551 |    "outputs": [],
552 |    "source": [
553 |     "from sagemaker.tensorflow import TensorFlow\n",
554 |     "estimator = TensorFlow(base_job_name='cifar10',\n",
555 |     "                       entry_point='cifar10_keras_sm_solution.py',\n",
556 |     "                       source_dir='training_script',\n",
557 |     "                       role=role,\n",
558 |     "                       framework_version='1.14.0',\n",
559 |     "                       py_version='py3',\n",
560 |     "                       hyperparameters={'epochs' : 5},\n",
561 |     "                       train_instance_count=1, \n",
562 |     "                       train_instance_type='ml.p2.xlarge')"
563 |    ]
564 |   },
565 |   {
566 |    "cell_type": "markdown",
567 |    "metadata": {},
568 |    "source": [
569 |     "Next, use the S3 data location for each input data channel.\n",
570 |     "```python\n",
571 |     "dataset_location + '/train'\n",
572 |     "dataset_location + '/validation' \n",
573 |     "dataset_location + '/eval'\n",
574 |     "```\n",
575 |     "Pass the experiment config as a parameter of the fit function. This also associates the trial with the training job.\n",
576 |     "<br>A TrialComponent is one element of a trial. Here it refers to the \"Training\" component.\n",
577 |     "```python\n",
578 |     "experiment_config={\n",
579 |     "                  \"ExperimentName\": cifar10_experiment.experiment_name, \n",
580 |     "                  \"TrialName\": trial.trial_name,\n",
581 |     "                  \"TrialComponentDisplayName\": \"Training\"}\n",
582 |     "```"
583 |    ]
584 |   },
585 |   {
586 |    "cell_type": "code",
587 |    "execution_count": null,
588 |    "metadata": {},
589 |    "outputs": [],
590 |    "source": [
591 |     "estimator.fit({'train' :  dataset_location + '/train',\n",
592 |     "               'validation' :  dataset_location + '/validation',\n",
593 |     "               'eval' :  dataset_location + '/eval'\n",
594 |     "              },\n",
595 |     "              experiment_config={\n",
596 |     "                \"ExperimentName\": cifar10_experiment.experiment_name, \n",
597 |     "                \"TrialName\": trial.trial_name,\n",
598 |     "                \"TrialComponentDisplayName\": \"Training\"\n",
599 |     "              }\n",
600 |     "            )"
601 |    ]
602 |   },
603 |   {
604 |    "cell_type": "markdown",
605 |    "metadata": {},
606 |    "source": [
607 |     "## Analyze the experiments"
608 |    ]
609 |   },
610 |   {
611 |    "cell_type": "markdown",
612 |    "metadata": {},
613 |    "source": [
614 |     "Here we create a filter that finds only trial components whose DisplayName equals \"Training\". <br>\n",
615 |     "This matches the \"TrialComponentDisplayName\": \"Training\" setting configured above."
616 |    ]
617 |   },
618 |   {
619 |    "cell_type": "code",
620 |    "execution_count": null,
621 |    "metadata": {},
622 |    "outputs": [],
623 |    "source": [
624 |     "search_expression = {\n",
625 |     "    \"Filters\":[\n",
626 |     "        {\n",
627 |     "            \"Name\": \"DisplayName\",\n",
628 |     "            \"Operator\": \"Equals\",\n",
629 |     "            \"Value\": \"Training\",\n",
630 |     "        }\n",
631 |     "    ],\n",
632 |     "}"
633 |    ]
634 |   },
635 |   {
636 |    "cell_type": "markdown",
637 |    "metadata": {},
638 |    "source": [
639 |     "Pass the experiment name and the filter created above as parameters to ExperimentAnalytics."
640 |    ]
641 |   },
642 |   {
643 |    "cell_type": "code",
644 |    "execution_count": null,
645 |    "metadata": {},
646 |    "outputs": [],
647 |    "source": [
648 |     "import pandas as pd \n",
649 |     "pd.options.display.max_columns = 500\n",
650 |     "\n",
651 |     "from sagemaker.analytics import ExperimentAnalytics\n",
652 |     "trial_component_analytics = ExperimentAnalytics(\n",
653 |     "    sagemaker_session=sagemaker_session, \n",
654 |     "    experiment_name=cifar10_experiment.experiment_name,\n",
655 |     "    search_expression=search_expression\n",
656 |     ")\n",
657 |     "\n",
658 |     "table = trial_component_analytics.dataframe(force_refresh=True)\n",
659 |     "display(table)"
660 |    ]
661 |   },
662 |   {
663 |    "cell_type": "markdown",
664 |    "metadata": {},
665 |    "source": [
666 |     "### Clean up the Experiment\n",
667 |     "Experiment names must be unique within an account and region, so it is a good idea to delete experiments you no longer use.<br>\n",
668 |     "Pass the cifar10_experiment object created above to the cleanup function below to delete it.\n",
669 |     "This deletes the associated trial components and trials, and finally the experiment itself."
670 |    ]
671 |   },
672 |   {
673 |    "cell_type": "code",
674 |    "execution_count": null,
675 |    "metadata": {},
676 |    "outputs": [],
677 |    "source": [
678 |     "import boto3\n",
679 |     "\n",
680 |     "sess = boto3.Session()\n",
681 |     "sm = sess.client('sagemaker')\n",
682 |     "from smexperiments.trial_component import TrialComponent\n",
683 |     "\n",
684 |     "def cleanup(experiment):\n",
685 |     "    for trial_summary in experiment.list_trials():\n",
686 |     "        trial = Trial.load(sagemaker_boto_client=sm, trial_name=trial_summary.trial_name)\n",
687 |     "        for trial_component_summary in trial.list_trial_components():\n",
688 |     "            tc = TrialComponent.load(\n",
689 |     "                sagemaker_boto_client=sm,\n",
690 |     "                trial_component_name=trial_component_summary.trial_component_name)\n",
691 |     "            trial.remove_trial_component(tc)\n",
692 |     "            try:\n",
693 |     "                # comment out to keep trial components\n",
694 |     "                tc.delete()\n",
695 |     "            except Exception:\n",
696 |     "                # tc is associated with another trial\n",
697 |     "                continue\n",
698 |     "            # to prevent throttling\n",
699 |     "            time.sleep(.5)\n",
700 |     "        trial.delete()\n",
701 |     "    experiment.delete()\n",
702 |     "    print(\"The experiment is deleted\")\n",
703 |     "\n",
704 |     "\n",
705 |     "cleanup(cifar10_experiment)    "
706 |    ]
707 |   },
708 |   {
709 |    "cell_type": "markdown",
710 |    "metadata": {},
711 |    "source": [
712 |     "**Well done.** \n",
713 |     "\n",
714 |     "You have successfully trained for 5 epochs on a GPU instance in SageMaker.<br>\n",
715 |     "Before moving on to the next notebook, look at the Training jobs section of the SageMaker console, find the job you ran, and review its configuration.\n",
716 |     "\n",
717 |     "For more on script mode training, see the AWS blog post below.<br>\n",
718 |     "[Using TensorFlow eager execution with Amazon SageMaker script mode](https://aws.amazon.com/ko/blogs/machine-learning/using-tensorflow-eager-execution-with-amazon-sagemaker-script-mode/)"
719 |    ]
720 |   }
721 |  ],
722 |  "metadata": {
723 |   "kernelspec": {
724 |    "display_name": "conda_tensorflow_p36",
725 |    "language": "python",
726 |    "name": "conda_tensorflow_p36"
727 |   },
728 |   "language_info": {
729 |    "codemirror_mode": {
730 |     "name": "ipython",
731 |     "version": 3
732 |    },
733 |    "file_extension": ".py",
734 |    "mimetype": "text/x-python",
735 |    "name": "python",
736 |    "nbconvert_exporter": "python",
737 |    "pygments_lexer": "ipython3",
738 |    "version": "3.6.5"
739 |   },
740 |   "toc-autonumbering": false,
741 |   "toc-showcode": false,
742 |   "toc-showmarkdowntxt": false,
743 |   "toc-showtags": false
744 |  },
745 |  "nbformat": 4,
746 |  "nbformat_minor": 4
747 | }
748 | 


--------------------------------------------------------------------------------
/0_Running_TensorFlow_In_SageMaker_tf2.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# [Module 1] Train a Keras Sequential Model (TensorFlow 2.0)\n",
  8 |     "\n",
  9 |     "### [Note] This Jupyter notebook runs the hands-on lab with TensorFlow 2.0. Amazon SageMaker has supported TensorFlow 2.0 as a built-in deep learning container since January 2020.\n",
 10 |     "\n",
 11 |     "This notebook walks through, step by step, how to train a Keras Sequential model on SageMaker. The model used here is a simple deep CNN (Convolutional Neural Network), identical to the one introduced in [the Keras examples](https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py).\n",
 12 |     "- For reference, this model reaches a validation accuracy of about 75% after 25 epochs and about 79% after 50 epochs.\n",
 13 |     "- Due to time constraints, this workshop trains for only 5 epochs. (The Horovod-based distributed training, however, runs for 10 epochs.)"
 14 |    ]
 15 |   },
 16 |   {
 17 |    "cell_type": "markdown",
 18 |    "metadata": {},
 19 |    "source": [
 20 |     "## The dataset\n",
 21 |     "\n",
 22 |     "The [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) is one of the most popular datasets in machine learning.\n",
 23 |     "It consists of 60,000 32x32-pixel images in 10 classes (6,000 images per class).\n",
 24 |     "The figure below shows 10 randomly sampled images from each class. \n",
 25 |     "\n",
 26 |     "![cifar10](https://maet3608.github.io/nuts-ml/_images/cifar10.png)\n",
 27 |     "\n",
 28 |     "In this lab, you will train a deep CNN to perform image classification. In the following notebooks,\n",
 29 |     "you will compare the results of File Mode, Pipe Mode, and Horovod-based distributed training."
 30 |    ]
 31 |   },
 32 |   {
 33 |    "cell_type": "markdown",
 34 |    "metadata": {},
 35 |    "source": [
 36 |     "## Getting the data\n",
 37 |     "Use the AWS CLI (Command Line Interface) command below to copy the TFRecord dataset stored in S3 (Amazon Simple Storage Service) to your local notebook instance.\n",
 38 |     "The S3 path is `s3://floor28/data/cifar10`. \n",
 39 |     "\n",
 40 |     "### What is TFRecord?\n",
 41 |     "- It is the binary format that Google officially recommends when modeling with a TensorFlow backend.\n",
 42 |     "- It holds input data serialized as TensorFlow protocol buffers.\n",
 43 |     "- It is useful for streaming large datasets quickly with multithreading. (Because all the data is stored in a single block of memory, loading takes much less time than when the inputs are stored as separate files.)\n",
 44 |     "- It is a collection of Example objects (an array of Examples).\n",
 45 |     "- The figure below shows an example TFRecord made up of $n$ samples with $m$-dimensional features.\n",
 46 |     "\n",
 47 |     "![TFRecord](./images/TFRecord.png)"
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "code",
 52 |    "execution_count": null,
 53 |    "metadata": {},
 54 |    "outputs": [],
 55 |    "source": [
 56 |     "!pip install tensorflow==2.0.0"
 57 |    ]
 58 |   },
 59 |   {
 60 |    "cell_type": "code",
 61 |    "execution_count": null,
 62 |    "metadata": {},
 63 |    "outputs": [],
 64 |    "source": [
 65 |     "import tensorflow as tf\n",
 66 |     "import numpy as np\n",
 67 |     "print(\"Num GPUs Available: \", len(tf.config.experimental.list_physical_devices('GPU')))"
 68 |    ]
 69 |   },
 70 |   {
 71 |    "cell_type": "code",
 72 |    "execution_count": null,
 73 |    "metadata": {},
 74 |    "outputs": [],
 75 |    "source": [
 76 |     "!aws s3 cp --recursive s3://floor28/data/cifar10 ./data"
 77 |    ]
 78 |   },
 79 |   {
 80 |    "cell_type": "markdown",
 81 |    "metadata": {},
 82 |    "source": [
 83 |     "## Run the training locally"
 84 |    ]
 85 |   },
 86 |   {
 87 |    "cell_type": "markdown",
 88 |    "metadata": {},
 89 |    "source": [
 90 |     "The script takes the arguments needed for model training, listed below.\n",
 91 |     "\n",
 92 |     "1. `model_dir` - the path where logs and checkpoints are saved\n",
 93 |     "2. `train, validation, eval` - the paths where the TFRecord datasets are stored\n",
 94 |     "3. `epochs` - the number of epochs\n",
 95 |     "\n",
 96 |     "Use the command below to train for just 1 epoch in the local notebook instance environment **<font color='red'>without any SageMaker API calls</font>**. For reference, this takes between 2 min 20 s and 2 min 40 s on a MacBook Pro (15-inch, 2018) with a 2.6GHz Core i7 and 16GB of memory."
 97 |    ]
 98 |   },
 99 |   {
100 |    "cell_type": "code",
101 |    "execution_count": null,
102 |    "metadata": {},
103 |    "outputs": [],
104 |    "source": [
105 |     "%%time\n",
106 |     "!mkdir -p logs\n",
107 |     "!python training_script/cifar10_keras_tf2.py --model_dir ./logs \\\n",
108 |     "                                         --train data/train \\\n",
109 |     "                                         --validation data/validation \\\n",
110 |     "                                         --eval data/eval \\\n",
111 |     "                                         --epochs 1\n",
112 |     "!rm -rf logs"
113 |    ]
114 |   },
115 |   {
116 |    "cell_type": "markdown",
117 |    "metadata": {},
118 |    "source": [
119 |     "**<font color='blue'>This script is running in a notebook on SageMaker, but it runs the same way on your local computer as long as Python and Jupyter Notebook are properly installed.</font>**"
120 |    ]
121 |   },
122 |   {
123 |    "cell_type": "markdown",
124 |    "metadata": {},
125 |    "source": [
126 |     "## Use TensorFlow Script Mode\n",
127 |     "\n",
128 |     "The Amazon SageMaker Python SDK supports **script mode** for TensorFlow versions 1.11 and later. Script mode has the following advantages over the older legacy mode.\n",
129 |     "\n",
130 |     "* A script mode training script looks much more like a training script you would normally write for TensorFlow, so you can run existing TensorFlow training scripts with minimal changes. Adapting a TensorFlow training script is therefore easier than in legacy mode. \n",
131 |     "    - Legacy mode requires the following functions, which are based on the TensorFlow Estimator API.\n",
132 |     "        - One of the functions below must be included.\n",
133 |     "            - `model_fn`: defines the model to train.\n",
134 |     "            - `keras_model_fn`: defines the tf.keras model to train.\n",
135 |     "            - `estimator_fn`: defines the tf.estimator.Estimator to train.\n",
136 |     "        - `train_input_fn`: loads and preprocesses the training data. \n",
137 |     "        - `eval_input_fn`: loads and preprocesses the validation data.\n",
138 |     "        - (Optional) `serving_input_fn`: defines the features passed to the model during prediction. This function is used only during training, but it is required when deploying the model to a SageMaker endpoint.\n",
139 |     "    - You cannot define an `if __name__ == \"__main__\":` block, which makes debugging difficult.\n",
140 |     "    \n",
141 |     "* Script mode supports Python 2.7 and Python 3.6.\n",
142 |     "\n",
143 |     "* Script mode also **supports Horovod-based distributed training**.\n",
144 |     "\n",
145 |     "For details on writing a training script for TensorFlow script mode and on using the script mode estimator and model, see the documentation at\n",
146 |     "https://sagemaker.readthedocs.io/en/stable/using_tf.html"
147 |    ]
148 |   },
149 |   {
150 |    "cell_type": "markdown",
151 |    "metadata": {},
152 |    "source": [
153 |     "### Preparing your script for training in SageMaker\n",
154 |     "\n",
155 |     "A SageMaker script mode training script is very similar to a training script that you could run outside SageMaker.\n",
156 |     "SageMaker runs the training script with a single argument, `model_dir`, an S3 path used for logs and model artifacts.\n",
157 |     "\n",
158 |     "On a SageMaker training instance, the data stored in S3 is downloaded into the training container and used for training. The S3 data paths are mapped to paths inside the container through container environment variables.\n",
159 |     "\n",
160 |     "You can access useful properties of the training environment through various environment variables.\n",
161 |     "For this script, three data channels named `train, validation, eval` are sent to the script.\n",
162 |     "\n",
163 |     "**Make a copy of `training_script/cifar10_keras_tf2.py` and save it as `training_script/cifar10_keras_sm_tf2.py`.**\n",
164 |     "\n",
165 |     "Once you have created the copy, work through the tasks below one step at a time.\n",
166 |     "\n",
167 |     "----\n",
168 |     "### TODO 1.\n",
169 |     "In `cifar10_keras_sm_tf2.py`, modify the train, validation, and eval arguments so that their default values come from the SageMaker environment variables SM_CHANNEL_TRAIN, SM_CHANNEL_VALIDATION, and SM_CHANNEL_EVAL. \n",
170 |     "\n",
171 |     "Modify the arguments below inside the `if __name__ == '__main__':` block of `cifar10_keras_sm_tf2.py`.\n",
172 |     "\n",
173 |     "```python\n",
174 |     "parser.add_argument(\n",
175 |     "        '--train',\n",
176 |     "        type=str,\n",
177 |     "        required=False,\n",
178 |     "        default=os.environ.get('SM_CHANNEL_TRAIN'), # <-- modified\n",
179 |     "        help='The directory where the CIFAR-10 input data is stored.')\n",
180 |     "parser.add_argument(\n",
181 |     "        '--validation',\n",
182 |     "        type=str,\n",
183 |     "        required=False,\n",
184 |     "        default=os.environ.get('SM_CHANNEL_VALIDATION'), # <-- modified\n",
185 |     "        help='The directory where the CIFAR-10 input data is stored.')\n",
186 |     "parser.add_argument(\n",
187 |     "        '--eval',\n",
188 |     "        type=str,\n",
189 |     "        required=False,\n",
190 |     "        default=os.environ.get('SM_CHANNEL_EVAL'), # <-- modified\n",
191 |     "        help='The directory where the CIFAR-10 input data is stored.')\n",
192 |     "```\n",
193 |     "\n",
194 |     "\n",
195 |     "The table below shows the S3 path and container path for each environment variable.\n",
196 |     "\n",
197 |     "|  S3 path  |  Environment variable  |  Container path  |\n",
198 |     "| :---- | :---- | :----| \n",
199 |     "|  s3://bucket_name/prefix/train  |  `SM_CHANNEL_TRAIN`  | `/opt/ml/input/data/train`  |\n",
200 |     "|  s3://bucket_name/prefix/validation  |  `SM_CHANNEL_VALIDATION`  | `/opt/ml/input/data/validation`  |\n",
201 |     "|  s3://bucket_name/prefix/eval  |  `SM_CHANNEL_EVAL`  | `/opt/ml/input/data/eval`  |\n",
202 |     "|  s3://bucket_name/prefix/model.tar.gz  |  `SM_MODEL_DIR`  |  `/opt/ml/model`  |\n",
203 |     "|  s3://bucket_name/prefix/output.tar.gz  |  `SM_OUTPUT_DATA_DIR`  |  `/opt/ml/output/data`  |\n",
204 |     "\n",
205 |     "For example, `/opt/ml/input/data/train` is the directory inside the container where the training data is downloaded.\n",
206 |     "\n",
207 |     "For details, see the SageMaker Python SDK documentation below.<br>\n",
208 |     "(https://sagemaker.readthedocs.io/en/stable/using_tf.html#preparing-a-script-mode-training-script)\n",
209 |     "\n",
210 |     "\n",
211 |     "SageMaker does not pass the train, validation, and eval paths directly as arguments; instead, the script reads them from environment variables and marks those arguments as not required.\n",
212 |     "\n",
213 |     "SageMaker passes useful environment variables to your training script. Some examples are listed below.\n",
214 |     "* `SM_MODEL_DIR`: a string giving the local path where the training job can save model artifacts. After training, the model artifacts in this path are uploaded to S3 for model hosting. Note that this is different from the model_dir argument passed to the training script, which is an S3 location. SM_MODEL_DIR is always set to `/opt/ml/model`.\n",
215 |     "* `SM_NUM_GPUS`: an integer giving the number of GPUs available on the host.\n",
216 |     "* `SM_OUTPUT_DATA_DIR`: a string giving the path of the directory where output artifacts are saved. Output artifacts can include checkpoints, graphs, and other files to save, but not model artifacts. These output artifacts are compressed and uploaded to S3 under the same S3 prefix as the model artifacts.\n",
217 |     "\n",
218 |     "This sample code saves model checkpoints to the local environment to reduce network latency. They can be uploaded to S3 after training finishes.\n",
219 |     "\n",
220 |     "----\n",
221 |     "### TODO 2.\n",
222 |     "\n",
223 |     "Add the argument below inside the `if __name__ == '__main__':` block of `cifar10_keras_sm_tf2.py`.\n",
224 |     "\n",
225 |     "```python\n",
226 |     "parser.add_argument(\n",
227 |     "        '--model_output_dir',\n",
228 |     "        type=str,\n",
229 |     "        default=os.environ.get('SM_MODEL_DIR'))\n",
230 |     "```\n",
231 |     "\n",
232 |     "----\n",
233 |     "### TODO 3.\n",
234 |     "Change the save path of the `ModelCheckpoint` callback to the new path, as shown below.\n",
235 |     "\n",
236 |     "From:\n",
237 |     "```python\n",
238 |     "callbacks.append(ModelCheckpoint(args.model_dir + '/checkpoint-{epoch}.h5'))\n",
239 |     "```\n",
240 |     "To:\n",
241 |     "```python\n",
242 |     "callbacks.append(ModelCheckpoint(args.model_output_dir + '/checkpoint-{epoch}.h5'))\n",
243 |     "```\n",
244 |     "\n",
245 |     "----\n",
246 |     "### TODO 4.\n",
247 |     "Change the argument of the `save_model` function as shown below.\n",
248 |     "\n",
249 |     "From:  \n",
250 |     "```python\n",
251 |     "return save_model(model, args.model_dir)\n",
252 |     "```\n",
253 |     "To:  \n",
254 |     "```python\n",
255 |     "return save_model(model, args.model_output_dir)\n",
256 |     "```\n",
257 |     "\n",
258 |     "<font color='blue'>**If you have trouble with this lab, refer to the solution file `training_script/cifar10_keras_sm_tf2_solution.py`.**</font>"
259 |    ]
260 |   },
261 |   {
262 |    "cell_type": "markdown",
263 |    "metadata": {},
264 |    "source": [
265 |     "### Test your script locally (just like on your laptop)\n",
266 |     "\n",
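267 |     "To test, run the new script with the same command as above and check that it runs as expected. <br>\n",
268 |     "When you call the SageMaker TensorFlow API, the environment variables are passed automatically, but when testing in a local Jupyter notebook you must set them manually. (See the example code below.)\n",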
269 |     "\n",
270 |     "```python\n",
271 |     "%env SM_MODEL_DIR=./logs\n",
272 |     "```"
273 |    ]
274 |   },
275 |   {
276 |    "cell_type": "code",
277 |    "execution_count": null,
278 |    "metadata": {},
279 |    "outputs": [],
280 |    "source": [
281 |     "%%time\n",
282 |     "!mkdir -p logs   \n",
283 |     "\n",
284 |     "# Number of GPUs on this machine\n",
285 |     "%env SM_NUM_GPUS=1\n",
286 |     "# Where to save the model\n",
287 |     "%env SM_MODEL_DIR=./logs\n",
288 |     "# Where the training data is\n",
289 |     "%env SM_CHANNEL_TRAIN=data/train\n",
290 |     "# Where the validation data is\n",
291 |     "%env SM_CHANNEL_VALIDATION=data/validation\n",
292 |     "# Where the evaluation data is\n",
293 |     "%env SM_CHANNEL_EVAL=data/eval\n",
294 |     "\n",
295 |     "!python training_script/cifar10_keras_sm_tf2.py --model_dir ./logs --epochs 1\n",
296 |     "!rm -rf logs"
297 |    ]
298 |   },
299 |   {
300 |    "cell_type": "markdown",
301 |    "metadata": {},
302 |    "source": [
303 |     "### Use SageMaker local for local testing\n",
304 |     "\n",
305 |     "Before starting training in earnest, first debug using local mode. Local mode does not create a training instance; it pulls the container onto the local instance and starts training right away, so you can validate your code more quickly.\n",
306 |     "\n",
307 |     "With local mode in the Amazon SageMaker Python SDK, you can emulate CPU (single- and multi-instance) and GPU (single-instance) SageMaker training jobs by changing a single argument in the TensorFlow or MXNet estimator. \n",
308 |     "\n",
309 |     "Local mode training requires docker-compose, or nvidia-docker-compose for GPU instances. The code cell below installs and configures docker-compose or nvidia-docker-compose in this notebook environment. \n",
310 |     " \n",
311 |     "Local mode training also makes it easy to monitor metrics such as GPU utilization, so you can check that your code is making good use of the hardware you are running on."
312 |    ]
313 |   },
314 |   {
315 |    "cell_type": "code",
316 |    "execution_count": null,
317 |    "metadata": {},
318 |    "outputs": [],
319 |    "source": [
320 |     "!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/local_mode_setup.sh\n",
321 |     "!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/daemon.json    \n",
322 |     "!/bin/bash ./local_mode_setup.sh"
323 |    ]
324 |   },
325 |   {
326 |    "cell_type": "code",
327 |    "execution_count": null,
328 |    "metadata": {},
329 |    "outputs": [],
330 |    "source": [
331 |     "import os\n",
332 |     "import sagemaker\n",
333 |     "from sagemaker import get_execution_role\n",
334 |     "\n",
335 |     "sagemaker_session = sagemaker.Session()\n",
336 |     "\n",
337 |     "role = get_execution_role()"
338 |    ]
339 |   },
340 |   {
341 |    "cell_type": "markdown",
342 |    "metadata": {},
343 |    "source": [
344 |     "When you call `estimator.fit()` to start the training job, the Amazon SageMaker TensorFlow container is downloaded from Amazon ECR to the local notebook instance.\n",
345 |     "\n",
346 |     "Create a TensorFlow Estimator instance from the SageMaker Python SDK using the `sagemaker.tensorflow.TensorFlow` class.\n",
347 |     "You can change hyperparameters and various other settings through its arguments.\n",
348 |     "\n",
349 |     "\n",
350 |     "For details, see the [documentation](https://sagemaker.readthedocs.io/en/stable/using_tf.html#training-with-tensorflow-estimator)."
351 |    ]
352 |   },
353 |   {
354 |    "cell_type": "code",
355 |    "execution_count": null,
356 |    "metadata": {},
357 |    "outputs": [],
358 |    "source": [
359 |     "from sagemaker.tensorflow import TensorFlow\n",
360 |     "estimator = TensorFlow(base_job_name='cifar10',\n",
361 |     "                       entry_point='cifar10_keras_sm_tf2.py',\n",
362 |     "                       source_dir='training_script',\n",
363 |     "                       role=role,\n",
364 |     "                       framework_version='2.0.0',\n",
365 |     "                       py_version='py3',\n",
366 |     "                       script_mode=True,\n",
367 |     "                       hyperparameters={'epochs' : 1},\n",
368 |     "                       train_instance_count=1, \n",
369 |     "                       train_instance_type='local')"
370 |    ]
371 |   },
372 |   {
373 |    "cell_type": "markdown",
374 |    "metadata": {},
375 |    "source": [
376 |     "Specify the three channels and data paths to use for training. **Because this runs in local mode, specify paths on the notebook instance instead of S3 paths.**"
377 |    ]
378 |   },
379 |   {
380 |    "cell_type": "code",
381 |    "execution_count": null,
382 |    "metadata": {},
383 |    "outputs": [],
384 |    "source": [
385 |     "%%time\n",
386 |     "estimator.fit({'train': 'file://data/train',\n",
387 |     "               'validation': 'file://data/validation',\n",
388 |     "               'eval': 'file://data/eval'})"
389 |    ]
390 |   },
391 |   {
392 |    "cell_type": "markdown",
393 |    "metadata": {},
394 |    "source": [
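395 |     "The first time the Estimator runs, it has to download the container image from the Amazon ECR repository, but training can then start immediately. In other words, there is no need to wait for a separate training cluster to be provisioned. In addition, on subsequent runs, which you may need when iterating and testing, changes to your MXNet or TensorFlow script start running right away."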
396 |    ]
397 |   },
398 |   {
399 |    "cell_type": "markdown",
400 |    "metadata": {},
401 |    "source": [
402 |     "### Using SageMaker for faster training time\n",
403 |     "\n",
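404 |     "This time, instead of using local mode, create a GPU training instance for SageMaker training to shorten the training time.<br>\n",
405 |     "The differences from local mode are that (1) `train_instance_type` must be set to the specific instance type you want instead of `'local'`, and (2) the training data must be uploaded to Amazon S3 and the training paths set to S3 paths. \n",
406 |     "\n",
407 |     "The SageMaker SDK provides a simple function for uploading to S3, `Session.upload_data()`. Its return value is the S3 path where the data is stored.\n",
408 |     "If you need finer-grained control, you can use boto3 instead of the SageMaker SDK.\n",
409 |     "\n",
410 |     "*[Note]: Amazon EFS and Amazon FSx for Lustre are also supported for high-performance workloads. For more information, see the AWS blog post below.<br>\n",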
411 |     "https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems/*"
412 |    ]
413 |   },
414 |   {
415 |    "cell_type": "code",
416 |    "execution_count": null,
417 |    "metadata": {},
418 |    "outputs": [],
419 |    "source": [
420 |     "dataset_location = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-cifar10')\n",
421 |     "display(dataset_location)"
422 |    ]
423 |   },
424 |   {
425 |    "cell_type": "markdown",
426 |    "metadata": {},
427 |    "source": [
428 |     "Once the data has been uploaded to S3, create a new Estimator. <br>\n",
429 |     "Copy the code below as is, then change `train_instance_type='local'` to `train_instance_type='ml.p2.xlarge'` and\n",
430 |     "`hyperparameters={'epochs': 1}` to `hyperparameters={'epochs': 5}`.\n",
431 |     "\n",
432 |     "```python\n",
433 |     "from sagemaker.tensorflow import TensorFlow\n",
434 |     "estimator = TensorFlow(base_job_name='cifar10',\n",
435 |     "                       entry_point='cifar10_keras_sm_tf2.py',\n",
436 |     "                       source_dir='training_script',\n",
437 |     "                       role=role,\n",
438 |     "                       framework_version='2.0.0',\n",
439 |     "                       py_version='py3',\n",
440 |     "                       script_mode=True,                       \n",
441 |     "                       hyperparameters={'epochs': 1},\n",
442 |     "                       train_instance_count=1, \n",
443 |     "                       train_instance_type='local')\n",
444 |     "```\n",
445 |     "\n",
446 |     "*[Note] \n",
447 |     "Since August 2019, SageMaker training can also use EC2 spot instances, which can greatly reduce cost. For more information, see the AWS blog post below.<br>\n",
448 |     "https://aws.amazon.com/ko/blogs/korea/managed-spot-training-save-up-to-90-on-your-amazon-sagemaker-training-jobs/*\n",
449 |     "\n",
450 |     "To train with Managed Spot Instances, add the following code on the line after train_instance_type in the Estimator.\n",
451 |     "```python\n",
452 |     "train_max_run = 3600,\n",
453 |     "train_use_spot_instances = True,\n",
454 |     "train_max_wait = 3600,\n",
455 |     "```"
456 |    ]
457 |   },
458 |   {
459 |    "cell_type": "code",
460 |    "execution_count": null,
461 |    "metadata": {},
462 |    "outputs": [],
463 |    "source": [
464 |     "from sagemaker.tensorflow import TensorFlow\n",
465 |     "estimator = TensorFlow(base_job_name='cifar10',\n",
466 |     "                       entry_point='cifar10_keras_sm_tf2.py',\n",
467 |     "                       source_dir='training_script',\n",
468 |     "                       role=role,\n",
469 |     "                       framework_version='2.0.0',\n",
470 |     "                       py_version='py3',\n",
471 |     "                       script_mode=True,                       \n",
472 |     "                       hyperparameters={'epochs': 5},\n",
473 |     "                       train_instance_count=1, \n",
474 |     "                       train_instance_type='ml.p2.xlarge')"
475 |    ]
476 |   },
477 |   {
478 |    "cell_type": "markdown",
479 |    "metadata": {},
480 |    "source": [
481 |     "Run the training. This time, specify the S3 data location for each channel (`train, validation, eval`).<br>\n",
482 |     "After training completes, also check the Billable seconds. Billable seconds is the time you are actually charged for the training run.\n",
483 |     "```\n",
484 |     "Billable seconds: <time>\n",
485 |     "```\n",
486 |     "\n",
487 |     "For reference, 5 epochs on an `ml.p2.xlarge` instance take 6 to 7 minutes in total, of which 3 to 4 minutes are spent on the actual training."
488 |    ]
489 |   },
490 |   {
491 |    "cell_type": "code",
492 |    "execution_count": null,
493 |    "metadata": {},
494 |    "outputs": [],
495 |    "source": [
496 |     "%%time\n",
497 |     "estimator.fit({'train':'{}/train'.format(dataset_location),\n",
498 |     "              'validation':'{}/validation'.format(dataset_location),\n",
499 |     "              'eval':'{}/eval'.format(dataset_location)})"
500 |    ]
501 |   },
502 |   {
503 |    "cell_type": "markdown",
504 |    "metadata": {},
505 |    "source": [
506 |     "## Start a new SageMaker experiment\n",
507 |     "\n",
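508 |     "Amazon SageMaker Experiments lets data scientists organize, track, compare, and evaluate machine learning experiments.\n",
509 |     "Machine learning is an iterative process. Data scientists need to experiment with combinations of data, algorithms, and parameters while observing incremental changes in model accuracy. This iterative process produces a large number of training runs and model versions, which makes it hard to keep track of the best-performing models and their input configurations. It also makes it harder to compare current experiments with past ones to find opportunities for further incremental improvement. \n",
510 |     "\n",
511 |     "**Amazon SageMaker Experiments automatically tracks the inputs, parameters, configurations, and results of your iterations as trials.<br>\n",
512 |     "Data scientists can assign, group, and organize these trials into experiments.**\n",
513 |     "Amazon SageMaker Experiments is integrated with Amazon SageMaker Studio, which lets you visually browse current and past experiments. Amazon SageMaker Studio also lets you compare trials on key evaluation metrics and identify the best-performing models.  \n"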
514 |    ]
515 |   },
516 |   {
517 |    "cell_type": "markdown",
518 |    "metadata": {},
519 |    "source": [
520 |     "First, install the `sagemaker-experiments` package."
521 |    ]
522 |   },
523 |   {
524 |    "cell_type": "code",
525 |    "execution_count": null,
526 |    "metadata": {},
527 |    "outputs": [],
528 |    "source": [
529 |     "!pip install sagemaker-experiments"
530 |    ]
531 |   },
532 |   {
533 |    "cell_type": "markdown",
534 |    "metadata": {},
535 |    "source": [
536 |     "Now create an experiment."
537 |    ]
538 |   },
539 |   {
540 |    "cell_type": "code",
541 |    "execution_count": null,
542 |    "metadata": {},
543 |    "outputs": [],
544 |    "source": [
545 |     "from smexperiments.experiment import Experiment\n",
546 |     "from smexperiments.trial import Trial\n",
547 |     "import time\n",
548 |     "\n",
549 |     "# Create an experiment\n",
550 |     "cifar10_experiment = Experiment.create(\n",
551 |     "    experiment_name=\"TensorFlow-cifar10-experiment\",\n",
552 |     "    description=\"Classification of cifar10 images\")"
553 |    ]
554 |   },
555 |   {
556 |    "cell_type": "markdown",
557 |    "metadata": {},
558 |    "source": [
559 |     "Next, create a trial. This trial will run for 5 epochs on a GPU instance."
560 |    ]
561 |   },
562 |   {
563 |    "cell_type": "code",
564 |    "execution_count": null,
565 |    "metadata": {},
566 |    "outputs": [],
567 |    "source": [
568 |     "# Create a trial\n",
569 |     "trial_name = f\"cifar10-training-job-{int(time.time())}\"\n",
570 |     "trial = Trial.create(\n",
571 |     "    trial_name=trial_name, \n",
572 |     "    experiment_name=cifar10_experiment.experiment_name\n",
573 |     ")"
574 |    ]
575 |   },
576 |   {
577 |    "cell_type": "markdown",
578 |    "metadata": {},
579 |    "source": [
580 |     "Create a new estimator."
581 |    ]
582 |   },
583 |   {
584 |    "cell_type": "code",
585 |    "execution_count": null,
586 |    "metadata": {},
587 |    "outputs": [],
588 |    "source": [
589 |     "from sagemaker.tensorflow import TensorFlow\n",
590 |     "estimator = TensorFlow(base_job_name='cifar10',\n",
591 |     "                       entry_point='cifar10_keras_sm_tf2.py',\n",
592 |     "                       source_dir='training_script',\n",
593 |     "                       role=role,\n",
594 |     "                       framework_version='2.0.0',\n",
595 |     "                       py_version='py3',\n",
596 |     "                       hyperparameters={'epochs' : 5},\n",
597 |     "                       train_instance_count=1, \n",
598 |     "                       train_instance_type='ml.p2.xlarge')"
599 |    ]
600 |   },
601 |   {
602 |    "cell_type": "markdown",
603 |    "metadata": {},
604 |    "source": [
605 |     "Next, use the S3 data location for each input data channel.\n",
606 |     "```python\n",
607 |     "dataset_location + '/train'\n",
608 |     "dataset_location + '/validation' \n",
609 |     "dataset_location + '/eval'\n",
610 |     "```\n",
611 |     "Add the experiment config as a parameter to the fit function. This associates the trial with the training job.\n",
612 |     "<br>A TrialComponent is one element of a trial. Here it refers to the \"Training\" component.\n",
613 |     "```python\n",
614 |     "experiment_config={\n",
615 |     "                  \"ExperimentName\": cifar10_experiment.experiment_name, \n",
616 |     "                  \"TrialName\": trial.trial_name,\n",
617 |     "                  \"TrialComponentDisplayName\": \"Training\"}\n",
618 |     "```"
619 |    ]
620 |   },
621 |   {
622 |    "cell_type": "code",
623 |    "execution_count": null,
624 |    "metadata": {},
625 |    "outputs": [],
626 |    "source": [
627 |     "estimator.fit({'train' :  dataset_location + '/train',\n",
628 |     "               'validation' :  dataset_location + '/validation',\n",
629 |     "               'eval' :  dataset_location + '/eval'\n",
630 |     "              },\n",
631 |     "              experiment_config={\n",
632 |     "                \"ExperimentName\": cifar10_experiment.experiment_name, \n",
633 |     "                \"TrialName\": trial.trial_name,\n",
634 |     "                \"TrialComponentDisplayName\": \"Training\"\n",
635 |     "              }\n",
636 |     "            )"
637 |    ]
638 |   },
639 |   {
640 |    "cell_type": "markdown",
641 |    "metadata": {},
642 |    "source": [
643 |     "## Analyze the experiments"
644 |    ]
645 |   },
646 |   {
647 |    "cell_type": "markdown",
648 |    "metadata": {},
649 |    "source": [
650 |     "Here we create a filter that matches only trial components whose DisplayName equals \"Training\". <br>\n",
651 |     "It finds the component labeled with `\"TrialComponentDisplayName\": \"Training\"` above."
652 |    ]
653 |   },
654 |   {
655 |    "cell_type": "code",
656 |    "execution_count": null,
657 |    "metadata": {},
658 |    "outputs": [],
659 |    "source": [
660 |     "search_expression = {\n",
661 |     "    \"Filters\":[\n",
662 |     "        {\n",
663 |     "            \"Name\": \"DisplayName\",\n",
664 |     "            \"Operator\": \"Equals\",\n",
665 |     "            \"Value\": \"Training\",\n",
666 |     "        }\n",
667 |     "    ],\n",
668 |     "}"
669 |    ]
670 |   },
671 |   {
672 |    "cell_type": "markdown",
673 |    "metadata": {},
674 |    "source": [
675 |     "Pass the experiment name and the filter created above as parameters to ExperimentAnalytics."
676 |    ]
677 |   },
678 |   {
679 |    "cell_type": "code",
680 |    "execution_count": null,
681 |    "metadata": {},
682 |    "outputs": [],
683 |    "source": [
684 |     "import pandas as pd \n",
685 |     "pd.options.display.max_columns = 500\n",
686 |     "\n",
687 |     "from sagemaker.analytics import ExperimentAnalytics\n",
688 |     "trial_component_analytics = ExperimentAnalytics(\n",
689 |     "    sagemaker_session=sagemaker_session, \n",
690 |     "    experiment_name=cifar10_experiment.experiment_name,\n",
691 |     "    search_expression=search_expression\n",
692 |     ")\n",
693 |     "\n",
694 |     "table = trial_component_analytics.dataframe(force_refresh=True)\n",
695 |     "display(table)"
696 |    ]
697 |   },
698 |   {
699 |    "cell_type": "markdown",
700 |    "metadata": {},
701 |    "source": [
702 |     "### Clean up the Experiment\n",
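703 |     "Experiment names must be unique within an account and Region, so it is a good idea to delete experiments you no longer use.<br>\n",
704 |     "Pass the cifar10_experiment object created above to the cleanup function below to delete it.\n",
705 |     "This deletes the associated trial components and trials, and finally the experiment itself."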
706 |    ]
707 |   },
708 |   {
709 |    "cell_type": "code",
710 |    "execution_count": null,
711 |    "metadata": {},
712 |    "outputs": [],
713 |    "source": [
714 |     "import boto3\n",
715 |     "\n",
716 |     "sess = boto3.Session()\n",
717 |     "sm = sess.client('sagemaker')\n",
718 |     "from smexperiments.trial_component import TrialComponent\n",
719 |     "\n",
720 |     "def cleanup(experiment):\n",
721 |     "    for trial_summary in experiment.list_trials():\n",
722 |     "        trial = Trial.load(sagemaker_boto_client=sm, trial_name=trial_summary.trial_name)\n",
723 |     "        for trial_component_summary in trial.list_trial_components():\n",
724 |     "            tc = TrialComponent.load(\n",
725 |     "                sagemaker_boto_client=sm,\n",
726 |     "                trial_component_name=trial_component_summary.trial_component_name)\n",
727 |     "            trial.remove_trial_component(tc)\n",
728 |     "            try:\n",
729 |     "                # comment out to keep trial components\n",
730 |     "                tc.delete()\n",
731 |     "            except Exception:\n",
732 |     "                # tc is associated with another trial\n",
733 |     "                continue\n",
734 |     "            # to prevent throttling\n",
735 |     "            time.sleep(.5)\n",
736 |     "        trial.delete()\n",
737 |     "    experiment.delete()\n",
738 |     "    print(\"The experiment is deleted\")\n",
739 |     "\n",
740 |     "\n",
741 |     "cleanup(cifar10_experiment)    "
742 |    ]
743 |   },
744 |   {
745 |    "cell_type": "markdown",
746 |    "metadata": {},
747 |    "source": [
748 |     "**Well done.** \n",
749 |     "\n",
750 |     "You have successfully trained the model for 5 epochs on a GPU instance in SageMaker.<br>\n",
751 |     "Before moving on to the next notebook, open the Training jobs section of the SageMaker console, find the job you ran, and review its configuration.\n",
752 |     "\n",
753 |     "For more details on script mode training, see the AWS blog post below.<br>\n",
754 |     "[Using TensorFlow eager execution with Amazon SageMaker script mode](https://aws.amazon.com/ko/blogs/machine-learning/using-tensorflow-eager-execution-with-amazon-sagemaker-script-mode/)"
755 |    ]
756 |   }
757 |  ],
758 |  "metadata": {
759 |   "kernelspec": {
760 |    "display_name": "conda_tensorflow_p36",
761 |    "language": "python",
762 |    "name": "conda_tensorflow_p36"
763 |   },
764 |   "language_info": {
765 |    "codemirror_mode": {
766 |     "name": "ipython",
767 |     "version": 3
768 |    },
769 |    "file_extension": ".py",
770 |    "mimetype": "text/x-python",
771 |    "name": "python",
772 |    "nbconvert_exporter": "python",
773 |    "pygments_lexer": "ipython3",
774 |    "version": "3.6.6"
775 |   }
776 |  },
777 |  "nbformat": 4,
778 |  "nbformat_minor": 4
779 | }
780 | 


--------------------------------------------------------------------------------
/1_Monitoring_your_TensorFlow_scripts.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# [Module 2] Monitor and Analyze Training Jobs Using Metrics\n",
  8 |     "\n",
  9 |     "An Amazon SageMaker training job is an iterative process that teaches a model to make predictions by presenting it with examples from a training dataset.\n",
 10 |     "Typically, a training algorithm computes several metrics, such as training error and prediction accuracy. These metrics help you diagnose whether the model is learning well and whether it will generalize to unseen data.\n",
 11 |     "\n",
 12 |     "The training algorithm writes the values of these metrics to logs, which Amazon SageMaker monitors and sends to Amazon CloudWatch in real time.\n",
 13 |     "\n",
 14 |     "For Amazon SageMaker to parse the logs of a custom algorithm and send the metrics it produces to CloudWatch, you must specify the metrics to send when you configure the training job.<br>\n",
 15 |     "For each metric, you specify a name and a regular expression. Amazon SageMaker uses the regular expression to find the metric in the logs.\n",
 16 |     "\n",
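    |     "As a minimal sketch of how a metric regular expression works, the snippet below applies two patterns to a made-up log line with Python's `re` module. The log line, the `extract_metrics` helper, and the two definitions are illustrative assumptions, not output or code from an actual training job. The value of each metric is whatever the capture group `()` matches.\n",
    |     "\n",
    |     "```python\n",
    |     "import re\n",
    |     "\n",
    |     "# Hypothetical log line in the style of Keras progress output (an assumption).\n",
    |     "log_line = 'step 390/390 - loss: 1.2345 - acc: 0.5678 - val_loss: 1.1111 - val_acc: 0.6000'\n",
    |     "\n",
    |     "# Each metric has a name and a regex with one capture group for the value.\n",
    |     "metric_definitions = [\n",
    |     "    {'Name': 'train:loss', 'Regex': 'loss: (.*?) '},\n",
    |     "    {'Name': 'validation:loss', 'Regex': 'val_loss: (.*?) '},\n",
    |     "]\n",
    |     "\n",
    |     "def extract_metrics(line, definitions):\n",
    |     "    # Return {metric name: float} for every regex that matches the line.\n",
    |     "    found = {}\n",
    |     "    for definition in definitions:\n",
    |     "        match = re.search(definition['Regex'], line)\n",
    |     "        if match:\n",
    |     "            found[definition['Name']] = float(match.group(1))\n",
    |     "    return found\n",
    |     "\n",
    |     "metrics = extract_metrics(log_line, metric_definitions)\n",
    |     "print(metrics)\n",
    |     "```\n",
    |     "\n",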
 17 |     "This lab covers metric monitoring with Amazon SageMaker. The first half of the lab uses Amazon CloudWatch for monitoring.\n",
 18 |     "The second half uses the training logs that Keras emits to monitor training progress in TensorBoard."
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "markdown",
 23 |    "metadata": {},
 24 |    "source": [
 25 |     "## Defining Training Metrics (Amazon SageMaker Python SDK)\n",
 26 |     "\n",
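 27 |     "When you initialize an Estimator object, define the metrics you want to send to CloudWatch by passing a list of metric names and regular expressions as the metric_definitions argument. For example, to monitor both the train:error and validation:error metrics in CloudWatch, initialize the Estimator as follows.\n",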
 28 |     "\n",
 29 |     "```python \n",
 30 |     "estimator = Estimator(image_name=ImageName,\n",
 31 |     "            role='SageMakerRole', train_instance_count=1,\n",
 32 |     "            train_instance_type='ml.c4.xlarge',            \n",
 33 |     "            k=10,\n",
 34 |     "            metric_definitions=[\n",
 35 |     "                   {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},\n",
 36 |     "                   {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}\n",
 37 |     "            ])\n",
 38 |     "```"
 39 |    ]
 40 |   },
 41 |   {
 42 |    "cell_type": "markdown",
 43 |    "metadata": {},
 44 |    "source": [
 45 |     "## Monitoring the CIFAR10 training\n",
 46 |     "Try the following steps yourself.\n",
 47 |     "- In the SageMaker console, find the training job you ran earlier (cifar10_keras_sm).\n",
 48 |     "- Open the job details and look at the CloudWatch logs.\n",
 49 |     "- Write metric regular expressions that match the logs. Use a regex tool to check each expression, and use `()` to capture each metric value."
 50 |    ]
 51 |   },
 52 |   {
 53 |    "cell_type": "code",
 54 |    "execution_count": null,
 55 |    "metadata": {},
 56 |    "outputs": [],
 57 |    "source": [
 58 |     "metric_definitions = [\n",
 59 |     "    {'Name': 'train:loss', 'Regex': 'loss: (.*?) '},\n",
 60 |     "    {'Name': 'train:accuracy', 'Regex': 'acc: (.*?) '},\n",
 61 |     "    {'Name': 'validation:loss', 'Regex': 'val_loss: (.*?) '},\n",
 62 |     "    {'Name': 'validation:accuracy', 'Regex': 'val_acc: (.*?) '}\n",
 63 |     "]"
 64 |    ]
 65 |   },
 66 |   {
 67 |    "cell_type": "code",
 68 |    "execution_count": null,
 69 |    "metadata": {},
 70 |    "outputs": [],
 71 |    "source": [
 72 |     "import os\n",
 73 |     "import sagemaker\n",
 74 |     "from sagemaker import get_execution_role\n",
 75 |     "\n",
 76 |     "sagemaker_session = sagemaker.Session()\n",
 77 |     "role = get_execution_role()"
 78 |    ]
 79 |   },
 80 |   {
 81 |    "cell_type": "code",
 82 |    "execution_count": null,
 83 |    "metadata": {},
 84 |    "outputs": [],
 85 |    "source": [
 86 |     "prefix = 'data/DEMO-cifar10'\n",
 87 |     "dataset_location = os.path.join('s3://', sagemaker_session.default_bucket(), prefix)"
 88 |    ]
 89 |   },
 90 |   {
 91 |    "cell_type": "markdown",
 92 |    "metadata": {},
 93 |    "source": [
 94 |     "Copy the estimator creation code from `0_Running_TensorFlow_In_SageMaker.ipynb` as is,\n",
 95 |     "then add the `metric_definitions=metric_definitions` argument."
 96 |    ]
 97 |   },
 98 |   {
 99 |    "cell_type": "code",
100 |    "execution_count": null,
101 |    "metadata": {},
102 |    "outputs": [],
103 |    "source": [
104 |     "from sagemaker.tensorflow import TensorFlow\n",
105 |     "estimator = TensorFlow(base_job_name='cifar10-cloudwatch',\n",
106 |     "                       entry_point='cifar10_keras_sm.py',\n",
107 |     "                       source_dir='training_script',\n",
108 |     "                       role=role,\n",
109 |     "                       framework_version='1.12.0',\n",
110 |     "                       py_version='py3',\n",
111 |     "                       script_mode=True,                       \n",
112 |     "                       hyperparameters={'epochs': 5},\n",
113 |     "                       train_instance_count=1,\n",
114 |     "                       train_instance_type='ml.p2.xlarge',\n",
115 |     "                       metric_definitions=metric_definitions)  # added"
116 |    ]
117 |   },
118 |   {
119 |    "cell_type": "code",
120 |    "execution_count": null,
121 |    "metadata": {},
122 |    "outputs": [],
123 |    "source": [
124 |     "%%time\n",
125 |     "estimator.fit({'train':'{}/train'.format(dataset_location),\n",
126 |     "              'validation':'{}/validation'.format(dataset_location),\n",
127 |     "              'eval':'{}/eval'.format(dataset_location)})"
128 |    ]
129 |   },
130 |   {
131 |    "cell_type": "markdown",
132 |    "metadata": {},
133 |    "source": [
134 |     "### View the job training metrics\n",
135 |     "SageMaker used the regular expressions configured above to send the job metrics to CloudWatch.\n",
136 |     "You can now view the job metrics directly in the SageMaker console.\n",
137 |     "\n",
138 |     "Log in to the [SageMaker console](https://console.aws.amazon.com/sagemaker/home), select the latest training job, and scroll down to the Monitor section.\n",
139 |     "With CloudWatch metrics, you can change the period and configure the statistics.\n",
140 |     "\n",
141 |     "Use the next cell to find the metrics."
142 |    ]
143 |   },
144 |   {
145 |    "cell_type": "code",
146 |    "execution_count": null,
147 |    "metadata": {},
148 |    "outputs": [],
149 |    "source": [
150 |     "from IPython.display import Markdown, display\n",
151 |     "\n",
152 |     "link = 'https://console.aws.amazon.com/cloudwatch/home?region='+sagemaker_session.boto_region_name+'#metricsV2:query=%7B/aws/sagemaker/TrainingJobs,TrainingJobName%7D%20'+estimator.latest_training_job.job_name\n",
153 |     "display(Markdown('CloudWatch metrics: [link]('+link+')'))\n",
154 |     "display(Markdown('After you choose a metric, change the period to 1 Minute (Graphed Metrics -> Period)'))"
155 |    ]
156 |   },
167 |   {
168 |    "cell_type": "markdown",
169 |    "metadata": {},
170 |    "source": [
171 |     "## Monitor with TensorBoard\n",
172 |     "\n",
173 |     "Next, monitor the training job in real time with TensorBoard.<br>\n",
174 |     "TensorBoard provides the visualization and tooling needed for machine learning experiments:\n",
175 |     "* Tracking and visualizing metrics such as loss and accuracy\n",
176 |     "* Visualizing the model graph (ops and layers)\n",
177 |     "* Viewing histograms of weights, biases, or other tensors as they change over time\n",
178 |     "* Projecting embeddings to a lower-dimensional space\n",
179 |     "* Displaying image, text, and audio data\n",
180 |     "* And more\n",
181 |     "\n",
182 |     "**Make a copy of `training_script/cifar10_keras_sm.py` and save it as `training_script/cifar10_keras_tensorboard.py`.**\n",
183 |     "\n",
184 |     "Once you have the copy, work through the steps below.\n",
185 |     "\n",
186 |     "----\n",
187 |     "### TODO 1.\n",
188 |     "\n",
189 |     "First, modify the `cifar10_keras_tensorboard.py` script so that it sends training logs to TensorBoard.<br>\n",
190 |     "To use TensorBoard with Keras, add `from keras.callbacks import TensorBoard` at the top of the script.\n",
191 |     "\n",
192 |     "By default, Keras sends TensorBoard logs after every batch. Because writing logs to S3 slows the training job down,\n",
193 |     "it is better to configure the TensorBoard callback to send logs only at the end of each epoch.\n",
194 |     "\n",
195 |     "----\n",
196 |     "### TODO 2.\n",
197 |     "\n",
198 |     "Add the TensorBoard callback to the script (add the line below immediately after the ModelCheckpoint callback).\n",
199 |     "```python\n",
200 |     "callbacks.append(TensorBoard(log_dir=args.model_output_dir,update_freq='epoch'))\n",
201 |     "```\n",
202 |     "\n",
203 |     "<font color='blue'>**If you get stuck in this notebook, refer to the solution file `training_script/cifar10_keras_tensorboard_solution.py`.**</font>"
204 |    ]
205 |   },
206 |   {
207 |    "cell_type": "markdown",
208 |    "metadata": {},
209 |    "source": [
210 |     "### Run a training job with TensorBoard support"
211 |    ]
212 |   },
213 |   {
214 |    "cell_type": "code",
215 |    "execution_count": null,
216 |    "metadata": {},
217 |    "outputs": [],
218 |    "source": [
219 |     "from sagemaker.tensorflow import TensorFlow\n",
220 |     "estimator = TensorFlow(base_job_name='cifar10-tensorboard',\n",
222 |     "                       entry_point='cifar10_keras_tensorboard.py',\n",
223 |     "                       source_dir='training_script',\n",
224 |     "                       role=role,\n",
225 |     "                       framework_version='1.12.0',\n",
226 |     "                       py_version='py3',\n",
227 |     "                       hyperparameters={'epochs' : 5},\n",
228 |     "                       train_instance_count=1,\n",
229 |     "                       train_instance_type='ml.p2.xlarge',\n",
230 |     "                       metric_definitions=metric_definitions)"
231 |    ]
232 |   },
233 |   {
234 |    "cell_type": "code",
235 |    "execution_count": null,
236 |    "metadata": {},
237 |    "outputs": [],
238 |    "source": [
239 |     "estimator.fit({'train':'{}/train'.format(dataset_location),\n",
240 |     "              'validation':'{}/validation'.format(dataset_location),\n",
242 |     "              'eval':'{}/eval'.format(dataset_location)}, wait=True)  # use wait=False to run the job asynchronously"
244 |    ]
245 |   },
246 |   {
247 |    "cell_type": "markdown",
248 |    "metadata": {},
249 |    "source": [
250 |     "### Install TensorBoard on your local machine\n",
251 |     "\n",
252 |     "Install [TensorBoard](https://github.com/tensorflow/tensorboard) locally with `pip install tensorboard`.\n",
253 |     "To access the S3 log directory, set the default region for TensorBoard: set the `AWS_REGION` environment variable to the AWS region in which the training job runs.\n",
254 |     "For example: `AWS_REGION='us-east-2' tensorboard --logdir model_dir`.\n",
255 |     "\n",
256 |     "**The next cell prints the TensorBoard command for you.**\n",
257 |     "\n",
258 |     "To access model_dir (the S3 location), you need an AccessKey and a SecretKey.\n",
259 |     "If you are using Event Engine, you can get them from the dashboard below. Set the AccessKey and SecretKey in your command prompt, then continue with the guide below.<br>\n",
260 |     "https://dashboard.eventengine.run/dashboard"
261 |    ]
262 |   },
263 |   {
264 |    "cell_type": "code",
265 |    "execution_count": null,
266 |    "metadata": {},
267 |    "outputs": [],
268 |    "source": [
269 |     "!pip install tensorboard"
270 |    ]
271 |   },
272 |   {
273 |    "cell_type": "code",
274 |    "execution_count": null,
275 |    "metadata": {},
276 |    "outputs": [],
277 |    "source": [
278 |     "from IPython.display import Markdown, display\n",
279 |     "\n",
280 |     "link = 'AWS_REGION=\\''+sagemaker_session.boto_region_name+'\\' tensorboard --logdir ' + estimator.model_dir + ' --host localhost --port 6006'\n",
281 |     "display(Markdown('Take note of the output of this cell (the TensorBoard command).'))\n",
282 |     "display(Markdown(link))"
283 |    ]
284 |   },
285 |   {
286 |    "cell_type": "markdown",
287 |    "metadata": {},
288 |    "source": [
289 |     "**Then proceed in the following order.**\n",
290 |     "* In the AWS Console, go to the S3 service.\n",
291 |     "* Using the S3 location from the output above (for example, s3://sagemaker-us-east-2-057716757052/cifar10-tensorboard-2020-03-01-09-59-29-896), identify the bucket and the folder.<br>\n",
292 |     "(s3://<bucket name>/<folder name>)\n",
293 |     "* Navigate to bucket --> folder and create a folder named model.\n",
294 |     "* Download the file at bucket --> folder --> output --> model.tar.gz to your local machine.\n",
295 |     "* Extract model.tar.gz locally; it contains an events.out.tfevents.* file.<br>\n",
296 |     "(for example, events.out.tfevents.1583057000.ip-10-0-136-231.us-east-2.compute.internal)<br>\n",
297 |     "Upload this file to the bucket --> folder --> model folder you created earlier.\n",
298 |     "* Run the TensorBoard command from the cell above (for example, AWS_REGION='us-east-2' tensorboard --logdir s3://sagemaker-us-east-2-057716757052/cifar10-tensorboard-2020-03-01-09-59-29-896/model --host localhost --port 6006) in a local command prompt.\n",
299 |     "* Then open localhost:6006 in a browser (for example, Chrome).\n",
300 |     "* You should see TensorBoard as shown below.<br>\n",
301 |     "![tensorboard](./images/tensorboard.png)"
302 |    ]
303 |   },
304 |   {
305 |    "cell_type": "markdown",
306 |    "metadata": {},
307 |    "source": [
308 |     "**Well done.**\n",
309 |     "\n",
310 |     "You can now monitor your training jobs with CloudWatch metrics and TensorBoard.<br>\n",
311 |     "Before continuing to the next notebook, explore the other TensorBoard settings in the [TensorBoard callback configuration](https://keras.io/callbacks/#tensorboard)."
312 |    ]
313 |   }
314 |  ],
315 |  "metadata": {
316 |   "kernelspec": {
317 |    "display_name": "conda_tensorflow_p36",
318 |    "language": "python",
319 |    "name": "conda_tensorflow_p36"
320 |   },
321 |   "language_info": {
322 |    "codemirror_mode": {
323 |     "name": "ipython",
324 |     "version": 3
325 |    },
326 |    "file_extension": ".py",
327 |    "mimetype": "text/x-python",
328 |    "name": "python",
329 |    "nbconvert_exporter": "python",
330 |    "pygments_lexer": "ipython3",
331 |    "version": "3.6.5"
332 |   }
333 |  },
334 |  "nbformat": 4,
335 |  "nbformat_minor": 4
336 | }
337 | 


--------------------------------------------------------------------------------
/2_Using_Pipemode_input_for_big_datasets.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# [Module 3] Training with Pipe Mode using PipeModeDataset\n",
  8 |     "\n",
  9 |     "With Amazon SageMaker, you can create training jobs that use Pipe input mode. **With Pipe input mode, the training dataset in S3 is streamed directly to the training instance instead of first being downloaded to the training instance's local disk.** As a result, training jobs start sooner, finish faster, and need less disk space.\n",
 10 |     "\n",
 11 |     "SageMaker TensorFlow provides an implementation of `tf.data.Dataset` that makes it easy to use Pipe input mode in SageMaker. By replacing your `tf.data.Dataset` with a `sagemaker_tensorflow.PipeModeDataset`, you can read TFRecords as they are streamed to the training instance.\n",
 12 |     "\n",
 13 |     "In your entry_point script, you can use a `PipeModeDataset` just like a `Dataset`. The example below creates a `PipeModeDataset` that reads TFRecords from the 'training' channel.\n",
 14 |     "\n",
 15 |     "```python\n",
    |     "import tensorflow as tf\n",
 16 |     "from sagemaker_tensorflow import PipeModeDataset\n",
 17 |     "\n",
 18 |     "features = {\n",
 19 |     "    'data': tf.FixedLenFeature([], tf.string),\n",
 20 |     "    'labels': tf.FixedLenFeature([], tf.int64),\n",
 21 |     "}\n",
 22 |     "\n",
 23 |     "def parse(record):\n",
 24 |     "    parsed = tf.parse_single_example(record, features)\n",
 25 |     "    return ({\n",
 26 |     "        'data': tf.decode_raw(parsed['data'], tf.float64)\n",
 27 |     "    }, parsed['labels'])\n",
 28 |     "\n",
 29 |     "def train_input_fn(training_dir, hyperparameters):\n",
 30 |     "    ds = PipeModeDataset(channel='training', record_format='TFRecord')\n",
 31 |     "    ds = ds.repeat(20)\n",
 32 |     "    ds = ds.prefetch(10)\n",
 33 |     "    ds = ds.map(parse, num_parallel_calls=10)\n",
 34 |     "    ds = ds.batch(64)\n",
 35 |     "    return ds\n",
 36 |     "```\n",
 37 |     "\n",
 38 |     "To run a training job in Pipe input mode, pass `input_mode='Pipe'` to your TensorFlow Estimator, as in the example below.\n",
 39 |     "\n",
 40 |     "```python\n",
 41 |     "from sagemaker.tensorflow import TensorFlow\n",
 42 |     "\n",
 43 |     "tf_estimator = TensorFlow(entry_point='tf-train-with-pipemodedataset.py', role='SageMakerRole',\n",
 44 |     "                          train_instance_count=1, train_instance_type='ml.c5.2xlarge',\n",
 45 |     "                          framework_version='1.14.0', input_mode='Pipe')\n",
 46 |     "\n",
 47 |     "tf_estimator.fit('s3://bucket/path/to/training/data')\n",
 48 |     "```"
 49 |    ]
 50 |   },
 51 |   {
 52 |    "cell_type": "markdown",
 53 |    "metadata": {},
 54 |    "source": [
 55 |     "## Create a training script that supports Pipe mode datasets\n",
 56 |     "**Make a copy of `training_script/cifar10_keras_sm.py` and save it as `training_script/cifar10_keras_pipe.py`.**\n",
 57 |     "\n",
 58 |     "Once you have the copy, work through the steps below.\n",
 59 |     "\n",
 60 |     "----\n",
 61 |     "### TODO 1.\n",
 62 |     "In `cifar10_keras_pipe.py`, import `PipeModeDataset` as follows.\n",
 63 |     "```python\n",
 64 |     "from sagemaker_tensorflow import PipeModeDataset\n",
 65 |     "```\n",
 66 |     "----\n",
 67 |     "### TODO 2.\n",
 68 |     "In the function\n",
 69 |     "```python\n",
 70 |     "def _input(epochs, batch_size, channel, channel_name):\n",
 71 |     "```\n",
 72 |     "replace\n",
 73 |     "```python\n",
 74 |     "dataset = tf.data.TFRecordDataset(filenames)\n",
 75 |     "```\n",
 76 |     "with\n",
 77 |     "```python\n",
 78 |     "dataset = PipeModeDataset(channel=channel_name, record_format='TFRecord')\n",
 79 |     "```\n",
 80 |     "\n",
 81 |     "For more details, see the SageMaker Python SDK [documentation](https://sagemaker.readthedocs.io/en/stable/using_tf.html#training-with-pipe-mode-using-pipemodedataset).\n",
 82 |     "\n",
 83 |     "<font color='blue'>**If you get stuck in this notebook, refer to the solution file `training_script/cifar10_keras_pipe_solution.py`.**</font>"
 84 |    ]
 85 |   },
 86 |   {
 87 |    "cell_type": "code",
 88 |    "execution_count": null,
 89 |    "metadata": {},
 90 |    "outputs": [],
 91 |    "source": [
 92 |     "import os\n",
 93 |     "import sagemaker\n",
 94 |     "from sagemaker import get_execution_role\n",
 95 |     "\n",
 96 |     "sagemaker_session = sagemaker.Session()\n",
 97 |     "\n",
 98 |     "role = get_execution_role()"
 99 |    ]
100 |   },
101 |   {
102 |    "cell_type": "code",
103 |    "execution_count": null,
104 |    "metadata": {},
105 |    "outputs": [],
106 |    "source": [
107 |     "prefix = 'data/DEMO-cifar10'\n",
108 |     "dataset_location = os.path.join('s3://', sagemaker_session.default_bucket(), prefix)"
109 |    ]
110 |   },
111 |   {
112 |    "cell_type": "markdown",
113 |    "metadata": {},
114 |    "source": [
115 |     "Enable Pipe mode with `input_mode='Pipe'`, then train for 5 epochs.\n",
116 |     "Add the metric_definitions argument to every job.<br>\n",
117 |     "Also, set `base_job_name` to `'cifar10-pipe'` so the job is easy to find in the AWS console."
118 |    ]
119 |   },
120 |   {
121 |    "cell_type": "code",
122 |    "execution_count": null,
123 |    "metadata": {},
124 |    "outputs": [],
125 |    "source": [
126 |     "metric_definitions = [\n",
127 |     "    {'Name': 'train:loss', 'Regex': 'loss: (.*?) '},\n",
128 |     "    {'Name': 'train:accuracy', 'Regex': 'acc: (.*?) '},\n",
129 |     "    {'Name': 'validation:loss', 'Regex': 'val_loss: (.*?) '},\n",
130 |     "    {'Name': 'validation:accuracy', 'Regex': 'val_acc: (.*?) '}\n",
131 |     "]"
132 |    ]
133 |   },
134 |   {
135 |    "cell_type": "code",
136 |    "execution_count": null,
137 |    "metadata": {},
138 |    "outputs": [],
139 |    "source": [
140 |     "from sagemaker.tensorflow import TensorFlow\n",
141 |     "estimator = TensorFlow(base_job_name='cifar10-pipe',\n",
142 |     "                       entry_point='cifar10_keras_pipe.py',\n",
143 |     "                       source_dir='training_script',\n",
144 |     "                       role=role,\n",
145 |     "                       framework_version='1.14.0',\n",
146 |     "                       py_version='py3',\n",
147 |     "                       script_mode=True,                       \n",
148 |     "                       hyperparameters={'epochs': 5},\n",
149 |     "                       train_instance_count=1,\n",
150 |     "                       train_instance_type='ml.p2.xlarge', \n",
151 |     "                       metric_definitions=metric_definitions, # see 1_Monitoring_your_TensorFlow_scripts.ipynb\n",
152 |     "                       input_mode='Pipe' # added\n",
153 |     "                      )"
154 |    ]
155 |   },
156 |   {
157 |    "cell_type": "code",
158 |    "execution_count": null,
159 |    "metadata": {},
160 |    "outputs": [],
161 |    "source": [
162 |     "%%time\n",
163 |     "estimator.fit({'train':'{}/train'.format(dataset_location),\n",
164 |     "              'validation':'{}/validation'.format(dataset_location),\n",
165 |     "              'eval':'{}/eval'.format(dataset_location)})"
166 |    ]
167 |   },
168 |   {
169 |    "cell_type": "markdown",
170 |    "metadata": {},
171 |    "source": [
172 |     "**Well done.**\n",
173 |     "\n",
174 |     "In this lab, you defined a `PipeModeDataset` and trained with it. On large datasets, Pipe mode reduces training time and saves local disk space on the training instance.<br>\n",
175 |     "Before continuing to the next notebook, take a look at the Pipe mode job metrics in CloudWatch and TensorBoard."
176 |    ]
177 |   }
178 |  ],
179 |  "metadata": {
180 |   "kernelspec": {
181 |    "display_name": "conda_tensorflow_p36",
182 |    "language": "python",
183 |    "name": "conda_tensorflow_p36"
184 |   },
185 |   "language_info": {
186 |    "codemirror_mode": {
187 |     "name": "ipython",
188 |     "version": 3
189 |    },
190 |    "file_extension": ".py",
191 |    "mimetype": "text/x-python",
192 |    "name": "python",
193 |    "nbconvert_exporter": "python",
194 |    "pygments_lexer": "ipython3",
195 |    "version": "3.6.5"
196 |   }
197 |  },
198 |  "nbformat": 4,
199 |  "nbformat_minor": 2
200 | }
201 | 


--------------------------------------------------------------------------------
/3_Distributed_training_with_Horovod.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# [Module 4] Distributed training with Horovod\n",
  8 |     "\n",
  9 |     "Horovod is a distributed training framework based on MPI (Message Passing Interface). Horovod is only available with TensorFlow version 1.12 or newer. For details, see the [Horovod README](https://github.com/uber/horovod).\n",
 10 |     "\n",
 11 |     "Enabling Horovod requires small changes to the training script. In this lab, you will make those changes yourself."
 12 |    ]
 13 |   },
 14 |   {
 15 |    "cell_type": "markdown",
 16 |    "metadata": {},
 17 |    "source": [
 18 |     "## Create a training script that supports Horovod distributed training\n",
 19 |     "\n",
 20 |     "Make a copy of `training_script/cifar10_keras_sm.py` **<font color='red'>(note: not a copy of `training_script/cifar10_keras_pipe.py`)</font>** and save it as `training_script/cifar10_keras_dist.py`.\n",
 21 |     "\n",
 22 |     "Once you have the copy, work through the steps below.\n",
 23 |     "\n",
 24 |     "----\n",
 25 |     "### TODO 1. Start Horovod\n",
 26 |     "To enable Horovod, add the following code to the `main()` function.\n",
 27 |     "\n",
 28 |     "```python\n",
 29 |     "    import horovod.keras as hvd\n",
 30 |     "    hvd.init()\n",
 31 |     "    config = tf.ConfigProto()\n",
 32 |     "    config.gpu_options.allow_growth = True\n",
 33 |     "    config.gpu_options.visible_device_list = str(hvd.local_rank())\n",
 34 |     "    K.set_session(tf.Session(config=config))\n",
 35 |     "```\n",
 36 |     "\n",
 37 |     "----\n",
 38 |     "### TODO 2. Configure callbacks\n",
 39 |     "Add the following callbacks in the `main()` function.\n",
 40 |     "\n",
 41 |     "```python\n",
 42 |     "    callbacks.append(hvd.callbacks.BroadcastGlobalVariablesCallback(0))\n",
 43 |     "    callbacks.append(hvd.callbacks.MetricAverageCallback())\n",
 44 |     "    callbacks.append(hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1))\n",
 45 |     "```\n",
 46 |     "\n",
 47 |     "Change the checkpoint and TensorBoard callbacks so that they run only when `hvd.rank() == 0`.\n",
 48 |     "```python\n",
 49 |     "    if hvd.rank() == 0:\n",
 50 |     "        callbacks.append(ModelCheckpoint(args.output_dir + '/checkpoint-{epoch}.h5'))\n",
 51 |     "```\n",
 52 |     "\n",
 53 |     "----\n",
 54 |     "### TODO 3. Configure the optimizer\n",
 55 |     "To enable Horovod, work through the following steps.\n",
 56 |     "\n",
 57 |     "\n",
 58 |     "1) Add an hvd argument to the `keras_model_fn` function.\n",
 59 |     "```python\n",
 60 |     "# Add hvd to the function signature, and also pass it in the function call\n",
 61 |     "def keras_model_fn(learning_rate, weight_decay, optimizer, momentum, hvd):\n",
 62 |     "```\n",
 63 |     "\n",
 64 |     "2) Change `size=1` to `size=hvd.size()`.\n",
 65 |     "\n",
 66 |     "3) Immediately before\n",
 67 |     "\n",
 68 |     "```python\n",
 69 |     "    model.compile(loss='categorical_crossentropy',\n",
 70 |     "                  optimizer=opt,\n",
 71 |     "                  metrics=['accuracy'])\n",
 72 |     "```\n",
 73 |     "add the line\n",
 74 |     "```python\n",
 75 |     "opt = hvd.DistributedOptimizer(opt)\n",
 76 |     "```\n",
 78 |     "\n",
 79 |     "4) In the `main()` function, pass hvd as an argument when creating the model instance.\n",
 80 |     "\n",
 81 |     "```python\n",
 82 |     "model = keras_model_fn(args.learning_rate, args.weight_decay, args.optimizer, args.momentum, hvd)\n",
 83 |     "```\n",
 84 |     "\n",
 85 |     "<font color='blue'>**If you get stuck in this notebook, refer to the solution file `training_script/cifar10_keras_dist_solution.py`.**</font>"
 86 |    ]
 87 |   },
 88 |   {
 89 |    "cell_type": "markdown",
 90 |    "metadata": {},
 91 |    "source": [
 92 |     "## Run Distributed training\n",
 93 |     "You can configure Horovod distributed training by passing the following setting to the Estimator object.\n",
 94 |     "```python\n",
 95 |     "distributions = {'mpi': {\n",
 96 |     "                    'enabled': True,\n",
 97 |     "                    'processes_per_host': 1  # number of Horovod processes per host\n",
 98 |     "                        }\n",
 99 |     "                }\n",
100 |     "```"
101 |    ]
102 |   },
103 |   {
104 |    "cell_type": "code",
105 |    "execution_count": null,
106 |    "metadata": {},
107 |    "outputs": [],
108 |    "source": [
109 |     "import os\n",
110 |     "import sagemaker\n",
111 |     "from sagemaker import get_execution_role\n",
112 |     "\n",
113 |     "sagemaker_session = sagemaker.Session()\n",
114 |     "\n",
115 |     "role = get_execution_role()"
116 |    ]
117 |   },
118 |   {
119 |    "cell_type": "code",
120 |    "execution_count": null,
121 |    "metadata": {},
122 |    "outputs": [],
123 |    "source": [
124 |     "prefix = 'data/DEMO-cifar10'\n",
125 |     "dataset_location = os.path.join('s3://', sagemaker_session.default_bucket(), prefix)"
126 |    ]
127 |   },
128 |   {
129 |    "cell_type": "code",
130 |    "execution_count": null,
131 |    "metadata": {},
132 |    "outputs": [],
133 |    "source": [
134 |     "metric_definitions = [\n",
135 |     "    {'Name': 'train:loss', 'Regex': 'loss: (.*?) '},\n",
136 |     "    {'Name': 'train:accuracy', 'Regex': 'acc: (.*?) '},\n",
137 |     "    {'Name': 'validation:loss', 'Regex': 'val_loss: (.*?) '},\n",
138 |     "    {'Name': 'validation:accuracy', 'Regex': 'val_acc: (.*?) '}\n",
139 |     "]"
140 |    ]
141 |   },
142 |   {
143 |    "cell_type": "markdown",
144 |    "metadata": {},
145 |    "source": [
146 |     "Set the `train_instance_count` argument to 2 and add the `distributions` argument.<br>\n",
147 |     "This time, train for 10 epochs instead of 5."
148 |    ]
149 |   },
150 |   {
151 |    "cell_type": "code",
152 |    "execution_count": null,
153 |    "metadata": {},
154 |    "outputs": [],
155 |    "source": [
156 |     "from sagemaker.tensorflow import TensorFlow\n",
157 |     "\n",
158 |     "distributions = {\n",
159 |     "    'mpi': {\n",
160 |     "        'enabled': True, \n",
161 |     "        'custom_mpi_options': '-verbose --NCCL_DEBUG=INFO',\n",
162 |     "        'processes_per_host': 1\n",
163 |     "    }\n",
164 |     "}\n",
165 |     "\n",
166 |     "# Change base_job_name to 'cifar10-dist' for console visibility\n",
167 |     "estimator = TensorFlow(base_job_name='cifar10-dist',\n",
168 |     "                       entry_point='cifar10_keras_dist.py',\n",
169 |     "                       source_dir='training_script',\n",
170 |     "                       role=role,\n",
171 |     "                       framework_version='1.14.0',\n",
172 |     "                       py_version='py3',\n",
173 |     "                       script_mode=True,                            \n",
174 |     "                       hyperparameters={'epochs': 10},\n",
175 |     "                       train_instance_count=2,   # changed\n",
176 |     "                       train_instance_type='ml.p2.xlarge',\n",
177 |     "                       metric_definitions=metric_definitions, # see 1_Monitoring_your_TensorFlow_scripts.ipynb\n",
178 |     "                       distributions=distributions # added\n",
179 |     "                      )"
180 |    ]
181 |   },
182 |   {
183 |    "cell_type": "markdown",
184 |    "metadata": {},
185 |    "source": [
186 |     "After training completes, also check the Billable seconds. Billable seconds is the time you are actually charged for the training run.\n",
187 |     "```\n",
188 |     "Billable seconds: <time>\n",
189 |     "```\n",
190 |     "\n",
191 |     "For reference, with `ml.p2.xlarge` instances, 10 epochs take about 6 to 7 minutes overall, of which the actual training takes about 3 to 4 minutes."
192 |    ]
193 |   },
194 |   {
195 |    "cell_type": "code",
196 |    "execution_count": null,
197 |    "metadata": {},
198 |    "outputs": [],
199 |    "source": [
200 |     "%%time\n",
201 |     "estimator.fit({'train':'{}/train'.format(dataset_location),\n",
202 |     "              'validation':'{}/validation'.format(dataset_location),\n",
203 |     "              'eval':'{}/eval'.format(dataset_location)})"
204 |    ]
205 |   },
206 |   {
207 |    "cell_type": "markdown",
208 |    "metadata": {},
209 |    "source": [
210 |     "**Well done.**\n",
211 |     "\n",
212 |     "You can now use SageMaker training jobs for distributed training.\n",
213 |     "Before continuing to the next notebook, take a look at the distributed job metrics in CloudWatch and TensorBoard.\n",
214 |     "You can use TensorBoard to compare the different jobs you have run.\n",
215 |     "\n",
216 |     "When launching TensorBoard, use an argument like the following.<br>\n",
217 |     "`--logdir dist:dist_model_dir,pipe:pipe_model_dir,file:normal_job_model_dir`"
218 |    ]
219 |   }
220 |  ],
221 |  "metadata": {
222 |   "kernelspec": {
223 |    "display_name": "conda_tensorflow_p36",
224 |    "language": "python",
225 |    "name": "conda_tensorflow_p36"
226 |   },
227 |   "language_info": {
228 |    "codemirror_mode": {
229 |     "name": "ipython",
230 |     "version": 3
231 |    },
232 |    "file_extension": ".py",
233 |    "mimetype": "text/x-python",
234 |    "name": "python",
235 |    "nbconvert_exporter": "python",
236 |    "pygments_lexer": "ipython3",
237 |    "version": "3.6.5"
238 |   }
239 |  },
240 |  "nbformat": 4,
241 |  "nbformat_minor": 4
242 | }
243 | 


--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 | 


--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
 1 | # Contributing Guidelines
 2 | 
 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
 4 | documentation, we greatly value feedback and contributions from our community.
 5 | 
 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
 7 | information to effectively respond to your bug report or contribution.
 8 | 
 9 | 
10 | ## Reporting Bugs/Feature Requests
11 | 
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 | 
14 | When filing an issue, please check [existing open](https://github.com/aws-samples/amazon-sagemaker-workshop-with-tensorflow/issues), or [recently closed](https://github.com/aws-samples/amazon-sagemaker-workshop-with-tensorflow/issues?utf8=%E2%9C%93&q=is%3Aissue%20is%3Aclosed%20), issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 | 
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 | 
22 | 
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 | 
26 | 1. You are working against the latest source on the *master* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 | 
30 | To send us a pull request, please:
31 | 
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 | 
39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 | 
42 | 
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any ['help wanted'](https://github.com/aws-samples/amazon-sagemaker-workshop-with-tensorflow/labels/help%20wanted) issues is a great place to start.
45 | 
46 | 
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 | 
52 | 
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.
55 | 
56 | 
57 | ## Licensing
58 | 
59 | See the [LICENSE](https://github.com/aws-samples/amazon-sagemaker-workshop-with-tensorflow/blob/master/LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 | 
61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes.
62 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
 2 | 
 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of
 4 | this software and associated documentation files (the "Software"), to deal in
 5 | the Software without restriction, including without limitation the rights to
 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
 7 | the Software, and to permit persons to whom the Software is furnished to do so.
 8 | 
 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # SageMaker Workshop: Training TensorFlow-Keras Models on Amazon SageMaker
 2 | ### This document is the Korean version of [Running your TensorFlow Models in SageMaker Workshop](https://github.com/aws-samples/TensorFlow-in-SageMaker-workshop); it differs from the original as follows.
 3 | - Minor typo fixes
 4 | - Substantially expanded supplementary explanations
 5 | - Solution code included (the original version does not provide code solutions.)
 6 | - Support for TensorFlow 1.14 and 2.0 (the original supports TensorFlow 1.12)
 7 | 
 8 | 
 9 | ## Introduction
10 | 
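11 | TensorFlow™ lets developers get started with deep learning in the cloud quickly and easily.
12 | The framework is used across many industries and is widely used for deep learning research and application development, particularly in areas such as computer vision, natural language understanding, and speech translation.
13 | With Amazon SageMaker, a platform for building, training, and deploying machine learning models at scale, you can get started on AWS with a fully managed TensorFlow environment.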
14 | 
15 | ## Use Machine Learning Frameworks with Amazon SageMaker
16 | 
17 | The Amazon SageMaker Python SDK provides open source APIs and containers that make it easy to train and deploy models on Amazon SageMaker with a variety of machine learning and deep learning frameworks. For general information about the Amazon SageMaker Python SDK, see https://sagemaker.readthedocs.io/.
18 | 
19 | You can use Amazon SageMaker to train and deploy a model with custom TensorFlow code. With the Amazon SageMaker Python SDK TensorFlow Estimator and model, together with the Amazon SageMaker open source TensorFlow containers, you can write a TensorFlow script and run it easily on Amazon SageMaker.
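
As a rough illustration of what "writing a TensorFlow script" for SageMaker involves, the sketch below shows how a script-mode entry point might read its settings. It mirrors the argument names used by the scripts under `training_script/` (`--epochs`, `--batch-size`, `--learning-rate`, `--train`, `--model_dir`). Falling back to the `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR` environment variables set by SageMaker training containers is an assumption made for this example, and `parse_args` is a hypothetical helper, not code taken from the workshop.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and data locations for a script-mode entry point.

    Hypothetical helper for illustration only; argument names mirror the
    workshop's training scripts.
    """
    parser = argparse.ArgumentParser()

    # Hyperparameters are passed to the script as command-line arguments.
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch-size', type=int, default=128)
    parser.add_argument('--learning-rate', type=float, default=0.001)

    # Assumption: when no explicit path is given, fall back to the
    # environment variables that SageMaker training containers set.
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--model_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR'))

    return parser.parse_args(argv)


if __name__ == '__main__':
    # Local dry run with explicit values instead of SageMaker-provided ones.
    args = parse_args(['--epochs', '5', '--train', '/tmp/train'])
    print(args.epochs, args.batch_size, args.train)
```

Passing an explicit list to `parse_args` makes it possible to try the parsing logic locally before launching a training job; inside a training container the function would typically be called with no arguments so that `sys.argv` is used.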
20 | 
21 | This workshop shows how to run TensorFlow sample code on Amazon SageMaker.
22 | For details on using TensorFlow with the SageMaker Python SDK, see the API references.
23 | 
24 | The workshop consists of the five modules below.
25 | 
26 | 1. [Porting a TensorFlow script to run in SageMaker using SageMaker script mode.](0_Running_TensorFlow_In_SageMaker.ipynb)
27 |     - [Porting a TensorFlow script to run in SageMaker using SageMaker script mode. (TensorFlow 2.0)](0_Running_TensorFlow_In_SageMaker_tf2.ipynb)
28 | 2. [Monitoring your training job using TensorBoard and Amazon CloudWatch metrics.](1_Monitoring_your_TensorFlow_scripts.ipynb)
29 | 3. [Optimizing your training job using SageMaker pipemode input.](2_Using_Pipemode_input_for_big_datasets.ipynb)
30 | 4. [Running a distributed training job.](3_Distributed_training_with_Horovod.ipynb)
31 | 5. [Deploying your trained model on Amazon SageMaker.](4_Deploying_your_TensorFlow_model.ipynb)
32 | 
33 | 
34 | ## License Summary
35 | 
36 | This sample code is made available under the MIT-0 license. See the LICENSE file.
37 | 


--------------------------------------------------------------------------------
/images/TFRecord.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daekeun-ml/tensorflow-in-sagemaker-workshop/45f07ec6fff1ac36d4d50b08aaa7887d9086499b/images/TFRecord.png


--------------------------------------------------------------------------------
/images/tensorboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daekeun-ml/tensorflow-in-sagemaker-workshop/45f07ec6fff1ac36d4d50b08aaa7887d9086499b/images/tensorboard.png


--------------------------------------------------------------------------------
/local_mode_setup.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # Do we have GPU support?
 4 | nvidia-smi > /dev/null 2>&1
 5 | if [ $? -eq 0 ]; then
 6 |   # check if we have nvidia-docker
 7 |   NVIDIA_DOCKER=`rpm -qa | grep -c nvidia-docker2`
 8 |   if [ $NVIDIA_DOCKER -eq 0 ]; then
 9 |     # Install nvidia-docker2
10 |     DOCKER_VERSION=`yum list docker | tail -1 | awk '{print $2}' | head -c 2`
11 | 
12 |     if [ $DOCKER_VERSION -eq 17 ]; then
13 |       DOCKER_PKG_VERSION='17.09.1ce-1.111.amzn1'
14 |       NVIDIA_DOCKER_PKG_VERSION='2.0.3-1.docker17.09.1.ce.amzn1'
15 |     else
16 |       DOCKER_PKG_VERSION='18.06.1ce-3.17.amzn1'
17 |       NVIDIA_DOCKER_PKG_VERSION='2.0.3-1.docker18.06.1.ce.amzn1'
18 |     fi
19 | 
20 |     sudo yum -y remove docker
21 |     sudo yum -y install docker-$DOCKER_PKG_VERSION
22 | 
23 |     sudo /etc/init.d/docker start
24 | 
25 |     curl -s -L https://nvidia.github.io/nvidia-docker/amzn1/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
26 |     sudo yum install -y nvidia-docker2-$NVIDIA_DOCKER_PKG_VERSION
27 |     sudo cp daemon.json /etc/docker/daemon.json
28 |     sudo pkill -SIGHUP dockerd
29 |     echo "installed nvidia-docker2"
30 |   else
31 |     echo "nvidia-docker2 already installed. We are good to go!"
32 |   fi
33 | fi
34 | 
35 | # This is common for both GPU and CPU instances
36 | 
37 | # check if we have docker-compose
38 | docker-compose version >/dev/null 2>&1
39 | if [ $? -ne 0 ]; then
40 |   # install docker compose
41 |   pip install docker-compose
42 | fi
43 | 
44 | # check if we need to configure our docker interface
45 | SAGEMAKER_NETWORK=`docker network ls | grep -c sagemaker-local`
46 | if [ $SAGEMAKER_NETWORK -eq 0 ]; then
47 |   docker network create --driver bridge sagemaker-local
48 | fi
49 | 
50 | # Notebook instance Docker networking fixes
51 | RUNNING_ON_NOTEBOOK_INSTANCE=`sudo iptables -S OUTPUT -t nat | grep -c 169.254.0.2`
52 | 
53 | # Get the Docker Network CIDR and IP for the sagemaker-local docker interface.
54 | SAGEMAKER_INTERFACE=br-`docker network ls | grep sagemaker-local | cut -d' ' -f1`
55 | DOCKER_NET=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f1`
56 | DOCKER_IP=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f12`
57 | 
58 | # check if both IPTables and the Route Table are OK.
59 | IPTABLES_PATCHED=`sudo iptables -S PREROUTING -t nat | grep -c $SAGEMAKER_INTERFACE`
60 | ROUTE_TABLE_PATCHED=`sudo ip route show table agent | grep -c $SAGEMAKER_INTERFACE`
61 | 
62 | if [ $RUNNING_ON_NOTEBOOK_INSTANCE -gt 0 ]; then
63 | 
64 |   if [ $ROUTE_TABLE_PATCHED -eq 0 ]; then
65 |     # fix routing
66 |     sudo ip route add $DOCKER_NET via $DOCKER_IP dev $SAGEMAKER_INTERFACE table agent
67 |   else
68 |     echo "SageMaker instance route table setup is ok. We are good to go."
69 |   fi
70 | 
71 |   if [ $IPTABLES_PATCHED -eq 0 ]; then
72 |     sudo iptables -t nat -A PREROUTING  -i $SAGEMAKER_INTERFACE -d 169.254.169.254/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 169.254.0.2:9081
73 |     echo "iptables for Docker setup done"
74 |   else
75 |     echo "SageMaker instance routing for Docker is ok. We are good to go!"
76 |   fi
77 | fi
78 | 


--------------------------------------------------------------------------------
/training_script/cifar10_keras.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
  2 | #
  3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
  4 | # software and associated documentation files (the "Software"), to deal in the Software
  5 | # without restriction, including without limitation the rights to use, copy, modify,
  6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
  7 | # permit persons to whom the Software is furnished to do so.
  8 | #
  9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 15 | 
 16 | from __future__ import absolute_import
 17 | from __future__ import division
 18 | from __future__ import print_function
 19 | 
 20 | import argparse
 21 | import logging
 22 | import os
 23 | 
 24 | from keras.callbacks import ModelCheckpoint
 25 | from keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D, BatchNormalization
 26 | from keras.models import Sequential
 27 | from keras.optimizers import Adam, SGD, RMSprop
 28 | import tensorflow as tf
 29 | from keras import backend as K
 30 | 
 31 | sess = tf.Session()
 32 | K.set_session(sess)
 33 | 
 34 | logging.getLogger().setLevel(logging.INFO)
 35 | tf.logging.set_verbosity(tf.logging.INFO)
 36 | HEIGHT = 32
 37 | WIDTH = 32
 38 | DEPTH = 3
 39 | NUM_CLASSES = 10
 40 | NUM_DATA_BATCHES = 5
 41 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES
 42 | INPUT_TENSOR_NAME = 'inputs_input'  # needs to match the name of the first layer + "_input"
 43 | 
 44 | def keras_model_fn(learning_rate, weight_decay, optimizer, momentum):
 45 |     """Build and compile the Keras CNN from hyperparameters passed by the training job.
 46 |     main() trains the model with model.fit, and save_model() exports it as a
 47 |     TensorFlow Serving SavedModel at the end of training.
 48 | 
 49 |     Args:
 50 |         learning_rate, weight_decay, optimizer, momentum: Hyperparameters passed to the
 51 |                          SageMaker TrainingJob that runs this training script.
 52 |     Returns: A compiled Keras model
 53 |     """
 54 |     model = Sequential()
 55 |     model.add(Conv2D(32, (3, 3), padding='same', name='inputs', input_shape=(HEIGHT, WIDTH, DEPTH)))
 56 |     model.add(BatchNormalization())
 57 |     model.add(Activation('relu'))
 58 |     model.add(Conv2D(32, (3, 3)))
 59 |     model.add(BatchNormalization())
 60 |     model.add(Activation('relu'))
 61 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 62 |     model.add(Dropout(0.2))
 63 | 
 64 |     model.add(Conv2D(64, (3, 3), padding='same'))
 65 |     model.add(BatchNormalization())
 66 |     model.add(Activation('relu'))
 67 |     model.add(Conv2D(64, (3, 3)))
 68 |     model.add(BatchNormalization())
 69 |     model.add(Activation('relu'))
 70 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 71 |     model.add(Dropout(0.3))
 72 | 
 73 |     model.add(Conv2D(128, (3, 3), padding='same'))
 74 |     model.add(BatchNormalization())
 75 |     model.add(Activation('relu'))
 76 |     model.add(Conv2D(128, (3, 3)))
 77 |     model.add(BatchNormalization())
 78 |     model.add(Activation('relu'))
 79 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 80 |     model.add(Dropout(0.4))
 81 | 
 82 |     model.add(Flatten())
 83 |     model.add(Dense(512))
 84 |     model.add(Activation('relu'))
 85 |     model.add(Dropout(0.5))
 86 |     model.add(Dense(NUM_CLASSES))
 87 |     model.add(Activation('softmax'))
 88 | 
 89 |     size = 1
 90 | 
 91 |     if optimizer.lower() == 'sgd':
 92 |         opt = SGD(lr=learning_rate * size, decay=weight_decay, momentum=momentum)
 93 |     elif optimizer.lower() == 'rmsprop':
 94 |         opt = RMSprop(lr=learning_rate * size, decay=weight_decay)
 95 |     else:
 96 |         opt = Adam(lr=learning_rate * size, decay=weight_decay)
 97 | 
 98 |     model.compile(loss='categorical_crossentropy',
 99 |                   optimizer=opt,
100 |                   metrics=['accuracy'])
101 |     return model
102 | 
103 | 
104 | def get_filenames(channel_name, channel):
105 |     if channel_name in ['train', 'validation', 'eval']:
106 |         return [os.path.join(channel, channel_name + '.tfrecords')]
107 |     else:
108 |         raise ValueError('Invalid data subset "%s"' % channel_name)
109 | 
110 | 
111 | def train_input_fn():
112 |     return _input(args.epochs, args.batch_size, args.train, 'train')
113 | 
114 | 
115 | def eval_input_fn():
116 |     return _input(args.epochs, args.batch_size, args.eval, 'eval')
117 | 
118 | 
119 | def validation_input_fn():
120 |     return _input(args.epochs, args.batch_size, args.validation, 'validation')
121 | 
122 | 
123 | def _input(epochs, batch_size, channel, channel_name):
124 | 
125 |     filenames = get_filenames(channel_name, channel)
126 |     dataset = tf.data.TFRecordDataset(filenames)
127 | 
128 |     dataset = dataset.repeat(epochs)
129 |     dataset = dataset.prefetch(10)
130 | 
131 |     # Parse records.
132 |     dataset = dataset.map(
133 |         _dataset_parser, num_parallel_calls=10)
134 | 
135 |     # Potentially shuffle records.
136 |     if channel_name == 'train':
137 |         # Ensure that the capacity is sufficiently large to provide good random
138 |         # shuffling.
139 |         buffer_size = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 0.4) + 3 * batch_size
140 |         dataset = dataset.shuffle(buffer_size=buffer_size)
141 | 
142 |     # Batch it up.
143 |     dataset = dataset.batch(batch_size, drop_remainder=True)
144 |     iterator = dataset.make_one_shot_iterator()
145 |     image_batch, label_batch = iterator.get_next()
146 | 
147 |     return {INPUT_TENSOR_NAME: image_batch}, label_batch
148 | 
149 | 
150 | def _train_preprocess_fn(image):
151 |     """Preprocess a single training image of layout [height, width, depth]."""
152 |     # Resize the image to add four extra pixels on each side.
153 |     image = tf.image.resize_image_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8)
154 | 
155 |     # Randomly crop a [HEIGHT, WIDTH] section of the image.
156 |     image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH])
157 | 
158 |     # Randomly flip the image horizontally.
159 |     image = tf.image.random_flip_left_right(image)
160 | 
161 |     return image
162 | 
163 | 
164 | def _dataset_parser(value):
165 |     """Parse a CIFAR-10 record from value."""
166 |     featdef = {
167 |         'image': tf.FixedLenFeature([], tf.string),
168 |         'label': tf.FixedLenFeature([], tf.int64),
169 |     }
170 | 
171 |     example = tf.parse_single_example(value, featdef)
172 |     image = tf.decode_raw(example['image'], tf.uint8)
173 |     image.set_shape([DEPTH * HEIGHT * WIDTH])
174 | 
175 |     # Reshape from [depth * height * width] to [depth, height, width], then transpose to [height, width, depth].
176 |     image = tf.cast(
177 |         tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]),
178 |         tf.float32)
179 |     label = tf.cast(example['label'], tf.int32)
180 |     image = _train_preprocess_fn(image)
181 |     return image, tf.one_hot(label, NUM_CLASSES)
182 | 
183 | def save_model(model, output):
184 |     signature = tf.saved_model.signature_def_utils.predict_signature_def(
185 |         inputs={'inputs': model.input}, outputs={'scores': model.output})
186 | 
187 |     builder = tf.saved_model.builder.SavedModelBuilder(output+'/1/')
188 |     builder.add_meta_graph_and_variables(
189 |         sess=K.get_session(),
190 |         tags=[tf.saved_model.tag_constants.SERVING],
191 |         signature_def_map={"serving_default": signature})
192 |     builder.save()
193 | 
194 |     logging.info("Model successfully saved at: {}".format(output))
195 |     return
196 | 
197 | def main(args):
198 |     logging.info("getting data")
199 |     train_dataset = train_input_fn()
200 |     eval_dataset = eval_input_fn()
201 |     validation_dataset = validation_input_fn()
202 | 
203 |     logging.info("configuring model")
204 |     model = keras_model_fn(args.learning_rate, args.weight_decay, args.optimizer, args.momentum)
205 |     callbacks = []
206 | 
207 |     callbacks.append(ModelCheckpoint(args.model_dir + '/checkpoint-{epoch}.h5'))
208 | 
209 |     logging.info("Starting training")
210 |     model.fit(x=train_dataset[0], y=train_dataset[1],
211 |               steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size),
212 |               epochs=args.epochs, validation_data=validation_dataset,
213 |               validation_steps=(num_examples_per_epoch('validation') // args.batch_size), callbacks=callbacks)
214 | 
215 |     score = model.evaluate(eval_dataset[0], eval_dataset[1], steps=num_examples_per_epoch('eval') // args.batch_size,
216 |                            verbose=0)
217 | 
218 |     logging.info('Test loss:{}'.format(score[0]))
219 |     logging.info('Test accuracy:{}'.format(score[1]))
220 | 
221 |     return save_model(model, args.model_dir)
222 | 
223 | def num_examples_per_epoch(subset='train'):
224 |     if subset == 'train':
225 |         return 40000
226 |     elif subset == 'validation':
227 |         return 10000
228 |     elif subset == 'eval':
229 |         return 10000
230 |     else:
231 |         raise ValueError('Invalid data subset "%s"' % subset)
232 | 
233 | 
234 | if __name__ == '__main__':
235 |     parser = argparse.ArgumentParser()
236 |     parser.add_argument(
237 |         '--train',
238 |         type=str,
239 |         required=False,
240 |         help='The directory where the CIFAR-10 input data is stored.')
241 |     parser.add_argument(
242 |         '--validation',
243 |         type=str,
244 |         required=False,
245 |         help='The directory where the CIFAR-10 input data is stored.')
246 |     parser.add_argument(
247 |         '--eval',
248 |         type=str,
249 |         required=False,
250 |         help='The directory where the CIFAR-10 input data is stored.')
251 |     parser.add_argument(
252 |         '--model_dir',
253 |         type=str,
254 |         required=True,
255 |         help='The directory where the model will be stored.')
256 |     parser.add_argument(
257 |         '--weight-decay',
258 |         type=float,
259 |         default=2e-4,
260 |         help='Weight decay for convolutions.')
261 |     parser.add_argument(
262 |         '--learning-rate',
263 |         type=float,
264 |         default=0.001,
265 |         help="""\
266 |         This is the initial learning rate value. The learning rate will decrease
267 |         during training. For more details check the model_fn implementation in
268 |         this file.\
269 |         """)
270 |     parser.add_argument(
271 |         '--epochs',
272 |         type=int,
273 |         default=10,
274 |         help='The number of epochs to use for training.')
275 |     parser.add_argument(
276 |         '--batch-size',
277 |         type=int,
278 |         default=128,
279 |         help='Batch size for training.')
280 |     parser.add_argument(
281 |         '--optimizer',
282 |         type=str,
283 |         default='adam')
284 |     parser.add_argument(
285 |         '--momentum',
286 |         type=float,
287 |         default=0.9)
288 |     args = parser.parse_args()
289 |     main(args)


--------------------------------------------------------------------------------
/training_script/cifar10_keras_dist_solution.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
  2 | #
  3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
  4 | # software and associated documentation files (the "Software"), to deal in the Software
  5 | # without restriction, including without limitation the rights to use, copy, modify,
  6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
  7 | # permit persons to whom the Software is furnished to do so.
  8 | #
  9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 15 | 
 16 | from __future__ import absolute_import
 17 | from __future__ import division
 18 | from __future__ import print_function
 19 | 
 20 | import argparse
 21 | import logging
 22 | import os
 23 | 
 24 | from keras.callbacks import ModelCheckpoint
 25 | from keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D, BatchNormalization
 26 | from keras.models import Sequential
 27 | from keras.optimizers import Adam, SGD, RMSprop
 28 | import tensorflow as tf
 29 | from keras import backend as K
 30 | 
 31 | # ----- Added -----
 32 | from keras.callbacks import TensorBoard
 33 | 
 34 | sess = tf.Session()
 35 | K.set_session(sess)
 36 | 
 37 | logging.getLogger().setLevel(logging.INFO)
 38 | tf.logging.set_verbosity(tf.logging.INFO)
 39 | HEIGHT = 32
 40 | WIDTH = 32
 41 | DEPTH = 3
 42 | NUM_CLASSES = 10
 43 | NUM_DATA_BATCHES = 5
 44 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES
 45 | INPUT_TENSOR_NAME = 'inputs_input'  # needs to match the name of the first layer + "_input"
 46 | 
 47 | # ----- Modified -----
 48 | #def keras_model_fn(learning_rate, weight_decay, optimizer, momentum):
 49 | def keras_model_fn(learning_rate, weight_decay, optimizer, momentum, hvd):
 50 |     """Build and compile the Keras CNN from hyperparameters passed by the training job.
 51 |     main() trains the model with model.fit, and save_model() exports it as a
 52 |     TensorFlow Serving SavedModel at the end of training.
 53 | 
 54 |     Args:
 55 |         learning_rate, weight_decay, optimizer, momentum: Hyperparameters passed to the
 56 |                          SageMaker TrainingJob; hvd is the initialized horovod.keras module.
 57 |     Returns: A compiled Keras model
 58 |     """
 59 |     model = Sequential()
 60 |     model.add(Conv2D(32, (3, 3), padding='same', name='inputs', input_shape=(HEIGHT, WIDTH, DEPTH)))
 61 |     model.add(BatchNormalization())
 62 |     model.add(Activation('relu'))
 63 |     model.add(Conv2D(32, (3, 3)))
 64 |     model.add(BatchNormalization())
 65 |     model.add(Activation('relu'))
 66 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 67 |     model.add(Dropout(0.2))
 68 | 
 69 |     model.add(Conv2D(64, (3, 3), padding='same'))
 70 |     model.add(BatchNormalization())
 71 |     model.add(Activation('relu'))
 72 |     model.add(Conv2D(64, (3, 3)))
 73 |     model.add(BatchNormalization())
 74 |     model.add(Activation('relu'))
 75 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 76 |     model.add(Dropout(0.3))
 77 | 
 78 |     model.add(Conv2D(128, (3, 3), padding='same'))
 79 |     model.add(BatchNormalization())
 80 |     model.add(Activation('relu'))
 81 |     model.add(Conv2D(128, (3, 3)))
 82 |     model.add(BatchNormalization())
 83 |     model.add(Activation('relu'))
 84 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 85 |     model.add(Dropout(0.4))
 86 | 
 87 |     model.add(Flatten())
 88 |     model.add(Dense(512))
 89 |     model.add(Activation('relu'))
 90 |     model.add(Dropout(0.5))
 91 |     model.add(Dense(NUM_CLASSES))
 92 |     model.add(Activation('softmax'))
 93 | 
 94 |     #size = 1
 95 |     size = hvd.size() # ----- Modified -----
 96 |     
 97 |     if optimizer.lower() == 'sgd':
 98 |         opt = SGD(lr=learning_rate * size, decay=weight_decay, momentum=momentum)
 99 |     elif optimizer.lower() == 'rmsprop':
100 |         opt = RMSprop(lr=learning_rate * size, decay=weight_decay)
101 |     else:
102 |         opt = Adam(lr=learning_rate * size, decay=weight_decay)
103 | 
104 |     # ----- Added -----
105 |     opt = hvd.DistributedOptimizer(opt)   
106 |         
107 |     model.compile(loss='categorical_crossentropy',
108 |                   optimizer=opt,
109 |                   metrics=['accuracy'])
110 |     return model
111 | 
112 | 
113 | def get_filenames(channel_name, channel):
114 |     if channel_name in ['train', 'validation', 'eval']:
115 |         return [os.path.join(channel, channel_name + '.tfrecords')]
116 |     else:
117 |         raise ValueError('Invalid data subset "%s"' % channel_name)
118 | 
119 | 
120 | def train_input_fn():
121 |     return _input(args.epochs, args.batch_size, args.train, 'train')
122 | 
123 | 
124 | def eval_input_fn():
125 |     return _input(args.epochs, args.batch_size, args.eval, 'eval')
126 | 
127 | 
128 | def validation_input_fn():
129 |     return _input(args.epochs, args.batch_size, args.validation, 'validation')
130 | 
131 | 
132 | def _input(epochs, batch_size, channel, channel_name):
133 | 
134 |     filenames = get_filenames(channel_name, channel)
135 |     dataset = tf.data.TFRecordDataset(filenames)
136 | 
137 |     dataset = dataset.repeat(epochs)
138 |     dataset = dataset.prefetch(10)
139 | 
140 |     # Parse records.
141 |     dataset = dataset.map(
142 |         _dataset_parser, num_parallel_calls=10)
143 | 
144 |     # Potentially shuffle records.
145 |     if channel_name == 'train':
146 |         # Ensure that the capacity is sufficiently large to provide good random
147 |         # shuffling.
148 |         buffer_size = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 0.4) + 3 * batch_size
149 |         dataset = dataset.shuffle(buffer_size=buffer_size)
150 | 
151 |     # Batch it up.
152 |     dataset = dataset.batch(batch_size, drop_remainder=True)
153 |     iterator = dataset.make_one_shot_iterator()
154 |     image_batch, label_batch = iterator.get_next()
155 | 
156 |     return {INPUT_TENSOR_NAME: image_batch}, label_batch
157 | 
158 | 
159 | def _train_preprocess_fn(image):
160 |     """Preprocess a single training image of layout [height, width, depth]."""
161 |     # Resize the image to add four extra pixels on each side.
162 |     image = tf.image.resize_image_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8)
163 | 
164 |     # Randomly crop a [HEIGHT, WIDTH] section of the image.
165 |     image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH])
166 | 
167 |     # Randomly flip the image horizontally.
168 |     image = tf.image.random_flip_left_right(image)
169 | 
170 |     return image
171 | 
172 | 
173 | def _dataset_parser(value):
174 |     """Parse a CIFAR-10 record from value."""
175 |     featdef = {
176 |         'image': tf.FixedLenFeature([], tf.string),
177 |         'label': tf.FixedLenFeature([], tf.int64),
178 |     }
179 | 
180 |     example = tf.parse_single_example(value, featdef)
181 |     image = tf.decode_raw(example['image'], tf.uint8)
182 |     image.set_shape([DEPTH * HEIGHT * WIDTH])
183 | 
184 |     # Reshape from [depth * height * width] to [depth, height, width], then transpose to [height, width, depth].
185 |     image = tf.cast(
186 |         tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]),
187 |         tf.float32)
188 |     label = tf.cast(example['label'], tf.int32)
189 |     image = _train_preprocess_fn(image)
190 |     return image, tf.one_hot(label, NUM_CLASSES)
191 | 
192 | def save_model(model, output):
193 |     signature = tf.saved_model.signature_def_utils.predict_signature_def(
194 |         inputs={'inputs': model.input}, outputs={'scores': model.output})
195 | 
196 |     builder = tf.saved_model.builder.SavedModelBuilder(output+'/1/')
197 |     builder.add_meta_graph_and_variables(
198 |         sess=K.get_session(),
199 |         tags=[tf.saved_model.tag_constants.SERVING],
200 |         signature_def_map={"serving_default": signature})
201 |     builder.save()
202 | 
203 |     logging.info("Model successfully saved at: {}".format(output))
204 |     return
205 | 
206 | def main(args):
207 |     # ----- Added section -----
208 |     import horovod.keras as hvd
209 |     hvd.init()
210 |     config = tf.ConfigProto()
211 |     config.gpu_options.allow_growth = True
212 |     config.gpu_options.visible_device_list = str(hvd.local_rank())
213 |     K.set_session(tf.Session(config=config))
214 | 
215 |     logging.info("getting data")
216 |     train_dataset = train_input_fn()
217 |     eval_dataset = eval_input_fn()
218 |     validation_dataset = validation_input_fn()
219 | 
220 |     logging.info("configuring model")
221 |     # ----- Modified section -----
222 |     model = keras_model_fn(args.learning_rate, args.weight_decay, args.optimizer, args.momentum, hvd)
223 |     callbacks = []
224 | 
225 |     # Checkpoints are written only on rank 0 (below) so that workers do not overwrite each other's files.
226 |     
227 |     # ----- Added section -----
228 |     callbacks.append(hvd.callbacks.BroadcastGlobalVariablesCallback(0))
229 |     callbacks.append(hvd.callbacks.MetricAverageCallback())
230 |     callbacks.append(hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1))
231 |     
232 |     # ----- Added section -----
233 |     if hvd.rank() == 0:
234 |         callbacks.append(ModelCheckpoint(args.model_output_dir + '/checkpoint-{epoch}.h5'))
235 |         callbacks.append(TensorBoard(log_dir=args.model_output_dir, update_freq='epoch'))
236 |         
237 |     logging.info("Starting training")
238 |     
239 |     train_steps = num_examples_per_epoch('train') // args.batch_size
240 |     train_steps = int(train_steps // hvd.size())
241 |     val_steps = num_examples_per_epoch('validation') // args.batch_size
242 |     val_steps = int(val_steps // hvd.size())
243 |     eval_steps = num_examples_per_epoch('eval') // args.batch_size
244 |     eval_steps = int(eval_steps // hvd.size())
245 | 
246 |     model.fit(x=train_dataset[0], y=train_dataset[1],
247 |               steps_per_epoch=train_steps,
248 |               epochs=int(args.epochs // hvd.size()), validation_data=validation_dataset,
249 |               validation_steps=val_steps, callbacks=callbacks)
250 | 
251 |     score = model.evaluate(eval_dataset[0], eval_dataset[1], steps=eval_steps,
252 |                            verbose=0)
253 | 
254 |     logging.info('Test loss:{}'.format(score[0]))
255 |     logging.info('Test accuracy:{}'.format(score[1]))
256 | 
257 |     return save_model(model, args.model_output_dir)
258 | 
259 | def num_examples_per_epoch(subset='train'):
260 |     if subset == 'train':
261 |         return 40000
262 |     elif subset == 'validation':
263 |         return 10000
264 |     elif subset == 'eval':
265 |         return 10000
266 |     else:
267 |         raise ValueError('Invalid data subset "%s"' % subset)
268 | 
269 |         
270 | if __name__ == '__main__':
271 |     parser = argparse.ArgumentParser()
272 | 
273 |     parser.add_argument(
274 |         '--train',
275 |         type=str,
276 |         required=False,
277 |         default=os.environ['SM_CHANNEL_TRAIN'],  # ----- Modified (default path changed) -----
278 |         help='The directory where the CIFAR-10 input data is stored.')    
279 |     parser.add_argument(
280 |         '--validation',
281 |         type=str,
282 |         required=False,
283 |         default=os.environ['SM_CHANNEL_VALIDATION'],  # ----- Modified (default path changed) -----
284 |         help='The directory where the CIFAR-10 input data is stored.')
285 |     parser.add_argument(
286 |         '--eval',
287 |         type=str,
288 |         required=False,
289 |         default=os.environ['SM_CHANNEL_EVAL'],  # ----- Modified (default path changed) -----
290 |         help='The directory where the CIFAR-10 input data is stored.')
291 |     
292 |     # ----- Modified (argument added) -----
293 |     parser.add_argument(
294 |         '--model_output_dir',
295 |         type=str,
296 |         default=os.environ.get('SM_MODEL_DIR'))  
297 |     
298 |     parser.add_argument(
299 |         '--model_dir',
300 |         type=str,
301 |         required=True,
302 |         help='The directory where the model will be stored.')
303 |     
304 |     parser.add_argument(
305 |         '--weight-decay',
306 |         type=float,
307 |         default=2e-4,
308 |         help='Passed to the Keras optimizer as `decay` (learning-rate decay applied per update).')
309 |     parser.add_argument(
310 |         '--learning-rate',
311 |         type=float,
312 |         default=0.001,
313 |         help="""\
314 |         This is the initial learning rate value. The learning rate will decrease
315 |         during training. For more details check the keras_model_fn implementation in
316 |         this file.\
317 |         """)
318 |     parser.add_argument(
319 |         '--epochs',
320 |         type=int,
321 |         default=10,
322 |         help='The number of epochs to use for training.')
323 |     parser.add_argument(
324 |         '--batch-size',
325 |         type=int,
326 |         default=256,  # ----- Modified (default batch size changed) -----
327 |         help='Batch size for training.')
328 |     parser.add_argument(
329 |         '--optimizer',
330 |         type=str,
331 |         default='adam')
332 |     parser.add_argument(
333 |         '--momentum',
334 |         type=float,
335 |         default='0.9')
336 |     args = parser.parse_args()
337 |     main(args)
338 | 
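
The step arithmetic in `main` above (integer-divide the number of examples by the batch size, then by `hvd.size()`) can be sketched without Horovod or TensorFlow. This is only an illustration: `steps_per_worker` and its parameter names are hypothetical, and `num_workers` stands in for `hvd.size()`.

```python
def steps_per_worker(num_examples, batch_size, num_workers):
    """Batches each worker runs per epoch when an epoch's batches are split across workers."""
    if batch_size <= 0 or num_workers <= 0:
        raise ValueError("batch_size and num_workers must be positive")
    # Integer division drops the partial final batch, matching drop_remainder=True.
    global_steps = num_examples // batch_size
    # Each worker takes an equal share; leftover steps are dropped as well.
    return global_steps // num_workers


# 40000 training examples and a batch size of 256, as in the script's defaults.
for workers in (1, 2, 4):
    print(workers, steps_per_worker(40000, 256, workers))
```

Because both divisions round down, a few examples per epoch go unused when the counts do not divide evenly.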


--------------------------------------------------------------------------------
/training_script/cifar10_keras_pipe_solution.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
  2 | #
  3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
  4 | # software and associated documentation files (the "Software"), to deal in the Software
  5 | # without restriction, including without limitation the rights to use, copy, modify,
  6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
  7 | # permit persons to whom the Software is furnished to do so.
  8 | #
  9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 15 | 
 16 | from __future__ import absolute_import
 17 | from __future__ import division
 18 | from __future__ import print_function
 19 | 
 20 | import argparse
 21 | import logging
 22 | import os
 23 | 
 24 | from keras.callbacks import ModelCheckpoint
 25 | from keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D, BatchNormalization
 26 | from keras.models import Sequential
 27 | from keras.optimizers import Adam, SGD, RMSprop
 28 | import tensorflow as tf
 29 | from keras import backend as K
 30 | 
 31 | # ----- Added section -----
 32 | from sagemaker_tensorflow import PipeModeDataset
 33 | 
 34 | sess = tf.Session()
 35 | K.set_session(sess)
 36 | 
 37 | logging.getLogger().setLevel(logging.INFO)
 38 | tf.logging.set_verbosity(tf.logging.INFO)
 39 | HEIGHT = 32
 40 | WIDTH = 32
 41 | DEPTH = 3
 42 | NUM_CLASSES = 10
 43 | NUM_DATA_BATCHES = 5
 44 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES
 45 | INPUT_TENSOR_NAME = 'inputs_input'  # needs to match the name of the first layer + "_input"
 46 | 
 47 | def keras_model_fn(learning_rate, weight_decay, optimizer, momentum):
 48 |     """Build and compile the Keras CNN from the hyperparameters passed to the training job.
 49 |     The compiled model is trained directly with model.fit() in main() and exported as a
 50 |     TensorFlow Serving SavedModel by save_model() at the end of training.
 51 | 
 52 |     Args:
 53 |         learning_rate, weight_decay, optimizer, momentum: Hyperparameters passed to the
 54 |             SageMaker training job that runs this script.
 55 |     Returns: A compiled Keras model
 56 |     """
 57 |     model = Sequential()
 58 |     model.add(Conv2D(32, (3, 3), padding='same', name='inputs', input_shape=(HEIGHT, WIDTH, DEPTH)))
 59 |     model.add(BatchNormalization())
 60 |     model.add(Activation('relu'))
 61 |     model.add(Conv2D(32, (3, 3)))
 62 |     model.add(BatchNormalization())
 63 |     model.add(Activation('relu'))
 64 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 65 |     model.add(Dropout(0.2))
 66 | 
 67 |     model.add(Conv2D(64, (3, 3), padding='same'))
 68 |     model.add(BatchNormalization())
 69 |     model.add(Activation('relu'))
 70 |     model.add(Conv2D(64, (3, 3)))
 71 |     model.add(BatchNormalization())
 72 |     model.add(Activation('relu'))
 73 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 74 |     model.add(Dropout(0.3))
 75 | 
 76 |     model.add(Conv2D(128, (3, 3), padding='same'))
 77 |     model.add(BatchNormalization())
 78 |     model.add(Activation('relu'))
 79 |     model.add(Conv2D(128, (3, 3)))
 80 |     model.add(BatchNormalization())
 81 |     model.add(Activation('relu'))
 82 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 83 |     model.add(Dropout(0.4))
 84 | 
 85 |     model.add(Flatten())
 86 |     model.add(Dense(512))
 87 |     model.add(Activation('relu'))
 88 |     model.add(Dropout(0.5))
 89 |     model.add(Dense(NUM_CLASSES))
 90 |     model.add(Activation('softmax'))
 91 | 
 92 |     size = 1
 93 | 
 94 |     if optimizer.lower() == 'sgd':
 95 |         opt = SGD(lr=learning_rate * size, decay=weight_decay, momentum=momentum)
 96 |     elif optimizer.lower() == 'rmsprop':
 97 |         opt = RMSprop(lr=learning_rate * size, decay=weight_decay)
 98 |     else:
 99 |         opt = Adam(lr=learning_rate * size, decay=weight_decay)
100 | 
101 |     model.compile(loss='categorical_crossentropy',
102 |                   optimizer=opt,
103 |                   metrics=['accuracy'])
104 |     return model
105 | 
106 | 
107 | def get_filenames(channel_name, channel):
108 |     if channel_name in ['train', 'validation', 'eval']:
109 |         return [os.path.join(channel, channel_name + '.tfrecords')]
110 |     else:
111 |         raise ValueError('Invalid data subset "%s"' % channel_name)
112 | 
113 | 
114 | def train_input_fn():
115 |     return _input(args.epochs, args.batch_size, args.train, 'train')
116 | 
117 | 
118 | def eval_input_fn():
119 |     return _input(args.epochs, args.batch_size, args.eval, 'eval')
120 | 
121 | 
122 | def validation_input_fn():
123 |     return _input(args.epochs, args.batch_size, args.validation, 'validation')
124 | 
125 | 
126 | def _input(epochs, batch_size, channel, channel_name):
127 | 
128 |     filenames = get_filenames(channel_name, channel)
129 |     # ----- Added section (PipeModeDataset) -----
130 |     #dataset = tf.data.TFRecordDataset(filenames)
131 |     dataset = PipeModeDataset(channel=channel_name, record_format='TFRecord')    
132 | 
133 |     dataset = dataset.repeat(epochs)
134 |     dataset = dataset.prefetch(10)
135 | 
136 |     # Parse records.
137 |     dataset = dataset.map(
138 |         _dataset_parser, num_parallel_calls=10)
139 | 
140 |     # Potentially shuffle records.
141 |     if channel_name == 'train':
142 |         # Ensure that the capacity is sufficiently large to provide good random
143 |         # shuffling.
144 |         buffer_size = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 0.4) + 3 * batch_size
145 |         dataset = dataset.shuffle(buffer_size=buffer_size)
146 | 
147 |     # Batch it up.
148 |     dataset = dataset.batch(batch_size, drop_remainder=True)
149 |     iterator = dataset.make_one_shot_iterator()
150 |     image_batch, label_batch = iterator.get_next()
151 | 
152 |     return {INPUT_TENSOR_NAME: image_batch}, label_batch
153 | 
154 | 
155 | def _train_preprocess_fn(image):
156 |     """Preprocess a single training image of layout [height, width, depth]."""
157 |     # Resize the image to add four extra pixels on each side.
158 |     image = tf.image.resize_image_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8)
159 | 
160 |     # Randomly crop a [HEIGHT, WIDTH] section of the image.
161 |     image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH])
162 | 
163 |     # Randomly flip the image horizontally.
164 |     image = tf.image.random_flip_left_right(image)
165 | 
166 |     return image
167 | 
168 | 
169 | def _dataset_parser(value):
170 |     """Parse a CIFAR-10 record from value."""
171 |     featdef = {
172 |         'image': tf.FixedLenFeature([], tf.string),
173 |         'label': tf.FixedLenFeature([], tf.int64),
174 |     }
175 | 
176 |     example = tf.parse_single_example(value, featdef)
177 |     image = tf.decode_raw(example['image'], tf.uint8)
178 |     image.set_shape([DEPTH * HEIGHT * WIDTH])
179 | 
180 |     # Reshape from [depth * height * width] to [depth, height, width], then transpose to [height, width, depth].
181 |     image = tf.cast(
182 |         tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]),
183 |         tf.float32)
184 |     label = tf.cast(example['label'], tf.int32)
185 |     image = _train_preprocess_fn(image)
186 |     return image, tf.one_hot(label, NUM_CLASSES)
187 | 
188 | def save_model(model, output):
189 |     signature = tf.saved_model.signature_def_utils.predict_signature_def(
190 |         inputs={'inputs': model.input}, outputs={'scores': model.output})
191 | 
192 |     builder = tf.saved_model.builder.SavedModelBuilder(output+'/1/')
193 |     builder.add_meta_graph_and_variables(
194 |         sess=K.get_session(),
195 |         tags=[tf.saved_model.tag_constants.SERVING],
196 |         signature_def_map={"serving_default": signature})
197 |     builder.save()
198 | 
199 |     logging.info("Model successfully saved at: {}".format(output))
200 |     return
201 | 
202 | def main(args):
203 |     logging.info("getting data")
204 |     train_dataset = train_input_fn()
205 |     eval_dataset = eval_input_fn()
206 |     validation_dataset = validation_input_fn()
207 | 
208 |     logging.info("configuring model")
209 |     model = keras_model_fn(args.learning_rate, args.weight_decay, args.optimizer, args.momentum)
210 |     callbacks = []
211 |         
212 |     # ----- Modified (path changed) -----
213 |     callbacks.append(ModelCheckpoint(args.model_output_dir + '/checkpoint-{epoch}.h5'))
214 | 
215 |     logging.info("Starting training")
216 |     model.fit(x=train_dataset[0], y=train_dataset[1],
217 |               steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size),
218 |               epochs=args.epochs, validation_data=validation_dataset,
219 |               validation_steps=(num_examples_per_epoch('validation') // args.batch_size), callbacks=callbacks)
220 | 
221 |     score = model.evaluate(eval_dataset[0], eval_dataset[1], steps=num_examples_per_epoch('eval') // args.batch_size,
222 |                            verbose=0)
223 | 
224 |     logging.info('Test loss:{}'.format(score[0]))
225 |     logging.info('Test accuracy:{}'.format(score[1]))
226 | 
227 |     # ----- Modified (path changed) -----
228 |     return save_model(model, args.model_output_dir)
229 | 
230 | def num_examples_per_epoch(subset='train'):
231 |     if subset == 'train':
232 |         return 40000
233 |     elif subset == 'validation':
234 |         return 10000
235 |     elif subset == 'eval':
236 |         return 10000
237 |     else:
238 |         raise ValueError('Invalid data subset "%s"' % subset)
239 | 
240 |         
241 | if __name__ == '__main__':
242 |     parser = argparse.ArgumentParser()
243 | 
244 |     parser.add_argument(
245 |         '--train',
246 |         type=str,
247 |         required=False,
248 |         default=os.environ['SM_CHANNEL_TRAIN'],  # ----- Modified (default path changed) -----
249 |         help='The directory where the CIFAR-10 input data is stored.')    
250 |     parser.add_argument(
251 |         '--validation',
252 |         type=str,
253 |         required=False,
254 |         default=os.environ['SM_CHANNEL_VALIDATION'],  # ----- Modified (default path changed) -----
255 |         help='The directory where the CIFAR-10 input data is stored.')
256 |     parser.add_argument(
257 |         '--eval',
258 |         type=str,
259 |         required=False,
260 |         default=os.environ['SM_CHANNEL_EVAL'],  # ----- Modified (default path changed) -----
261 |         help='The directory where the CIFAR-10 input data is stored.')
262 |     
263 |     # ----- Modified (argument added) -----
264 |     parser.add_argument(
265 |         '--model_output_dir',
266 |         type=str,
267 |         default=os.environ.get('SM_MODEL_DIR'))  
268 |     
269 |     parser.add_argument(
270 |         '--model_dir',
271 |         type=str,
272 |         required=True,
273 |         help='The directory where the model will be stored.')
274 |     
275 |     parser.add_argument(
276 |         '--weight-decay',
277 |         type=float,
278 |         default=2e-4,
279 |         help='Passed to the Keras optimizer as `decay` (learning-rate decay applied per update).')
280 |     parser.add_argument(
281 |         '--learning-rate',
282 |         type=float,
283 |         default=0.001,
284 |         help="""\
285 |         This is the initial learning rate value. The learning rate will decrease
286 |         during training. For more details check the keras_model_fn implementation in
287 |         this file.\
288 |         """)
289 |     parser.add_argument(
290 |         '--epochs',
291 |         type=int,
292 |         default=10,
293 |         help='The number of epochs to use for training.')
294 |     parser.add_argument(
295 |         '--batch-size',
296 |         type=int,
297 |         default=128,
298 |         help='Batch size for training.')
299 |     parser.add_argument(
300 |         '--optimizer',
301 |         type=str,
302 |         default='adam')
303 |     parser.add_argument(
304 |         '--momentum',
305 |         type=float,
306 |         default='0.9')
307 |     args = parser.parse_args()
308 |     main(args)
309 | 
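
The argument parsing at the end of the script relies on a few `argparse` behaviours that are easy to miss: a hyphenated flag such as `--batch-size` is exposed as `args.batch_size`, and a string default such as `'0.9'` is converted through `type` before it is stored. A minimal standard-library sketch follows; `build_parser` and the example channel path are illustrative assumptions, not part of the training script.

```python
import argparse
import os


def build_parser(environ=os.environ):
    """Build a reduced parser that mirrors a few of the script's arguments."""
    parser = argparse.ArgumentParser()
    # environ.get returns None instead of raising KeyError when the variable is unset.
    parser.add_argument('--train', type=str, default=environ.get('SM_CHANNEL_TRAIN'))
    # The hyphen becomes an underscore in the attribute name: args.batch_size.
    parser.add_argument('--batch-size', type=int, default=128)
    # A string default is run through `type`, so '0.9' is stored as the float 0.9.
    parser.add_argument('--momentum', type=float, default='0.9')
    return parser


# Hypothetical environment standing in for the variables SageMaker sets in the container.
fake_env = {'SM_CHANNEL_TRAIN': '/opt/ml/input/data/train'}
args = build_parser(fake_env).parse_args(['--batch-size', '256'])
print(args.train, args.batch_size, args.momentum)
```

Using `environ.get` lets the parser be built outside a training container, at the cost of a `None` default that the caller has to handle.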


--------------------------------------------------------------------------------
/training_script/cifar10_keras_sm_solution.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
  2 | #
  3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
  4 | # software and associated documentation files (the "Software"), to deal in the Software
  5 | # without restriction, including without limitation the rights to use, copy, modify,
  6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
  7 | # permit persons to whom the Software is furnished to do so.
  8 | #
  9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 15 | 
 16 | from __future__ import absolute_import
 17 | from __future__ import division
 18 | from __future__ import print_function
 19 | 
 20 | import argparse
 21 | import logging
 22 | import os
 23 | 
 24 | from keras.callbacks import ModelCheckpoint
 25 | from keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D, BatchNormalization
 26 | from keras.models import Sequential
 27 | from keras.optimizers import Adam, SGD, RMSprop
 28 | import tensorflow as tf
 29 | from keras import backend as K
 30 | 
 31 | sess = tf.Session()
 32 | K.set_session(sess)
 33 | 
 34 | logging.getLogger().setLevel(logging.INFO)
 35 | tf.logging.set_verbosity(tf.logging.INFO)
 36 | HEIGHT = 32
 37 | WIDTH = 32
 38 | DEPTH = 3
 39 | NUM_CLASSES = 10
 40 | NUM_DATA_BATCHES = 5
 41 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES
 42 | INPUT_TENSOR_NAME = 'inputs_input'  # needs to match the name of the first layer + "_input"
 43 | 
 44 | def keras_model_fn(learning_rate, weight_decay, optimizer, momentum):
 45 |     """Build and compile the Keras CNN from the hyperparameters passed to the training job.
 46 |     The compiled model is trained directly with model.fit() in main() and exported as a
 47 |     TensorFlow Serving SavedModel by save_model() at the end of training.
 48 | 
 49 |     Args:
 50 |         learning_rate, weight_decay, optimizer, momentum: Hyperparameters passed to the
 51 |             SageMaker training job that runs this script.
 52 |     Returns: A compiled Keras model
 53 |     """
 54 |     model = Sequential()
 55 |     model.add(Conv2D(32, (3, 3), padding='same', name='inputs', input_shape=(HEIGHT, WIDTH, DEPTH)))
 56 |     model.add(BatchNormalization())
 57 |     model.add(Activation('relu'))
 58 |     model.add(Conv2D(32, (3, 3)))
 59 |     model.add(BatchNormalization())
 60 |     model.add(Activation('relu'))
 61 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 62 |     model.add(Dropout(0.2))
 63 | 
 64 |     model.add(Conv2D(64, (3, 3), padding='same'))
 65 |     model.add(BatchNormalization())
 66 |     model.add(Activation('relu'))
 67 |     model.add(Conv2D(64, (3, 3)))
 68 |     model.add(BatchNormalization())
 69 |     model.add(Activation('relu'))
 70 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 71 |     model.add(Dropout(0.3))
 72 | 
 73 |     model.add(Conv2D(128, (3, 3), padding='same'))
 74 |     model.add(BatchNormalization())
 75 |     model.add(Activation('relu'))
 76 |     model.add(Conv2D(128, (3, 3)))
 77 |     model.add(BatchNormalization())
 78 |     model.add(Activation('relu'))
 79 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 80 |     model.add(Dropout(0.4))
 81 | 
 82 |     model.add(Flatten())
 83 |     model.add(Dense(512))
 84 |     model.add(Activation('relu'))
 85 |     model.add(Dropout(0.5))
 86 |     model.add(Dense(NUM_CLASSES))
 87 |     model.add(Activation('softmax'))
 88 | 
 89 |     size = 1
 90 | 
 91 |     if optimizer.lower() == 'sgd':
 92 |         opt = SGD(lr=learning_rate * size, decay=weight_decay, momentum=momentum)
 93 |     elif optimizer.lower() == 'rmsprop':
 94 |         opt = RMSprop(lr=learning_rate * size, decay=weight_decay)
 95 |     else:
 96 |         opt = Adam(lr=learning_rate * size, decay=weight_decay)
 97 | 
 98 |     model.compile(loss='categorical_crossentropy',
 99 |                   optimizer=opt,
100 |                   metrics=['accuracy'])
101 |     return model
102 | 
103 | 
104 | def get_filenames(channel_name, channel):
105 |     if channel_name in ['train', 'validation', 'eval']:
106 |         return [os.path.join(channel, channel_name + '.tfrecords')]
107 |     else:
108 |         raise ValueError('Invalid data subset "%s"' % channel_name)
109 | 
110 | 
111 | def train_input_fn():
112 |     return _input(args.epochs, args.batch_size, args.train, 'train')
113 | 
114 | 
115 | def eval_input_fn():
116 |     return _input(args.epochs, args.batch_size, args.eval, 'eval')
117 | 
118 | 
119 | def validation_input_fn():
120 |     return _input(args.epochs, args.batch_size, args.validation, 'validation')
121 | 
122 | 
123 | def _input(epochs, batch_size, channel, channel_name):
124 | 
125 |     filenames = get_filenames(channel_name, channel)
126 |     dataset = tf.data.TFRecordDataset(filenames)
127 | 
128 |     dataset = dataset.repeat(epochs)
129 |     dataset = dataset.prefetch(10)
130 | 
131 |     # Parse records.
132 |     dataset = dataset.map(
133 |         _dataset_parser, num_parallel_calls=10)
134 | 
135 |     # Potentially shuffle records.
136 |     if channel_name == 'train':
137 |         # Ensure that the capacity is sufficiently large to provide good random
138 |         # shuffling.
139 |         buffer_size = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 0.4) + 3 * batch_size
140 |         dataset = dataset.shuffle(buffer_size=buffer_size)
141 | 
142 |     # Batch it up.
143 |     dataset = dataset.batch(batch_size, drop_remainder=True)
144 |     iterator = dataset.make_one_shot_iterator()
145 |     image_batch, label_batch = iterator.get_next()
146 | 
147 |     return {INPUT_TENSOR_NAME: image_batch}, label_batch
148 | 
149 | 
150 | def _train_preprocess_fn(image):
151 |     """Preprocess a single training image of layout [height, width, depth]."""
152 |     # Resize the image to add four extra pixels on each side.
153 |     image = tf.image.resize_image_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8)
154 | 
155 |     # Randomly crop a [HEIGHT, WIDTH] section of the image.
156 |     image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH])
157 | 
158 |     # Randomly flip the image horizontally.
159 |     image = tf.image.random_flip_left_right(image)
160 | 
161 |     return image
162 | 
163 | 
164 | def _dataset_parser(value):
165 |     """Parse a CIFAR-10 record from value."""
166 |     featdef = {
167 |         'image': tf.FixedLenFeature([], tf.string),
168 |         'label': tf.FixedLenFeature([], tf.int64),
169 |     }
170 | 
171 |     example = tf.parse_single_example(value, featdef)
172 |     image = tf.decode_raw(example['image'], tf.uint8)
173 |     image.set_shape([DEPTH * HEIGHT * WIDTH])
174 | 
175 |     # Reshape from [depth * height * width] to [depth, height, width], then transpose to [height, width, depth].
176 |     image = tf.cast(
177 |         tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]),
178 |         tf.float32)
179 |     label = tf.cast(example['label'], tf.int32)
180 |     image = _train_preprocess_fn(image)
181 |     return image, tf.one_hot(label, NUM_CLASSES)
182 | 
183 | def save_model(model, output):
184 |     signature = tf.saved_model.signature_def_utils.predict_signature_def(
185 |         inputs={'inputs': model.input}, outputs={'scores': model.output})
186 | 
187 |     builder = tf.saved_model.builder.SavedModelBuilder(output+'/1/')
188 |     builder.add_meta_graph_and_variables(
189 |         sess=K.get_session(),
190 |         tags=[tf.saved_model.tag_constants.SERVING],
191 |         signature_def_map={"serving_default": signature})
192 |     builder.save()
193 | 
194 |     logging.info("Model successfully saved at: {}".format(output))
195 |     return
196 | 
197 | def main(args):
198 |     logging.info("getting data")
199 |     train_dataset = train_input_fn()
200 |     eval_dataset = eval_input_fn()
201 |     validation_dataset = validation_input_fn()
202 | 
203 |     logging.info("configuring model")
204 |     model = keras_model_fn(args.learning_rate, args.weight_decay, args.optimizer, args.momentum)
205 |     callbacks = []
206 |         
207 |     # ----- Modified (path changed) -----
208 |     callbacks.append(ModelCheckpoint(args.model_output_dir + '/checkpoint-{epoch}.h5'))
209 | 
210 |     logging.info("Starting training")
211 |     model.fit(x=train_dataset[0], y=train_dataset[1],
212 |               steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size),
213 |               epochs=args.epochs, validation_data=validation_dataset,
214 |               validation_steps=(num_examples_per_epoch('validation') // args.batch_size), callbacks=callbacks)
215 | 
216 |     score = model.evaluate(eval_dataset[0], eval_dataset[1], steps=num_examples_per_epoch('eval') // args.batch_size,
217 |                            verbose=0)
218 | 
219 |     logging.info('Test loss:{}'.format(score[0]))
220 |     logging.info('Test accuracy:{}'.format(score[1]))
221 | 
222 |     # ----- Modified (path changed) -----
223 |     return save_model(model, args.model_output_dir)
224 | 
225 | def num_examples_per_epoch(subset='train'):
226 |     if subset == 'train':
227 |         return 40000
228 |     elif subset == 'validation':
229 |         return 10000
230 |     elif subset == 'eval':
231 |         return 10000
232 |     else:
233 |         raise ValueError('Invalid data subset "%s"' % subset)
234 | 
235 |         
236 | if __name__ == '__main__':
237 |     parser = argparse.ArgumentParser()
238 | 
239 |     parser.add_argument(
240 |         '--train',
241 |         type=str,
242 |         required=False,
243 |         default=os.environ['SM_CHANNEL_TRAIN'],  # ----- Modified (default path changed) -----
244 |         help='The directory where the CIFAR-10 input data is stored.')    
245 |     parser.add_argument(
246 |         '--validation',
247 |         type=str,
248 |         required=False,
249 |         default=os.environ['SM_CHANNEL_VALIDATION'],  # ----- Modified (default path changed) -----
250 |         help='The directory where the CIFAR-10 input data is stored.')
251 |     parser.add_argument(
252 |         '--eval',
253 |         type=str,
254 |         required=False,
255 |         default=os.environ['SM_CHANNEL_EVAL'],  # ----- Modified (default path changed) -----
256 |         help='The directory where the CIFAR-10 input data is stored.')
257 |     
258 |     # ----- Modified (argument added) -----
259 |     parser.add_argument(
260 |         '--model_output_dir',
261 |         type=str,
262 |         default=os.environ.get('SM_MODEL_DIR'))  
263 |     
264 |     parser.add_argument(
265 |         '--model_dir',
266 |         type=str,
267 |         required=True,
268 |         help='The directory where the model will be stored.')
269 |     
270 |     parser.add_argument(
271 |         '--weight-decay',
272 |         type=float,
273 |         default=2e-4,
274 |         help='Passed to the Keras optimizer as `decay` (learning-rate decay applied per update).')
275 |     parser.add_argument(
276 |         '--learning-rate',
277 |         type=float,
278 |         default=0.001,
279 |         help="""\
280 |         This is the initial learning rate value. The learning rate will decrease
281 |         during training. For more details check the keras_model_fn implementation in
282 |         this file.\
283 |         """)
284 |     parser.add_argument(
285 |         '--epochs',
286 |         type=int,
287 |         default=10,
288 |         help='The number of epochs to use for training.')
289 |     parser.add_argument(
290 |         '--batch-size',
291 |         type=int,
292 |         default=128,
293 |         help='Batch size for training.')
294 |     parser.add_argument(
295 |         '--optimizer',
296 |         type=str,
297 |         default='adam')
298 |     parser.add_argument(
299 |         '--momentum',
300 |         type=float,
301 |         default='0.9')
302 |     args = parser.parse_args()
303 |     main(args)
304 | 
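
`_dataset_parser` above decodes each record as a flat, channel-major byte buffer, reshapes it to `[DEPTH, HEIGHT, WIDTH]`, and transposes it to `[HEIGHT, WIDTH, DEPTH]`. The index mapping behind that reshape and transpose can be sketched in plain Python. The helper name `chw_flat_to_hwc` is hypothetical, and the sketch uses nested lists rather than tensors.

```python
def chw_flat_to_hwc(flat, depth, height, width):
    """Convert a flat channel-major buffer into nested [height][width][depth] lists."""
    if len(flat) != depth * height * width:
        raise ValueError("buffer length does not match depth * height * width")
    # Element (c, h, w) of the channel-major layout sits at c*H*W + h*W + w.
    return [
        [
            [flat[c * height * width + h * width + w] for c in range(depth)]
            for w in range(width)
        ]
        for h in range(height)
    ]


# Tiny 2-channel, 2x2 "image": channel 0 holds 0..3, channel 1 holds 10..13.
flat = [0, 1, 2, 3, 10, 11, 12, 13]
image = chw_flat_to_hwc(flat, depth=2, height=2, width=2)
print(image[0][1])  # pixel at row 0, column 1 -> [1, 11]
```

After the conversion, each pixel holds its channel values together, which is the layout the `Conv2D` input shape `(HEIGHT, WIDTH, DEPTH)` expects.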


--------------------------------------------------------------------------------
/training_script/cifar10_keras_sm_tf2.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
  2 | #
  3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
  4 | # software and associated documentation files (the "Software"), to deal in the Software
  5 | # without restriction, including without limitation the rights to use, copy, modify,
  6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
  7 | # permit persons to whom the Software is furnished to do so.
  8 | #
  9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 15 | 
 16 | from __future__ import absolute_import
 17 | from __future__ import division
 18 | from __future__ import print_function
 19 | 
 20 | import argparse
 21 | import logging
 22 | import os
 23 | 
 24 | import tensorflow as tf
 25 | from tensorflow.keras.callbacks import ModelCheckpoint
 26 | from tensorflow.keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D, BatchNormalization
 27 | from tensorflow.keras.models import Sequential
 28 | from tensorflow.keras.optimizers import Adam, SGD, RMSprop
 29 | tf.get_logger().setLevel('INFO')
 30 | #tf.autograph.set_verbosity(1)
 31 | 
 32 | HEIGHT = 32
 33 | WIDTH = 32
 34 | DEPTH = 3
 35 | NUM_CLASSES = 10
 36 | NUM_DATA_BATCHES = 5
 37 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES
 38 | INPUT_TENSOR_NAME = 'inputs_input'  # needs to match the name of the first layer + "_input"
 39 | 
 40 | def keras_model_fn(learning_rate, weight_decay, optimizer, momentum):
 41 |     """Build and compile the Keras CNN from the hyperparameters of the training job.
 42 |     The model is trained directly with model.fit() in main() and exported as a
 43 |     TensorFlow SavedModel at the end of training (see save_model).
 44 | 
 45 |     Args:
 46 |         learning_rate, weight_decay, optimizer, momentum: Hyperparameters passed to the
 47 |             SageMaker TrainingJob that runs this training script.
 48 |     Returns: A compiled Keras model
 49 |     """
 50 |     model = Sequential()
 51 |     model.add(Conv2D(32, (3, 3), padding='same', name='inputs', input_shape=(HEIGHT, WIDTH, DEPTH)))
 52 |     model.add(BatchNormalization())
 53 |     model.add(Activation('relu'))
 54 |     model.add(Conv2D(32, (3, 3)))
 55 |     model.add(BatchNormalization())
 56 |     model.add(Activation('relu'))
 57 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 58 |     model.add(Dropout(0.2))
 59 | 
 60 |     model.add(Conv2D(64, (3, 3), padding='same'))
 61 |     model.add(BatchNormalization())
 62 |     model.add(Activation('relu'))
 63 |     model.add(Conv2D(64, (3, 3)))
 64 |     model.add(BatchNormalization())
 65 |     model.add(Activation('relu'))
 66 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 67 |     model.add(Dropout(0.3))
 68 | 
 69 |     model.add(Conv2D(128, (3, 3), padding='same'))
 70 |     model.add(BatchNormalization())
 71 |     model.add(Activation('relu'))
 72 |     model.add(Conv2D(128, (3, 3)))
 73 |     model.add(BatchNormalization())
 74 |     model.add(Activation('relu'))
 75 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 76 |     model.add(Dropout(0.4))
 77 | 
 78 |     model.add(Flatten())
 79 |     model.add(Dense(512))
 80 |     model.add(Activation('relu'))
 81 |     model.add(Dropout(0.5))
 82 |     model.add(Dense(NUM_CLASSES))
 83 |     model.add(Activation('softmax'))
 84 | 
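 85 |     size = 1  # learning-rate scale factor; fixed at 1 in this script
 86 |     # Note: `decay` below is the optimizer's learning-rate decay, not L2 weight decay.
 87 |     if optimizer.lower() == 'sgd':
 88 |         opt = SGD(learning_rate=learning_rate * size, decay=weight_decay, momentum=momentum)
 89 |     elif optimizer.lower() == 'rmsprop':
 90 |         opt = RMSprop(learning_rate=learning_rate * size, decay=weight_decay)
 91 |     else:
 92 |         opt = Adam(learning_rate=learning_rate * size, decay=weight_decay)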
 93 | 
 94 |     model.compile(loss='categorical_crossentropy',
 95 |                   optimizer=opt,
 96 |                   metrics=['accuracy'])
 97 |     return model
 98 | 
 99 | 
100 | def get_filenames(channel_name, channel):
101 |     if channel_name in ['train', 'validation', 'eval']:
102 |         return [os.path.join(channel, channel_name + '.tfrecords')]
103 |     else:
104 |         raise ValueError('Invalid data subset "%s"' % channel_name)
105 | 
106 | 
107 | def train_input_fn():
108 |     return _input(args.epochs, args.batch_size, args.train, 'train')
109 | 
110 | 
111 | def eval_input_fn():
112 |     return _input(args.epochs, args.batch_size, args.eval, 'eval')
113 | 
114 | 
115 | def validation_input_fn():
116 |     return _input(args.epochs, args.batch_size, args.validation, 'validation')
117 | 
118 | 
119 | def _input(epochs, batch_size, channel, channel_name):
120 | 
121 |     filenames = get_filenames(channel_name, channel)
122 |     dataset = tf.data.TFRecordDataset(filenames)
123 |     #dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=3)
124 | 
125 |     # Parse records.
126 |     dataset = dataset.map(_dataset_parser, num_parallel_calls=10)
127 | 
128 |     dataset = dataset.repeat()    
129 |     
130 |     # Potentially shuffle records.
131 |     if channel_name == 'train':
132 |         # Ensure that the capacity is sufficiently large to provide good random
133 |         # shuffling.
134 |         buffer_size = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 0.4) + 3 * batch_size
135 |         dataset = dataset.shuffle(buffer_size=buffer_size)
136 |  
137 |     # Batch it up.
138 |     dataset = dataset.batch(batch_size, drop_remainder=True)
139 |     dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
140 | 
141 |     return dataset
142 | 
143 | 
144 | def _train_preprocess_fn(image):
145 |     """Preprocess a single training image of layout [height, width, depth]."""
146 |     # Resize the image to add four extra pixels on each side.
147 |     image = tf.image.resize_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8)
148 | 
149 |     # Randomly crop a [HEIGHT, WIDTH] section of the image.
150 |     image = tf.image.random_crop(image, [HEIGHT, WIDTH, DEPTH])
151 | 
152 |     # Randomly flip the image horizontally.
153 |     image = tf.image.random_flip_left_right(image)
154 | 
155 |     return image
156 | 
157 | 
158 | def _dataset_parser(value):
159 |     """Parse a CIFAR-10 record from value."""
160 |     featdef = {
161 |         'image': tf.io.FixedLenFeature([], tf.string),
162 |         'label': tf.io.FixedLenFeature([], tf.int64),
163 |     }
164 | 
165 |     example = tf.io.parse_single_example(value, featdef)
166 |     image = tf.io.decode_raw(example['image'], tf.uint8)
167 |     image.set_shape([DEPTH * HEIGHT * WIDTH])
168 | 
169 |     # Reshape from [depth * height * width] to [depth, height, width], then transpose to [height, width, depth].
170 |     image = tf.cast(
171 |         tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]),
172 |         tf.float32)
173 |     label = tf.cast(example['label'], tf.int32)
174 |     image = _train_preprocess_fn(image)
175 |     return image, tf.one_hot(label, NUM_CLASSES)
176 | 
177 | def save_model(model, output):
178 |     tf.saved_model.save(model, output+'/1/')
179 |     logging.info("Model successfully saved at: {}".format(output))
180 |     return
181 | 
182 | def main(args):
183 |     logging.info("getting data")
184 |     train_dataset = train_input_fn()
185 |     eval_dataset = eval_input_fn()
186 |     validation_dataset = validation_input_fn()
187 | 
188 |     logging.info("configuring model")
189 |     model = keras_model_fn(args.learning_rate, args.weight_decay, args.optimizer, args.momentum)
190 |     callbacks = []
191 | 
192 |     callbacks.append(ModelCheckpoint(args.model_dir + '/checkpoint-{epoch}.h5'))
193 | 
194 |     logging.info("Starting training")
195 |     model.fit(train_dataset,
196 |               steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size),
197 |               epochs=args.epochs, 
198 |               validation_data=validation_dataset,
199 |               validation_steps=(num_examples_per_epoch('validation') // args.batch_size),
200 |               callbacks=callbacks)
201 | 
202 |     score = model.evaluate(eval_dataset, steps=num_examples_per_epoch('eval') // args.batch_size,
203 |                            verbose=0)
204 | 
205 |     logging.info('Test loss:{}'.format(score[0]))
206 |     logging.info('Test accuracy:{}'.format(score[1]))
207 | 
208 |     return save_model(model, args.model_dir)
209 | 
210 | def num_examples_per_epoch(subset='train'):
211 |     if subset == 'train':
212 |         return 40000
213 |     elif subset == 'validation':
214 |         return 10000
215 |     elif subset == 'eval':
216 |         return 10000
217 |     else:
218 |         raise ValueError('Invalid data subset "%s"' % subset)
219 | 
220 |                  
221 | if __name__ == '__main__':
222 |     parser = argparse.ArgumentParser()
223 |     parser.add_argument(
224 |         '--train',
225 |         type=str,
226 |         required=False,
227 |         help='The directory where the CIFAR-10 input data is stored.')
228 |     parser.add_argument(
229 |         '--validation',
230 |         type=str,
231 |         required=False,
232 |         help='The directory where the CIFAR-10 input data is stored.')
233 |     parser.add_argument(
234 |         '--eval',
235 |         type=str,
236 |         required=False,
237 |         help='The directory where the CIFAR-10 input data is stored.')
238 |     parser.add_argument(
239 |         '--model_dir',
240 |         type=str,
241 |         required=True,
242 |         help='The directory where the model will be stored.')
243 |     parser.add_argument(
244 |         '--weight-decay',
245 |         type=float,
246 |         default=2e-4,
247 |         help='Weight decay for convolutions.')
248 |     parser.add_argument(
249 |         '--learning-rate',
250 |         type=float,
251 |         default=0.001,
252 |         help="""\
253 |         This is the initial learning rate value. The learning rate will decrease
254 |         during training. For more details check the model_fn implementation in
255 |         this file.\
256 |         """)
257 |     parser.add_argument(
258 |         '--epochs',
259 |         type=int,
260 |         default=10,
261 |         help='The number of epochs to use for training.')
262 |     parser.add_argument(
263 |         '--batch-size',
264 |         type=int,
265 |         default=128,
266 |         help='Batch size for training.')
267 |     parser.add_argument(
268 |         '--optimizer',
269 |         type=str,
270 |         default='adam')
271 |     parser.add_argument(
272 |         '--momentum',
273 |         type=float,
274 |         default=0.9)
275 |     args = parser.parse_args()
276 |     main(args)


--------------------------------------------------------------------------------
/training_script/cifar10_keras_sm_tf2_solution.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
  2 | #
  3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
  4 | # software and associated documentation files (the "Software"), to deal in the Software
  5 | # without restriction, including without limitation the rights to use, copy, modify,
  6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
  7 | # permit persons to whom the Software is furnished to do so.
  8 | #
  9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 15 | 
 16 | from __future__ import absolute_import
 17 | from __future__ import division
 18 | from __future__ import print_function
 19 | 
 20 | import argparse
 21 | import logging
 22 | import os
 23 | 
 24 | import tensorflow as tf
 25 | from tensorflow.keras.callbacks import ModelCheckpoint
 26 | from tensorflow.keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D, BatchNormalization
 27 | from tensorflow.keras.models import Sequential
 28 | from tensorflow.keras.optimizers import Adam, SGD, RMSprop
 29 | tf.get_logger().setLevel('INFO')
 30 | #tf.autograph.set_verbosity(1)
 31 | 
 32 | HEIGHT = 32
 33 | WIDTH = 32
 34 | DEPTH = 3
 35 | NUM_CLASSES = 10
 36 | NUM_DATA_BATCHES = 5
 37 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES
 38 | INPUT_TENSOR_NAME = 'inputs_input'  # needs to match the name of the first layer + "_input"
 39 | 
 40 | def keras_model_fn(learning_rate, weight_decay, optimizer, momentum):
 41 |     """Build and compile the Keras CNN from the hyperparameters of the training job.
 42 |     The model is trained directly with model.fit() in main() and exported as a
 43 |     TensorFlow SavedModel at the end of training (see save_model).
 44 | 
 45 |     Args:
 46 |         learning_rate, weight_decay, optimizer, momentum: Hyperparameters passed to the
 47 |             SageMaker TrainingJob that runs this training script.
 48 |     Returns: A compiled Keras model
 49 |     """
 50 |     model = Sequential()
 51 |     model.add(Conv2D(32, (3, 3), padding='same', name='inputs', input_shape=(HEIGHT, WIDTH, DEPTH)))
 52 |     model.add(BatchNormalization())
 53 |     model.add(Activation('relu'))
 54 |     model.add(Conv2D(32, (3, 3)))
 55 |     model.add(BatchNormalization())
 56 |     model.add(Activation('relu'))
 57 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 58 |     model.add(Dropout(0.2))
 59 | 
 60 |     model.add(Conv2D(64, (3, 3), padding='same'))
 61 |     model.add(BatchNormalization())
 62 |     model.add(Activation('relu'))
 63 |     model.add(Conv2D(64, (3, 3)))
 64 |     model.add(BatchNormalization())
 65 |     model.add(Activation('relu'))
 66 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 67 |     model.add(Dropout(0.3))
 68 | 
 69 |     model.add(Conv2D(128, (3, 3), padding='same'))
 70 |     model.add(BatchNormalization())
 71 |     model.add(Activation('relu'))
 72 |     model.add(Conv2D(128, (3, 3)))
 73 |     model.add(BatchNormalization())
 74 |     model.add(Activation('relu'))
 75 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 76 |     model.add(Dropout(0.4))
 77 | 
 78 |     model.add(Flatten())
 79 |     model.add(Dense(512))
 80 |     model.add(Activation('relu'))
 81 |     model.add(Dropout(0.5))
 82 |     model.add(Dense(NUM_CLASSES))
 83 |     model.add(Activation('softmax'))
 84 | 
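 85 |     size = 1  # learning-rate scale factor; fixed at 1 in this script
 86 |     # Note: `decay` below is the optimizer's learning-rate decay, not L2 weight decay.
 87 |     if optimizer.lower() == 'sgd':
 88 |         opt = SGD(learning_rate=learning_rate * size, decay=weight_decay, momentum=momentum)
 89 |     elif optimizer.lower() == 'rmsprop':
 90 |         opt = RMSprop(learning_rate=learning_rate * size, decay=weight_decay)
 91 |     else:
 92 |         opt = Adam(learning_rate=learning_rate * size, decay=weight_decay)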
 93 | 
 94 |     model.compile(loss='categorical_crossentropy',
 95 |                   optimizer=opt,
 96 |                   metrics=['accuracy'])
 97 |     return model
 98 | 
 99 | 
100 | def get_filenames(channel_name, channel):
101 |     if channel_name in ['train', 'validation', 'eval']:
102 |         return [os.path.join(channel, channel_name + '.tfrecords')]
103 |     else:
104 |         raise ValueError('Invalid data subset "%s"' % channel_name)
105 | 
106 | 
107 | def train_input_fn():
108 |     return _input(args.epochs, args.batch_size, args.train, 'train')
109 | 
110 | 
111 | def eval_input_fn():
112 |     return _input(args.epochs, args.batch_size, args.eval, 'eval')
113 | 
114 | 
115 | def validation_input_fn():
116 |     return _input(args.epochs, args.batch_size, args.validation, 'validation')
117 | 
118 | 
119 | def _input(epochs, batch_size, channel, channel_name):
120 | 
121 |     filenames = get_filenames(channel_name, channel)
122 |     dataset = tf.data.TFRecordDataset(filenames)
123 |     #dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=3)
124 | 
125 |     # Parse records.
126 |     dataset = dataset.map(_dataset_parser, num_parallel_calls=10)
127 | 
128 |     dataset = dataset.repeat()    
129 |     
130 |     # Potentially shuffle records.
131 |     if channel_name == 'train':
132 |         # Ensure that the capacity is sufficiently large to provide good random
133 |         # shuffling.
134 |         buffer_size = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 0.4) + 3 * batch_size
135 |         dataset = dataset.shuffle(buffer_size=buffer_size)
136 |  
137 |     # Batch it up.
138 |     dataset = dataset.batch(batch_size, drop_remainder=True)
139 |     dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
140 | 
141 |     return dataset
142 | 
143 | 
144 | def _train_preprocess_fn(image):
145 |     """Preprocess a single training image of layout [height, width, depth]."""
146 |     # Resize the image to add four extra pixels on each side.
147 |     image = tf.image.resize_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8)
148 | 
149 |     # Randomly crop a [HEIGHT, WIDTH] section of the image.
150 |     image = tf.image.random_crop(image, [HEIGHT, WIDTH, DEPTH])
151 | 
152 |     # Randomly flip the image horizontally.
153 |     image = tf.image.random_flip_left_right(image)
154 | 
155 |     return image
156 | 
157 | 
158 | def _dataset_parser(value):
159 |     """Parse a CIFAR-10 record from value."""
160 |     featdef = {
161 |         'image': tf.io.FixedLenFeature([], tf.string),
162 |         'label': tf.io.FixedLenFeature([], tf.int64),
163 |     }
164 | 
165 |     example = tf.io.parse_single_example(value, featdef)
166 |     image = tf.io.decode_raw(example['image'], tf.uint8)
167 |     image.set_shape([DEPTH * HEIGHT * WIDTH])
168 | 
169 |     # Reshape from [depth * height * width] to [depth, height, width], then transpose to [height, width, depth].
170 |     image = tf.cast(
171 |         tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]),
172 |         tf.float32)
173 |     label = tf.cast(example['label'], tf.int32)
174 |     image = _train_preprocess_fn(image)
175 |     return image, tf.one_hot(label, NUM_CLASSES)
176 | 
177 | def save_model(model, output):
178 |     tf.saved_model.save(model, output+'/1/')
179 |     logging.info("Model successfully saved at: {}".format(output))
180 |     return
181 | 
182 | def main(args):
183 |     logging.info("getting data")
184 |     train_dataset = train_input_fn()
185 |     eval_dataset = eval_input_fn()
186 |     validation_dataset = validation_input_fn()
187 | 
188 |     logging.info("configuring model")
189 |     model = keras_model_fn(args.learning_rate, args.weight_decay, args.optimizer, args.momentum)
190 |     callbacks = []
191 | 
192 |     # ----- Modified (path changed) -----
193 |     callbacks.append(ModelCheckpoint(args.model_output_dir + '/checkpoint-{epoch}.h5'))
194 | 
195 |     logging.info("Starting training")
196 |     model.fit(train_dataset,
197 |               steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size),
198 |               epochs=args.epochs, 
199 |               validation_data=validation_dataset,
200 |               validation_steps=(num_examples_per_epoch('validation') // args.batch_size),
201 |               callbacks=callbacks)
202 | 
203 |     score = model.evaluate(eval_dataset, steps=num_examples_per_epoch('eval') // args.batch_size,
204 |                            verbose=0)
205 | 
206 |     logging.info('Test loss:{}'.format(score[0]))
207 |     logging.info('Test accuracy:{}'.format(score[1]))
208 | 
209 |     # ----- Modified (path changed) -----
210 |     return save_model(model, args.model_output_dir)
211 | 
212 | def num_examples_per_epoch(subset='train'):
213 |     if subset == 'train':
214 |         return 40000
215 |     elif subset == 'validation':
216 |         return 10000
217 |     elif subset == 'eval':
218 |         return 10000
219 |     else:
220 |         raise ValueError('Invalid data subset "%s"' % subset)
221 | 
222 | 
223 | if __name__ == '__main__':
224 |     parser = argparse.ArgumentParser()
225 | 
226 |     parser.add_argument(
227 |         '--train',
228 |         type=str,
229 |         required=False,
230 |         default=os.environ['SM_CHANNEL_TRAIN'],  # ----- Modified (default path changed) -----
231 |         help='The directory where the CIFAR-10 input data is stored.')    
232 |     parser.add_argument(
233 |         '--validation',
234 |         type=str,
235 |         required=False,
236 |         default=os.environ['SM_CHANNEL_VALIDATION'],  # ----- Modified (default path changed) -----
237 |         help='The directory where the CIFAR-10 input data is stored.')
238 |     parser.add_argument(
239 |         '--eval',
240 |         type=str,
241 |         required=False,
242 |         default=os.environ['SM_CHANNEL_EVAL'],  # ----- Modified (default path changed) -----
243 |         help='The directory where the CIFAR-10 input data is stored.')
244 |     
245 |     # ----- Modified (argument added) -----
246 |     parser.add_argument(
247 |         '--model_output_dir',
248 |         type=str,
249 |         default=os.environ.get('SM_MODEL_DIR'))  
250 |     
251 |     parser.add_argument(
252 |         '--model_dir',
253 |         type=str,
254 |         required=True,
255 |         help='The directory where the model will be stored.')
256 |     
257 |     parser.add_argument(
258 |         '--weight-decay',
259 |         type=float,
260 |         default=2e-4,
261 |         help='Weight decay for convolutions.')
262 |     parser.add_argument(
263 |         '--learning-rate',
264 |         type=float,
265 |         default=0.001,
266 |         help="""\
267 |         This is the initial learning rate value. The learning rate will decrease
268 |         during training. For more details check the model_fn implementation in
269 |         this file.\
270 |         """)
271 |     parser.add_argument(
272 |         '--epochs',
273 |         type=int,
274 |         default=10,
275 |         help='The number of epochs to use for training.')
276 |     parser.add_argument(
277 |         '--batch-size',
278 |         type=int,
279 |         default=128,
280 |         help='Batch size for training.')
281 |     parser.add_argument(
282 |         '--optimizer',
283 |         type=str,
284 |         default='adam')
285 |     parser.add_argument(
286 |         '--momentum',
287 |         type=float,
288 |         default=0.9)
289 |     args = parser.parse_args()
290 |     main(args)
291 | 


--------------------------------------------------------------------------------
/training_script/cifar10_keras_tensorboard_solution.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
  2 | #
  3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
  4 | # software and associated documentation files (the "Software"), to deal in the Software
  5 | # without restriction, including without limitation the rights to use, copy, modify,
  6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
  7 | # permit persons to whom the Software is furnished to do so.
  8 | #
  9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 15 | 
 16 | from __future__ import absolute_import
 17 | from __future__ import division
 18 | from __future__ import print_function
 19 | 
 20 | import argparse
 21 | import logging
 22 | import os
 23 | 
 24 | from keras.callbacks import ModelCheckpoint
 25 | from keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D, BatchNormalization
 26 | from keras.models import Sequential
 27 | from keras.optimizers import Adam, SGD, RMSprop
 28 | import tensorflow as tf
 29 | from keras import backend as K
 30 | 
 31 | # ----- Added -----
 32 | from keras.callbacks import TensorBoard
 33 | 
 34 | sess = tf.Session()
 35 | K.set_session(sess)
 36 | 
 37 | logging.getLogger().setLevel(logging.INFO)
 38 | tf.logging.set_verbosity(tf.logging.INFO)
 39 | HEIGHT = 32
 40 | WIDTH = 32
 41 | DEPTH = 3
 42 | NUM_CLASSES = 10
 43 | NUM_DATA_BATCHES = 5
 44 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES
 45 | INPUT_TENSOR_NAME = 'inputs_input'  # needs to match the name of the first layer + "_input"
 46 | 
 47 | def keras_model_fn(learning_rate, weight_decay, optimizer, momentum):
 48 |     """Build and compile the Keras CNN from the hyperparameters of the training job.
 49 |     The model is trained directly with model.fit() in main() and exported as a
 50 |     TensorFlow Serving SavedModel at the end of training (see save_model).
 51 | 
 52 |     Args:
 53 |         learning_rate, weight_decay, optimizer, momentum: Hyperparameters passed to the
 54 |             SageMaker TrainingJob that runs this training script.
 55 |     Returns: A compiled Keras model
 56 |     """
 57 |     model = Sequential()
 58 |     model.add(Conv2D(32, (3, 3), padding='same', name='inputs', input_shape=(HEIGHT, WIDTH, DEPTH)))
 59 |     model.add(BatchNormalization())
 60 |     model.add(Activation('relu'))
 61 |     model.add(Conv2D(32, (3, 3)))
 62 |     model.add(BatchNormalization())
 63 |     model.add(Activation('relu'))
 64 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 65 |     model.add(Dropout(0.2))
 66 | 
 67 |     model.add(Conv2D(64, (3, 3), padding='same'))
 68 |     model.add(BatchNormalization())
 69 |     model.add(Activation('relu'))
 70 |     model.add(Conv2D(64, (3, 3)))
 71 |     model.add(BatchNormalization())
 72 |     model.add(Activation('relu'))
 73 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 74 |     model.add(Dropout(0.3))
 75 | 
 76 |     model.add(Conv2D(128, (3, 3), padding='same'))
 77 |     model.add(BatchNormalization())
 78 |     model.add(Activation('relu'))
 79 |     model.add(Conv2D(128, (3, 3)))
 80 |     model.add(BatchNormalization())
 81 |     model.add(Activation('relu'))
 82 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 83 |     model.add(Dropout(0.4))
 84 | 
 85 |     model.add(Flatten())
 86 |     model.add(Dense(512))
 87 |     model.add(Activation('relu'))
 88 |     model.add(Dropout(0.5))
 89 |     model.add(Dense(NUM_CLASSES))
 90 |     model.add(Activation('softmax'))
 91 | 
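 92 |     size = 1  # learning-rate scale factor; fixed at 1 in this script
 93 |     # Note: `decay` below is the optimizer's learning-rate decay, not L2 weight decay.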
 94 |     if optimizer.lower() == 'sgd':
 95 |         opt = SGD(lr=learning_rate * size, decay=weight_decay, momentum=momentum)
 96 |     elif optimizer.lower() == 'rmsprop':
 97 |         opt = RMSprop(lr=learning_rate * size, decay=weight_decay)
 98 |     else:
 99 |         opt = Adam(lr=learning_rate * size, decay=weight_decay)
100 | 
101 |     model.compile(loss='categorical_crossentropy',
102 |                   optimizer=opt,
103 |                   metrics=['accuracy'])
104 |     return model
105 | 
106 | 
107 | def get_filenames(channel_name, channel):
108 |     if channel_name in ['train', 'validation', 'eval']:
109 |         return [os.path.join(channel, channel_name + '.tfrecords')]
110 |     else:
111 |         raise ValueError('Invalid data subset "%s"' % channel_name)
112 | 
113 | 
114 | def train_input_fn():
115 |     return _input(args.epochs, args.batch_size, args.train, 'train')
116 | 
117 | 
118 | def eval_input_fn():
119 |     return _input(args.epochs, args.batch_size, args.eval, 'eval')
120 | 
121 | 
122 | def validation_input_fn():
123 |     return _input(args.epochs, args.batch_size, args.validation, 'validation')
124 | 
125 | 
126 | def _input(epochs, batch_size, channel, channel_name):
127 | 
128 |     filenames = get_filenames(channel_name, channel)
129 |     dataset = tf.data.TFRecordDataset(filenames)
130 | 
131 |     dataset = dataset.repeat(epochs)
132 |     dataset = dataset.prefetch(10)
133 | 
134 |     # Parse records.
135 |     dataset = dataset.map(
136 |         _dataset_parser, num_parallel_calls=10)
137 | 
138 |     # Potentially shuffle records.
139 |     if channel_name == 'train':
140 |         # Ensure that the capacity is sufficiently large to provide good random
141 |         # shuffling.
142 |         buffer_size = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 0.4) + 3 * batch_size
143 |         dataset = dataset.shuffle(buffer_size=buffer_size)
144 | 
145 |     # Batch it up.
146 |     dataset = dataset.batch(batch_size, drop_remainder=True)
147 |     iterator = dataset.make_one_shot_iterator()
148 |     image_batch, label_batch = iterator.get_next()
149 | 
150 |     return {INPUT_TENSOR_NAME: image_batch}, label_batch
151 | 
152 | 
153 | def _train_preprocess_fn(image):
154 |     """Preprocess a single training image of layout [height, width, depth]."""
155 |     # Resize the image to add four extra pixels on each side.
156 |     image = tf.image.resize_image_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8)
157 | 
158 |     # Randomly crop a [HEIGHT, WIDTH] section of the image.
159 |     image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH])
160 | 
161 |     # Randomly flip the image horizontally.
162 |     image = tf.image.random_flip_left_right(image)
163 | 
164 |     return image
165 | 
166 | 
167 | def _dataset_parser(value):
168 |     """Parse a CIFAR-10 record from value."""
169 |     featdef = {
170 |         'image': tf.FixedLenFeature([], tf.string),
171 |         'label': tf.FixedLenFeature([], tf.int64),
172 |     }
173 | 
174 |     example = tf.parse_single_example(value, featdef)
175 |     image = tf.decode_raw(example['image'], tf.uint8)
176 |     image.set_shape([DEPTH * HEIGHT * WIDTH])
177 | 
178 |     # Reshape from [depth * height * width] to [depth, height, width], then transpose to [height, width, depth].
179 |     image = tf.cast(
180 |         tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]),
181 |         tf.float32)
182 |     label = tf.cast(example['label'], tf.int32)
183 |     image = _train_preprocess_fn(image)
184 |     return image, tf.one_hot(label, NUM_CLASSES)
185 | 
186 | def save_model(model, output):
187 |     signature = tf.saved_model.signature_def_utils.predict_signature_def(
188 |         inputs={'inputs': model.input}, outputs={'scores': model.output})
189 | 
190 |     builder = tf.saved_model.builder.SavedModelBuilder(output+'/1/')
191 |     builder.add_meta_graph_and_variables(
192 |         sess=K.get_session(),
193 |         tags=[tf.saved_model.tag_constants.SERVING],
194 |         signature_def_map={"serving_default": signature})
195 |     builder.save()
196 | 
197 |     logging.info("Model successfully saved at: {}".format(output))
198 |     return
199 | 
200 | def main(args):
201 |     logging.info("getting data")
202 |     train_dataset = train_input_fn()
203 |     eval_dataset = eval_input_fn()
204 |     validation_dataset = validation_input_fn()
205 | 
206 |     logging.info("configuring model")
207 |     model = keras_model_fn(args.learning_rate, args.weight_decay, args.optimizer, args.momentum)
208 |     callbacks = []
209 |         
210 |     # ----- Modified part (path changed) -----
211 |     callbacks.append(ModelCheckpoint(args.model_output_dir + '/checkpoint-{epoch}.h5'))
212 |     # ----- Added part -----
213 |     callbacks.append(TensorBoard(log_dir=args.model_output_dir, update_freq='epoch'))
214 | 
215 |     logging.info("Starting training")
216 |     model.fit(x=train_dataset[0], y=train_dataset[1],
217 |               steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size),
218 |               epochs=args.epochs, validation_data=validation_dataset,
219 |               validation_steps=(num_examples_per_epoch('validation') // args.batch_size), callbacks=callbacks)
220 | 
221 |     score = model.evaluate(eval_dataset[0], eval_dataset[1], steps=num_examples_per_epoch('eval') // args.batch_size,
222 |                            verbose=0)
223 | 
224 |     logging.info('Test loss:{}'.format(score[0]))
225 |     logging.info('Test accuracy:{}'.format(score[1]))
226 | 
227 |     # ----- Modified part (path changed) -----
228 |     return save_model(model, args.model_output_dir)
229 | 
230 | def num_examples_per_epoch(subset='train'):
231 |     if subset == 'train':
232 |         return 40000
233 |     elif subset == 'validation':
234 |         return 10000
235 |     elif subset == 'eval':
236 |         return 10000
237 |     else:
238 |         raise ValueError('Invalid data subset "%s"' % subset)
239 | 
240 |         
241 | if __name__ == '__main__':
242 |     parser = argparse.ArgumentParser()
243 | 
244 |     parser.add_argument(
245 |         '--train',
246 |         type=str,
247 |         required=False,
248 |         default=os.environ['SM_CHANNEL_TRAIN'],  # ----- Modified part (default path changed) -----
249 |         help='The directory where the CIFAR-10 training data is stored.')
250 |     parser.add_argument(
251 |         '--validation',
252 |         type=str,
253 |         required=False,
254 |         default=os.environ['SM_CHANNEL_VALIDATION'],  # ----- Modified part (default path changed) -----
255 |         help='The directory where the CIFAR-10 validation data is stored.')
256 |     parser.add_argument(
257 |         '--eval',
258 |         type=str,
259 |         required=False,
260 |         default=os.environ['SM_CHANNEL_EVAL'],  # ----- Modified part (default path changed) -----
261 |         help='The directory where the CIFAR-10 evaluation data is stored.')
262 |     
263 |     # ----- Modified part (argument added) -----
264 |     parser.add_argument(
265 |         '--model_output_dir',
266 |         type=str,
267 |         default=os.environ.get('SM_MODEL_DIR'))  
268 |     
269 |     parser.add_argument(
270 |         '--model_dir',
271 |         type=str,
272 |         required=True,
273 |         help='The directory where the model will be stored.')
274 |     
275 |     parser.add_argument(
276 |         '--weight-decay',
277 |         type=float,
278 |         default=2e-4,
279 |         help='Weight decay for convolutions.')
280 |     parser.add_argument(
281 |         '--learning-rate',
282 |         type=float,
283 |         default=0.001,
284 |         help="""\
285 |         This is the initial learning rate value. The learning rate will decrease
286 |         during training. For more details check the model_fn implementation in
287 |         this file.\
288 |         """)
289 |     parser.add_argument(
290 |         '--epochs',
291 |         type=int,
292 |         default=10,
293 |         help='The number of epochs to use for training.')
294 |     parser.add_argument(
295 |         '--batch-size',
296 |         type=int,
297 |         default=128,
298 |         help='Batch size for training.')
299 |     parser.add_argument(
300 |         '--optimizer',
301 |         type=str,
302 |         default='adam')
303 |     parser.add_argument(
304 |         '--momentum',
305 |         type=float,
306 |         default=0.9)
307 |     args = parser.parse_args()
308 |     main(args)
309 | 


--------------------------------------------------------------------------------
/training_script/cifar10_keras_tf2.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
  2 | #
  3 | # Permission is hereby granted, free of charge, to any person obtaining a copy of this
  4 | # software and associated documentation files (the "Software"), to deal in the Software
  5 | # without restriction, including without limitation the rights to use, copy, modify,
  6 | # merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
  7 | # permit persons to whom the Software is furnished to do so.
  8 | #
  9 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
 10 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
 11 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 12 | # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 13 | # OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
 14 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 15 | 
 16 | from __future__ import absolute_import
 17 | from __future__ import division
 18 | from __future__ import print_function
 19 | 
 20 | import argparse
 21 | import logging
 22 | import os
 23 | 
 24 | import tensorflow as tf
 25 | from tensorflow.keras.callbacks import ModelCheckpoint
 26 | from tensorflow.keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D, BatchNormalization
 27 | from tensorflow.keras.models import Sequential
 28 | from tensorflow.keras.optimizers import Adam, SGD, RMSprop
 29 | tf.get_logger().setLevel('INFO')
 30 | #tf.autograph.set_verbosity(1)
 31 | 
 32 | HEIGHT = 32
 33 | WIDTH = 32
 34 | DEPTH = 3
 35 | NUM_CLASSES = 10
 36 | NUM_DATA_BATCHES = 5
 37 | NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES
 38 | INPUT_TENSOR_NAME = 'inputs_input'  # needs to match the name of the first layer + "_input"
 39 | 
 40 | def keras_model_fn(learning_rate, weight_decay, optimizer, momentum):
 41 |     """keras_model_fn receives hyperparameters from the training job and returns a compiled Keras model.
 42 |     The model is trained directly with model.fit and saved as a TensorFlow SavedModel at the end of
 43 |     training (see main and save_model).
 44 | 
 45 |     Args:
 46 |         learning_rate, weight_decay, optimizer, momentum: The hyperparameters passed to the SageMaker
 47 |                          TrainingJob that runs your TensorFlow training script.
 48 |     Returns: A compiled Keras model
 49 |     """
 50 |     model = Sequential()
 51 |     model.add(Conv2D(32, (3, 3), padding='same', name='inputs', input_shape=(HEIGHT, WIDTH, DEPTH)))
 52 |     model.add(BatchNormalization())
 53 |     model.add(Activation('relu'))
 54 |     model.add(Conv2D(32, (3, 3)))
 55 |     model.add(BatchNormalization())
 56 |     model.add(Activation('relu'))
 57 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 58 |     model.add(Dropout(0.2))
 59 | 
 60 |     model.add(Conv2D(64, (3, 3), padding='same'))
 61 |     model.add(BatchNormalization())
 62 |     model.add(Activation('relu'))
 63 |     model.add(Conv2D(64, (3, 3)))
 64 |     model.add(BatchNormalization())
 65 |     model.add(Activation('relu'))
 66 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 67 |     model.add(Dropout(0.3))
 68 | 
 69 |     model.add(Conv2D(128, (3, 3), padding='same'))
 70 |     model.add(BatchNormalization())
 71 |     model.add(Activation('relu'))
 72 |     model.add(Conv2D(128, (3, 3)))
 73 |     model.add(BatchNormalization())
 74 |     model.add(Activation('relu'))
 75 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 76 |     model.add(Dropout(0.4))
 77 | 
 78 |     model.add(Flatten())
 79 |     model.add(Dense(512))
 80 |     model.add(Activation('relu'))
 81 |     model.add(Dropout(0.5))
 82 |     model.add(Dense(NUM_CLASSES))
 83 |     model.add(Activation('softmax'))
 84 | 
 85 |     size = 1
 86 | 
 87 |     if optimizer.lower() == 'sgd':
 88 |         opt = SGD(learning_rate=learning_rate * size, decay=weight_decay, momentum=momentum)
 89 |     elif optimizer.lower() == 'rmsprop':
 90 |         opt = RMSprop(learning_rate=learning_rate * size, decay=weight_decay)
 91 |     else:
 92 |         opt = Adam(learning_rate=learning_rate * size, decay=weight_decay)
 93 | 
 94 |     model.compile(loss='categorical_crossentropy',
 95 |                   optimizer=opt,
 96 |                   metrics=['accuracy'])
 97 |     return model
 98 | 
 99 | 
100 | def get_filenames(channel_name, channel):
101 |     if channel_name in ['train', 'validation', 'eval']:
102 |         return [os.path.join(channel, channel_name + '.tfrecords')]
103 |     else:
104 |         raise ValueError('Invalid data subset "%s"' % channel_name)
105 | 
106 | 
107 | def train_input_fn():
108 |     return _input(args.epochs, args.batch_size, args.train, 'train')
109 | 
110 | 
111 | def eval_input_fn():
112 |     return _input(args.epochs, args.batch_size, args.eval, 'eval')
113 | 
114 | 
115 | def validation_input_fn():
116 |     return _input(args.epochs, args.batch_size, args.validation, 'validation')
117 | 
118 | 
119 | def _input(epochs, batch_size, channel, channel_name):
120 | 
121 |     filenames = get_filenames(channel_name, channel)
122 |     dataset = tf.data.TFRecordDataset(filenames)
123 |     #dataset = dataset.interleave(tf.data.TFRecordDataset, cycle_length=3)
124 | 
125 |     # Parse records.
126 |     dataset = dataset.map(_dataset_parser, num_parallel_calls=10)
127 | 
128 |     dataset = dataset.repeat()    
129 |     
130 |     # Potentially shuffle records.
131 |     if channel_name == 'train':
132 |         # Ensure that the capacity is sufficiently large to provide good random
133 |         # shuffling.
134 |         buffer_size = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN * 0.4) + 3 * batch_size
135 |         dataset = dataset.shuffle(buffer_size=buffer_size)
136 |  
137 |     # Batch it up.
138 |     dataset = dataset.batch(batch_size, drop_remainder=True)
139 |     dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
140 | 
141 |     return dataset
142 | 
143 | 
144 | def _train_preprocess_fn(image):
145 |     """Preprocess a single training image of layout [height, width, depth]."""
146 |     # Resize the image to add four extra pixels on each side.
147 |     image = tf.image.resize_with_crop_or_pad(image, HEIGHT + 8, WIDTH + 8)
148 | 
149 |     # Randomly crop a [HEIGHT, WIDTH] section of the image.
150 |     image = tf.image.random_crop(image, [HEIGHT, WIDTH, DEPTH])
151 | 
152 |     # Randomly flip the image horizontally.
153 |     image = tf.image.random_flip_left_right(image)
154 | 
155 |     return image
156 | 
157 | 
158 | def _dataset_parser(value):
159 |     """Parse a CIFAR-10 record from value."""
160 |     featdef = {
161 |         'image': tf.io.FixedLenFeature([], tf.string),
162 |         'label': tf.io.FixedLenFeature([], tf.int64),
163 |     }
164 | 
165 |     example = tf.io.parse_single_example(value, featdef)
166 |     image = tf.io.decode_raw(example['image'], tf.uint8)
167 |     image.set_shape([DEPTH * HEIGHT * WIDTH])
168 | 
169 |     # Reshape from [depth * height * width] to [depth, height, width], then transpose to [height, width, depth].
170 |     image = tf.cast(
171 |         tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]),
172 |         tf.float32)
173 |     label = tf.cast(example['label'], tf.int32)
174 |     image = _train_preprocess_fn(image)
175 |     return image, tf.one_hot(label, NUM_CLASSES)
176 | 
177 | def save_model(model, output):
178 |     tf.saved_model.save(model, output+'/1/')
179 |     logging.info("Model successfully saved at: {}".format(output))
180 |     return
181 | 
182 | def main(args):
183 |     logging.info("getting data")
184 |     train_dataset = train_input_fn()
185 |     eval_dataset = eval_input_fn()
186 |     validation_dataset = validation_input_fn()
187 | 
188 |     logging.info("configuring model")
189 |     model = keras_model_fn(args.learning_rate, args.weight_decay, args.optimizer, args.momentum)
190 |     callbacks = []
191 | 
192 |     callbacks.append(ModelCheckpoint(args.model_dir + '/checkpoint-{epoch}.h5'))
193 | 
194 |     logging.info("Starting training")
195 |     model.fit(train_dataset,
196 |               steps_per_epoch=(num_examples_per_epoch('train') // args.batch_size),
197 |               epochs=args.epochs, 
198 |               validation_data=validation_dataset,
199 |               validation_steps=(num_examples_per_epoch('validation') // args.batch_size),
200 |               callbacks=callbacks)
201 | 
202 |     score = model.evaluate(eval_dataset, steps=num_examples_per_epoch('eval') // args.batch_size,
203 |                            verbose=0)
204 | 
205 |     logging.info('Test loss:{}'.format(score[0]))
206 |     logging.info('Test accuracy:{}'.format(score[1]))
207 | 
208 |     return save_model(model, args.model_dir)
209 | 
210 | def num_examples_per_epoch(subset='train'):
211 |     if subset == 'train':
212 |         return 40000
213 |     elif subset == 'validation':
214 |         return 10000
215 |     elif subset == 'eval':
216 |         return 10000
217 |     else:
218 |         raise ValueError('Invalid data subset "%s"' % subset)
219 | 
220 |                  
221 | if __name__ == '__main__':
222 |     parser = argparse.ArgumentParser()
223 |     parser.add_argument(
224 |         '--train',
225 |         type=str,
226 |         required=False,
227 |         help='The directory where the CIFAR-10 training data is stored.')
228 |     parser.add_argument(
229 |         '--validation',
230 |         type=str,
231 |         required=False,
232 |         help='The directory where the CIFAR-10 validation data is stored.')
233 |     parser.add_argument(
234 |         '--eval',
235 |         type=str,
236 |         required=False,
237 |         help='The directory where the CIFAR-10 evaluation data is stored.')
238 |     parser.add_argument(
239 |         '--model_dir',
240 |         type=str,
241 |         required=True,
242 |         help='The directory where the model will be stored.')
243 |     parser.add_argument(
244 |         '--weight-decay',
245 |         type=float,
246 |         default=2e-4,
247 |         help='Weight decay for convolutions.')
248 |     parser.add_argument(
249 |         '--learning-rate',
250 |         type=float,
251 |         default=0.001,
252 |         help="""\
253 |         This is the initial learning rate value. The learning rate will decrease
254 |         during training. For more details check the model_fn implementation in
255 |         this file.\
256 |         """)
257 |     parser.add_argument(
258 |         '--epochs',
259 |         type=int,
260 |         default=10,
261 |         help='The number of epochs to use for training.')
262 |     parser.add_argument(
263 |         '--batch-size',
264 |         type=int,
265 |         default=128,
266 |         help='Batch size for training.')
267 |     parser.add_argument(
268 |         '--optimizer',
269 |         type=str,
270 |         default='adam')
271 |     parser.add_argument(
272 |         '--momentum',
273 |         type=float,
274 |         default=0.9)
275 |     args = parser.parse_args()
276 |     main(args)


--------------------------------------------------------------------------------