├── README.md └── mnist ├── README.en.md ├── README.md ├── __init__.py ├── average_precision_calculator.py ├── cloudml-4gpu.yaml ├── cloudml-gpu-distributed.yaml ├── cloudml-gpu.yaml ├── convert_prediction_from_json_to_csv.py ├── eval.py ├── eval_util.py ├── export_model.py ├── inference.py ├── losses.py ├── mean_average_precision_calculator.py ├── mnist_models.py ├── models.py ├── readers.py ├── train.py └── utils.py /README.md: -------------------------------------------------------------------------------- 1 | # Tutorial repository for Machine Learning Challenge 2017 2 | 3 | This page contains the tutorial project for Machine Learning Challenge 2017. 4 | -------------------------------------------------------------------------------- /mnist/README.en.md: -------------------------------------------------------------------------------- 1 | # MNIST Tensorflow Starter Code 2 | 3 | 한국어 버전 설명은 다음 링크를 따라가주세요: [Link](README.md) 4 | 5 | This repo contains starter code for training and evaluating machine learning 6 | models over the mnist dataset. 7 | 8 | The code gives an end-to-end working example for reading the dataset, training a 9 | TensorFlow model, and evaluating the performance of the model. Out of the box, 10 | you can train several [model architectures](#overview-of-models) over features. 11 | The code can easily be extended to train your own custom-defined models. 12 | 13 | It is possible to train and evaluate on mnist in two ways: on Google Cloud 14 | or on your own machine. This README provides instructions for both. 15 | 16 | ## Table of Contents 17 | * [Running on Google's Cloud Machine Learning Platform](#running-on-googles-cloud-machine-learning-platform) 18 | * [Requirements](#requirements) 19 | * [Testing Locally](#testing-locally) 20 | * [Training on the Cloud](#training-on-features) 21 | * [Evaluation and Inference](#evaluation-and-inference) 22 | * [Inference Using Batch Prediction](#inference-using-batch-prediction) 23 | * [Accessing Files on Google Cloud](#accessing-files-on-google-cloud) 24 | * [Using Larger Machine Types](#using-larger-machine-types) 25 | * [Running on Your Own Machine](#running-on-your-own-machine) 26 | * [Requirements](#requirements-1) 27 | * [Overview of Models](#overview-of-models) 28 | * [Overview of Files](#overview-of-files) 29 | * [Training](#training) 30 | * [Evaluation](#evaluation) 31 | * [Inference](#inference) 32 | * [Misc](#misc) 33 | * [TODO for participants](#todo-for-participants) 34 | * [Etc](#etc) 35 | * [About This Project](#about-this-project) 36 | 37 | ## Running on Google's Cloud Machine Learning Platform 38 | 39 | ### Requirements 40 | 41 | This option requires you to have an appropriately configured Google Cloud 42 | Platform account. To create and configure your account, please make sure you 43 | follow the instructions [here](https://cloud.google.com/ml/docs/how-tos/getting-set-up). 44 | 45 | Please also verify that you have Python 2.7+ and Tensorflow 1.0.1 or higher 46 | installed by running the following commands: 47 | 48 | ```sh 49 | python --version 50 | python -c 'import tensorflow as tf; print(tf.__version__)' 51 | ``` 52 | 53 | ### Testing Locally 54 | All gcloud commands should be done from the directory *immediately above* the 55 | source code. You should be able to see the source code directory if you 56 | run 'ls'. 57 | 58 | As you are developing your own models, you will want to test them 59 | quickly to flush out simple problems without having to submit them to the cloud. 60 | You can use the `gcloud beta ml local` set of commands for that. 61 | Here is an example command line: 62 | 63 | ```sh 64 | gcloud ml-engine local train \ 65 | --package-path=mnist --module-name=mnist.train -- \ 66 | --train_data_pattern='gs://kmlc_test_train_bucket/mnist/train.tfrecords' \ 67 | --train_dir=/tmp/kmlc_mnist_train --model=LogisticModel --start_new_model 68 | ``` 69 | 70 | You might want to download the training data files to the current directory. 71 | 72 | ```sh 73 | gsutil cp gs://kmlc_test_train_bucket/mnist/train.tfrecords . 74 | ``` 75 | 76 | Once you download the files, you can point the job to them using the 77 | 'train_data_pattern' argument (i.e. instead of pointing to the "gs://..." 78 | files, you point to the local files). 79 | 80 | Once your model is working locally, you can scale up on the Cloud 81 | which is described below. 82 | 83 | ### Training on the Cloud 84 | 85 | You'll use Google Cloud to access the training and test files. You'll also be given free credits to try out Google Cloud. Below are some step-by-step tutorials to set up and get data, and submit training/testing jobs to Google Cloud ML. 86 | 87 | #### Set up your Google Cloud project 88 | 89 | 1. Create a new Cloud Platform project. This is where your project lives. 90 | - Click Create Project and follow instructions. 91 | - Enable billing for your project. This links a billing method to your project. For a new account, you will already have $300 in trial credits within your default billing account. 92 | - [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,dataflow,compute_component,logging,storage_component,storage_api,bigquery) but ignore adding Credentials. This enables the set of Cloud APIs that are needed for Cloud ML functionality such as Cloud Logging to get your training logs. Other APIs include: Cloud Machine Learning, Dataflow, Compute Engine, Cloud Storage, Cloud Storage JSON, and BigQuery APIs. 93 | - You may have to select the newly-created project. 94 | - Expect this to take some time. 95 | - After the APIs are enabled, do not “Go to Credentials” 96 | 2. Set up your environment using cloud shell 97 | - There are three paths to use Google Cloud for this competition: Cloud shell, local (Mac/Linux), & Docker. To start we recommend the cloud shell to avoid having to set up a local environment. 98 | - Before you click the cloud shell button, make sure that you have already selected your newly-created project (in the screenshot example, the project name is My First Project) 99 | - You can start a cloud shell by clicking on the cloud shell icon shown in the screenshot below. 100 | 101 | - You should run all of the following commands inside of the cloud shell command line. 102 | - The first step to setting up the environment is to configure the gcloud command-line tool to use your selected project, where [selected-project-id] is your project id, without the enclosing brackets. 103 | ``` sh 104 | gcloud config set project [selected-project-id] 105 | ``` 106 | - Python version should be 2.7+ 107 | ``` sh 108 | $ python --version 109 | Python 2.7.9 110 | ``` 111 | - Install the latest version of TensorFlow (1.2.1) with the following 2 command lines. 112 | ```sh 113 | pip download tensorflow 114 | pip install --user -U tensorflow*.whl 115 | ``` 116 | 3. Verify the Google Cloud SDK Components 117 | - List the models to verify that the command returns an empty list. 118 | ```sh 119 | gcloud ml-engine models list 120 | ``` 121 | - The command will an empty list, and after you start creating models, you can see them listed by using this command. 122 | 123 | #### Running training 124 | 125 | The following commands will train a model on Google Cloud. Following commands are need to be executed on Google Cloud Shell. 126 | 127 | ```sh 128 | BUCKET_NAME=gs://${USER}_kmlc_mnist_train_bucket 129 | # (One Time) Create a storage bucket to store training logs and checkpoints. 130 | gsutil mb -l us-east1 $BUCKET_NAME 131 | # Submit the training job. 132 | JOB_NAME=kmlc_mnist_train_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \ 133 | submit training $JOB_NAME \ 134 | --package-path=mnist --module-name=mnist.train \ 135 | --staging-bucket=$BUCKET_NAME --region=us-east1 \ 136 | --config=mnist/cloudml-gpu.yaml \ 137 | -- --train_data_pattern='gs://kmlc_test_train_bucket/mnist/train.tfrecords' \ 138 | --model=LogisticModel \ 139 | --train_dir=$BUCKET_NAME/kmlc_mnist_train_logistic_model 140 | ``` 141 | 142 | In the 'gsutil' command above, the 'package-path' flag refers to the directory 143 | containing the 'train.py' script and more generally the python package which 144 | should be deployed to the cloud worker. The module-name refers to the specific 145 | python script which should be executed (in this case the train module). 146 | 147 | It may take several minutes before the job starts running on Google Cloud. 148 | When it starts you will see outputs like the following: 149 | 150 | ``` 151 | training step 270| Hit@1: 0.68 PERR: 0.52 Loss: 638.453 152 | training step 271| Hit@1: 0.66 PERR: 0.49 Loss: 635.537 153 | training step 272| Hit@1: 0.70 PERR: 0.52 Loss: 637.564 154 | ``` 155 | 156 | At this point you can disconnect your console by pressing "ctrl-c". The 157 | model will continue to train indefinitely in the Cloud. Later, you can check 158 | on its progress or halt the job by visiting the 159 | [Google Cloud ML Jobs console](https://console.cloud.google.com/ml/jobs). 160 | 161 | You can train many jobs at once and use tensorboard to compare their performance 162 | visually. 163 | 164 | ```sh 165 | tensorboard --logdir=$BUCKET_NAME --port=8080 166 | ``` 167 | 168 | Once tensorboard is running, you can access it at the following url: 169 | [http://localhost:8080](http://localhost:8080). 170 | If you are using Google Cloud Shell, you can instead click the Web Preview button 171 | on the upper left corner of the Cloud Shell window and select "Preview on port 8080". 172 | This will bring up a new browser tab with the Tensorboard view. 173 | 174 | ### Evaluation and Inference 175 | Here's how to evaluate a model on the validation dataset: 176 | 177 | ```sh 178 | JOB_TO_EVAL=kmlc_mnist_train_logistic_model 179 | JOB_NAME=kmlc_mnist_eval_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \ 180 | submit training $JOB_NAME \ 181 | --package-path=mnist --module-name=mnist.eval \ 182 | --staging-bucket=$BUCKET_NAME --region=us-east1 \ 183 | --config=mnist/cloudml-gpu.yaml \ 184 | -- --eval_data_pattern='gs://kmlc_test_train_bucket/mnist/validation.tfrecords' \ 185 | --model=LogisticModel \ 186 | --train_dir=$BUCKET_NAME/${JOB_TO_EVAL} --run_once=True 187 | ``` 188 | 189 | And here's how to perform inference with a model on the test set: 190 | 191 | ```sh 192 | JOB_TO_EVAL=kmlc_mnist_train_logistic_model 193 | JOB_NAME=kmlc_mnist_inference_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \ 194 | submit training $JOB_NAME \ 195 | --package-path=mnist --module-name=mnist.inference \ 196 | --staging-bucket=$BUCKET_NAME --region=us-east1 \ 197 | --config=mnist/cloudml-gpu.yaml \ 198 | -- --input_data_pattern='gs://kmlc_test_train_bucket/mnist/test.tfrecords' \ 199 | --train_dir=$BUCKET_NAME/${JOB_TO_EVAL} \ 200 | --output_file=$BUCKET_NAME/${JOB_TO_EVAL}/predictions.csv 201 | ``` 202 | 203 | Note the confusing use of 'training' in the above gcloud commands. Despite the 204 | name, the 'training' argument really just offers a cloud hosted 205 | python/tensorflow service. From the point of view of the Cloud Platform, there 206 | is no distinction between our training and inference jobs. The Cloud ML platform 207 | also offers specialized functionality for prediction with 208 | Tensorflow models, but discussing that is beyond the scope of this readme. 209 | 210 | Once these job starts executing you will see outputs similar to the 211 | following for the evaluation code: 212 | 213 | ``` 214 | examples_processed: 1024 | global_step 447044 | Batch Hit@1: 0.782 | Batch PERR: 0.637 | Batch Loss: 7.821 | Examples_per_sec: 834.658 215 | ``` 216 | 217 | and the following for the inference code: 218 | 219 | ``` 220 | num examples processed: 8192 elapsed seconds: 14.85 221 | ``` 222 | 223 | ### Inference Using Batch Prediction 224 | To perform inference faster, you can also use the Cloud ML batch prediction 225 | service. 226 | 227 | First, find the directory where the training job exported the model: 228 | 229 | ``` 230 | gsutil list ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export 231 | ``` 232 | 233 | You should see an output similar to this one: 234 | 235 | ``` 236 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/ 237 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_1/ 238 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_1001/ 239 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_2001/ 240 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_3001/ 241 | ``` 242 | 243 | Select the latest version of the model that was saved. For instance, in our 244 | case, we select the version of the model that was saved at step 3001: 245 | 246 | ``` 247 | EXPORTED_MODEL_DIR=${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_3001/ 248 | ``` 249 | 250 | Start the batch prediction job using the following command: 251 | 252 | ``` 253 | JOB_NAME=kmlc_mnist_batch_predict_$(date +%Y%m%d_%H%M%S); \ 254 | gcloud ml-engine jobs submit prediction ${JOB_NAME} --verbosity=debug \ 255 | --model-dir=${EXPORTED_MODEL_DIR} --data-format=TF_RECORD \ 256 | --input-paths='gs://kmlc_test_train_bucket/mnist/test.tfrecords' \ 257 | --output-path=${BUCKET_NAME}/batch_predict/${JOB_NAME}.csv --region=us-east1 \ 258 | --runtime-version=1.2 --max-worker-count=10 259 | ``` 260 | 261 | You can check the progress of the job on the 262 | [Google Cloud ML Jobs console](https://console.cloud.google.com/ml/jobs). To 263 | have the job complete faster, you can increase 'max-worker-count' to a 264 | higher value. 265 | 266 | Once the batch prediction job has completed, turn its output into a submission 267 | in the CVS format by running the following commands: 268 | 269 | ``` 270 | # Copy the output of the batch prediction job to a local directory 271 | mkdir -p /tmp/batch_predict/${JOB_NAME} 272 | gsutil -m cp -r ${BUCKET_NAME}/batch_predict/${JOB_NAME}.csv/* /tmp/batch_predict/${JOB_NAME}.csv 273 | ``` 274 | 275 | Submit the resulting file /tmp/batch_predict/${JOB_NAME}.csv to [Kaggle](https://inclass.kaggle.com/c/mnist-tutorial-machine-learning-challenge). 276 | 277 | ### Accessing Files on Google Cloud 278 | 279 | You can browse the storage buckets you created on Google Cloud, for example, to 280 | access the trained models, prediction CSV files, etc. by visiting the 281 | [Google Cloud storage browser](https://console.cloud.google.com/storage/browser). 282 | 283 | Alternatively, you can use the 'gsutil' command to download the files directly. 284 | For example, to download the output of the inference code from the previous 285 | section to your local machine, run: 286 | 287 | 288 | ``` 289 | gsutil cp $BUCKET_NAME/${JOB_TO_EVAL}/predictions.csv . 290 | ``` 291 | 292 | 293 | ### Using Larger Machine Types 294 | Some complex models can take as long as a week to converge when 295 | using only one GPU. You can train these models more quickly by using more 296 | powerful machine types which have additional GPUs. To use a configuration with 297 | 4 GPUs, replace the argument to `--config` with `mnist/cloudml-4gpu.yaml`. 298 | Be careful with this argument as it will also increase the rate you are charged 299 | by a factor of 4 as well. 300 | 301 | ## Running on Your Own Machine 302 | 303 | ### Requirements 304 | 305 | The starter code requires Tensorflow. If you haven't installed it yet, follow 306 | the instructions on [tensorflow.org](https://www.tensorflow.org/install/). 307 | This code has been tested with Tensorflow 1.0.1. Going forward, we will continue 308 | to target the latest released version of Tensorflow. 309 | 310 | Please verify that you have Python 2.7+ and Tensorflow 1.0.1 or higher 311 | installed by running the following commands: 312 | 313 | ```sh 314 | python --version 315 | python -c 'import tensorflow as tf; print(tf.__version__)' 316 | ``` 317 | 318 | Downloading files 319 | ``` sh 320 | gsutil cp gs://kmlc_test_train_bucket/mnist/train* . 321 | gsutil cp gs://kmlc_test_train_bucket/mnist/test* . 322 | gsutil cp gs://kmlc_test_train_bucket/mnist/validation* . 323 | ``` 324 | 325 | Training 326 | ```sh 327 | python train.py --train_data_pattern='/path/to/training/files/*' --train_dir=/tmp/mnist_train --model=LogisticModel --start_new_model 328 | ``` 329 | 330 | Validation 331 | ```sh 332 | python eval.py --eval_data_pattern='/path/to/validation/files' --train_dir=/tmp/mnist_train --model=LogisticModel --run_once=True 333 | ``` 334 | 335 | Generating submission 336 | ```sh 337 | python inference.py --output_file=/path/to/predictions.csv --input_data_pattern='/path/to/test/files/*' --train_dir=/tmp/mnist_train 338 | ``` 339 | 340 | ## Overview of Models 341 | 342 | This sample code contains implementation of the logistic model: 343 | 344 | * `LogisticModel`: Linear projection of the output features into the label 345 | space, followed by a sigmoid function to convert logit 346 | values to probabilities. 347 | 348 | ## Overview of Files 349 | 350 | ### Training 351 | * `train.py`: Defines the parameters and procedures for training. You can modify the parameters such as the location of training dataset, the model to be used for training, the batch size, the loss function to be used, the learning rate, etc. Depending on the model, you may want to modify get_input_data_tensors() on how data is shuffled. 352 | * `losses.py`: Defines the loss functions. You can call train.py to use any of the loss functions defined in losses.py. 353 | * `models.py`: Contains the base class for defining a model. 354 | * `mnist_models.py`: Contains the definition for models that that take the aggregated features as input, and you should add your own models here. You can invoke any model by calling train.py using --model=YourModelName. 355 | * `export_model.py`: Provides a class to export a model during training for later use in batch prediction. 356 | * `readers.py`: Contains the definition of the dataset, and describes how input data are prepared. You can preprocess the input files by modifying prepare_serialized_examples(). For example, you may want to resize the data or introduce some random noise. 357 | 358 | ### Evaluation 359 | * `eval.py`: The primary script for evaluating models. Once the model is trained, you can call eval.py with the --train_dir=/path/to/model and --model=YourModelName to load your model from the files in train_dir. Most likely you do not need to modify this file. 360 | * `eval_util.py`: Provides a class that calculates all evaluation metrics. 361 | * `average_precision_calculator.py`: Functions for calculating average precision. 362 | * `mean_average_precision_calculator.py`: Functions for calculating mean average precision. 363 | 364 | ### Inference 365 | * `inference.py`: Generates an output file containing predictions of the model over a set of data. Call inference.py on the test data to generate a list of predicted labels. For the supervised learning problems, the evaluation is based on the accuracy of the predicted labels. 366 | 367 | ### Misc 368 | * `README.md`: This documentation. 369 | * `utils.py`: Common functions. 370 | * `convert_prediction_from_json_to_csv.py`: Converts the JSON output of batch prediction into a CSV file for submission. 371 | 372 | ## TODO for participants 373 | * Create your model in mnist_models.py. 374 | * Create a loss function in losses.py. 375 | * If necessary, add input preprocessing in readers.py. 376 | * Adjust the parameters in train.py, such as the batch size and learning rate, and modify the training procedure if it is necessary. 377 | * Train your model by calling train.py using your model name and loss function. 378 | * Call eval.py to examine the performance of trained model on validation data. 379 | * Call inference.py to obtain the predicted labels using your model, and submit these labels in [Kaggle](https://inclass.kaggle.com/c/mnist-tutorial-machine-learning-challenge) for evaluation. 380 | * In general, you are free to modify any file if it improves the performance of your model. 381 | 382 | ## Etc 383 | * [Tensorflow MNIST Tutorial](https://www.tensorflow.org/get_started/mnist/beginners) 384 | 385 | ## About This Project 386 | This project is meant help people quickly get started working with the 387 | [mnist](https://inclass.kaggle.com/c/mnist-tutorial-machine-learning-challenge) dataset. 388 | This is not an official Google product. 389 | -------------------------------------------------------------------------------- /mnist/README.md: -------------------------------------------------------------------------------- 1 | # MNIST Tensorflow Starter Code 2 | 3 | For the English version description, please follow the link: [Link](README.en.md) 4 | 5 | 본 저장소는 MNIST 문제를 해결하기 위한 트레이닝과 머신러닝 모델 평가를 위한 기본적인 코드를 담고 있습니다. 6 | 7 | 본 저장소에 있는 코드는 데이터를 읽고, TensorFlow 모델을 훈련하고, 모델의 성능을 평가하는 일련의 과정에 대한 예제 코드입니다. 8 | 여러분은 데이터를 이용해서 [model architectures](#overview-of-models)에 주어진 것과 같은 여러 모델들을 훈련할 수 있습니다. 9 | 또한, 본 코드는 주어진 모델 외에도 여러분의 커스텀 모델에도 쉽게 적용될 수 있도록 디자인 되어 있습니다. 10 | 11 | MNIST를 훈련시키고 평가하는 방법에는 두 가지가 있습니다: Google Cloud 위에서 작업하실 수도 있고, 여러분의 로컬 머신에서 작업하실 수도 있습니다. 12 | 본 README는 양쪽 방법 모두를 다루고자 합니다. 13 | 14 | ## Table of Contents 15 | * [Running on Google's Cloud Machine Learning Platform](#running-on-googles-cloud-machine-learning-platform) 16 | * [Requirements](#requirements) 17 | * [Testing Locally](#testing-locally) 18 | * [Training on the Cloud](#training-on-features) 19 | * [Evaluation and Inference](#evaluation-and-inference) 20 | * [Inference Using Batch Prediction](#inference-using-batch-prediction) 21 | * [Accessing Files on Google Cloud](#accessing-files-on-google-cloud) 22 | * [Using Larger Machine Types](#using-larger-machine-types) 23 | * [Running on Your Own Machine](#running-on-your-own-machine) 24 | * [Requirements](#requirements-1) 25 | * [Overview of Models](#overview-of-models) 26 | * [Overview of Files](#overview-of-files) 27 | * [Training](#training) 28 | * [Evaluation](#evaluation) 29 | * [Inference](#inference) 30 | * [Misc](#misc) 31 | * [TODO for paticiapnts](#todo-for-participants) 32 | * [Etc](#etc) 33 | * [About This Project](#about-this-project) 34 | 35 | ## Running on Google's Cloud Machine Learning Platform 36 | 37 | ### Requirements 38 | 39 | 먼저 여러분의 계정이 Google Cloud Platform을 사용할 수 있도록 세팅되어야 합니다. 40 | 본 [링크](https://cloud.google.com/ml/docs/how-tos/getting-set-up)를 따라서 계정 생성 및 설정을 진행해주세요. 41 | 42 | 또한, 다음 명령들을 실행하여 Python 2.7 이상, 그리고 TensorFlow 1.0.1 이상 버전이 설치되어 있는지 확인하시기 바랍니다: 43 | 44 | ```sh 45 | python --version 46 | python -c 'import tensorflow as tf; print(tf.__version__)' 47 | ``` 48 | 49 | ### Testing Locally 50 | 다음 gcloud 명령들은 모두 *저장소 루트* 경로에서 실행되어야 합니다. 51 | 현재 디렉토리의 위치는 명령어 'ls'를 통하여 확인하실 수 있습니다. 52 | 53 | 모델 개발 도중, 간단한 이슈들을 확인하기 위해 클라우드에 올릴 필요 없이 바로 로컬에서 테스트 해 보실 수 있습니다. 54 | `gcloud beta ml local` 와 같은 명령어를 통해 사용하시면 되며, 예제는 다음과 같습니다: 55 | 56 | ```sh 57 | gcloud ml-engine local train \ 58 | --package-path=mnist --module-name=mnist.train -- \ 59 | --train_data_pattern='gs://kmlc_test_train_bucket/mnist/train.tfrecords' \ 60 | --train_dir=/tmp/kmlc_mnist_train --model=LogisticModel --start_new_model 61 | ``` 62 | 63 | 훈련 데이터를 현 디렉토리에 받는 법은 다음과 같습니다. 64 | 65 | ```sh 66 | gsutil cp gs://kmlc_test_train_bucket/mnist/train.tfrecords . 67 | ``` 68 | 69 | 파일을 다운로드 받고 나면, `train_data_pattern` 인자를 통하여 job에 해당 파일을 입력으로 사용할 수 있습니다.("gs://..." 경로를 통하여 원격에 있는 파일이 아닌, 로컬 파일을 지정할 수 있습니다.) 70 | 71 | 모델이 로컬에서 동작하는 것을 확인하셨다면, 아래 쓰인 방법을 통하여 클라우드에서도 시도해 보실 차례입니다. 72 | 73 | ### Training on the Cloud 74 | 75 | 이번 항목에서는 훈련 데이터와 테스트 파일들에 접근하기 위해 Google Cloud를 사용할 것입니다. 과거에 해당 계정이서 Google Cloud를 사용하신 적이 없다면 $300 상당의 Credit이 주어지게 됩니다. 다음 스텝들을 따라 가시면 프로젝트 초기화, 데이터 다운로드 및 Google Cloud ML상에서의 훈련 / 테스팅 등을 경험하실 수 있습니다. 76 | 77 | #### Set up your Google Cloud project 78 | 79 | [Link](https://mlchallenge2017.com/references-kr.html)에 쓰인 프로젝트 생성 및 개발환경 설정을 따라주시기 바랍니다. 80 | 81 | #### Running training 82 | 83 | 다음 명령어는 해당 모델을 Google Cloud 위에서 훈련시킬 것입니다. 다음 명령어들은 Google Cloud Shell위에서 실행되어야 합니다. 84 | 85 | ```sh 86 | BUCKET_NAME=gs://${USER}_kmlc_mnist_train_bucket 87 | # (One Time) Create a storage bucket to store training logs and checkpoints. 88 | gsutil mb -l us-east1 $BUCKET_NAME 89 | # Submit the training job. 90 | JOB_NAME=kmlc_mnist_train_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \ 91 | submit training $JOB_NAME \ 92 | --package-path=mnist --module-name=mnist.train \ 93 | --staging-bucket=$BUCKET_NAME --region=us-east1 \ 94 | --config=mnist/cloudml-gpu.yaml \ 95 | -- --train_data_pattern='gs://kmlc_test_train_bucket/mnist/train.tfrecords' \ 96 | --model=LogisticModel \ 97 | --train_dir=$BUCKET_NAME/kmlc_mnist_train_logistic_model 98 | ``` 99 | 100 | 위 `gsutil` 명령어에서 'package-path' 플래그는 'train.py' 스크립트를 포함하고 있는 경로를 의미하며, 동시에 Cloud worker에 업로드 될 패키지를 의미하기도 합니다. 'module-name'은 실행되어야 할 파이선 스크립트를 지정하는 플래그입니다.(본 예제에서는 train module을 사용하고 있습니다.) 101 | 102 | Google Cloud에서 업로드 후 job들이 실행되기까지는 잠시 시간이 걸립니다. 103 | 실행이 되고 나면 다음과 같은 메세지들을 보실 수 있습니다: 104 | 105 | ``` 106 | training step 270| Hit@1: 0.68 PERR: 0.52 Loss: 638.453 107 | training step 271| Hit@1: 0.66 PERR: 0.49 Loss: 635.537 108 | training step 272| Hit@1: 0.70 PERR: 0.52 Loss: 637.564 109 | ``` 110 | 111 | 작업 도중 "ctrl-c" 를 눌러 콘솔에서 접속을 끊을 수 있습니다. 모델은 클라우드 상에서 독립적으로 계속 훈련이 진행되며, 해당 job에 대한 진행 상황을 확인하거나 멈추는 등의 작업은 [Google Cloud ML Jobs console](https://console.cloud.google.com/ml/jobs) 를 통해서 하실 수 있습니다. 112 | 113 | 여러 job들을 동시에 훈련시킬 수도 있으며, tensorboard를 통해서 모델들의 퍼포먼스를 시각화하여 보실 수 있습니다. 114 | 115 | ```sh 116 | tensorboard --logdir=$BUCKET_NAME --port=8080 117 | ``` 118 | 119 | Tensorboard가 실행되고 나면 다음 명령어를 통해서 Tensorboard를 보실 수 있습니다: [http://localhost:8080](http://localhost:8080) 120 | 만약 Google Cloud Shell에서 실행하셨다면 콘솔 위쪽에 있는 Web Preview 버튼을 누르고 "Preview on port 8080" 메뉴를 통해서 Tensorboard를 보실 수 있습니다. 121 | 122 | ### Evaluation and Inference 123 | 다음은 모델을 Validation dataset위에서 평가하는 방법입니다: 124 | 125 | ```sh 126 | JOB_TO_EVAL=kmlc_mnist_train_logistic_model 127 | JOB_NAME=kmlc_mnist_eval_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \ 128 | submit training $JOB_NAME \ 129 | --package-path=mnist --module-name=mnist.eval \ 130 | --staging-bucket=$BUCKET_NAME --region=us-east1 \ 131 | --config=mnist/cloudml-gpu.yaml \ 132 | -- --eval_data_pattern='gs://kmlc_test_train_bucket/mnist/validation.tfrecords' \ 133 | --model=LogisticModel \ 134 | --train_dir=$BUCKET_NAME/${JOB_TO_EVAL} --run_once=True 135 | ``` 136 | 137 | 다음은 모델을 Test set위에서 실행하는 방법입니다: 138 | 139 | ```sh 140 | JOB_TO_EVAL=kmlc_mnist_train_logistic_model 141 | JOB_NAME=kmlc_mnist_inference_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \ 142 | submit training $JOB_NAME \ 143 | --package-path=mnist --module-name=mnist.inference \ 144 | --staging-bucket=$BUCKET_NAME --region=us-east1 \ 145 | --config=mnist/cloudml-gpu.yaml \ 146 | -- --input_data_pattern='gs://kmlc_test_train_bucket/mnist/test.tfrecords' \ 147 | --train_dir=$BUCKET_NAME/${JOB_TO_EVAL} \ 148 | --output_file=$BUCKET_NAME/${JOB_TO_EVAL}/predictions.csv 149 | ``` 150 | 151 | 위 gcloud 명령어에서 'training'이라는 부분이 다소 착각의 여지가 있습니다. 이름과는 다르게, 'training'이란 명령어는 클라우드가 호스팅하는 Python/Tensorflow 서비스를 제공하는 일을 합니다. Cloud Platform의 관점에서 보면 training과 inference는 아무 차이가 없는 일이기 때문입니다. Cloud ML Platform은 Tensorflow를 통한 예측을 위한 특별한 기능들을 제공하기는 하나 본 문서에서는 다루지 않겠습니다. 152 | 153 | Job들이 시작되고 나면 Evaluation 코드에 대해서는 다음과 같은 메세지를 보시게 됩니다: 154 | 155 | ``` 156 | examples_processed: 1024 | global_step 447044 | Batch Hit@1: 0.782 | Batch PERR: 0.637 | Batch Loss: 7.821 | Examples_per_sec: 834.658 157 | ``` 158 | 159 | Inference 코드에 대해서는 다음과 같은 메세지를 보시게 됩니다: 160 | 161 | ``` 162 | num examples processed: 8192 elapsed seconds: 14.85 163 | ``` 164 | 165 | ### Inference Using Batch Prediction 166 | Inference를 좀 더 빠르게 진행하기 위하여 Cloud ML Batch Prediction을 사용하실 수 있습니다. 167 | 168 | 먼저, Training job이 모델을 내보내는 디렉토리를 찾아주세요: 169 | 170 | ``` 171 | gsutil list ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export 172 | ``` 173 | 174 | 다음과 비슷한 출력을 보시게 될 겁니다: 175 | 176 | ``` 177 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/ 178 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_1/ 179 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_1001/ 180 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_2001/ 181 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_3001/ 182 | ``` 183 | 184 | 가장 최근에 저장된 모델을 선택하세요. 우리는 3001번째 step에 저장된 모델을 가지고 진행하도록 하겠습니다: 185 | 186 | ``` 187 | EXPORTED_MODEL_DIR=${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_3001/ 188 | ``` 189 | 190 | 다음 명령어를 사용하여 Batch Prediction을 시작하세요: 191 | 192 | ``` 193 | JOB_NAME=kmlc_mnist_batch_predict_$(date +%Y%m%d_%H%M%S); \ 194 | gcloud ml-engine jobs submit prediction ${JOB_NAME} --verbosity=debug \ 195 | --model-dir=${EXPORTED_MODEL_DIR} --data-format=TF_RECORD \ 196 | --input-paths='gs://kmlc_test_train_bucket/mnist/test.tfrecords' \ 197 | --output-path=${BUCKET_NAME}/batch_predict/${JOB_NAME}.csv --region=us-east1 \ 198 | --runtime-version=1.2 --max-worker-count=10 199 | ``` 200 | 201 | 이후 진척 상황은 [Google Cloud ML Jobs console](https://console.cloud.google.com/ml/jobs)에서 확인하실 수 있습니다. Job을 좀 더 빠르게 진행하기를 원하시면 'max-worker-count'를 증가시키시면 됩니다. 202 | 203 | Batch Prediction Job이 끝나면 다음 명령어를 통해서 결과물을 CSV 형식으로 저장해주세요: 204 | 205 | ``` 206 | # Copy the output of the batch prediction job to a local directory 207 | mkdir -p /tmp/batch_predict/${JOB_NAME} 208 | gsutil -m cp -r ${BUCKET_NAME}/batch_predict/${JOB_NAME}.csv /tmp/batch_predict/${JOB_NAME}.csv 209 | ``` 210 | 211 | /tmp/batch_predict/${JOB_NAME}.csv 를 [Kaggle](https://inclass.kaggle.com/c/mnist-tutorial-machine-learning-challenge)에 제출해주세요. 212 | 213 | 214 | ### Accessing Files on Google Cloud 215 | 216 | [Google Cloud storage browser](https://console.cloud.google.com/storage/browser)을 통해서 이전에 생성한 storage bucket을 직접 조회할 수 있습니다. 해당 버킷에 저장된 Trained model, CSV 파일 등을 직접 조회할 수 있습니다. 217 | 218 | 다른 방법으로는 'gsutil' 명령어를 통해서 파일을 직접 다운로드 받을 수도 있습니다. 219 | 예를 들어 방금 생성한 Prediction CSV를 로컬 머신으로 다운로드 받고자 한다면 다음 명령어를 실행하시면 됩니다: 220 | 221 | ``` 222 | gsutil cp $BUCKET_NAME/${JOB_TO_EVAL}/predictions.csv . 223 | ``` 224 | 225 | ### Using Larger Machine Types 226 | 몇몇 복잡한 모델들은 단일 GPU로 훈련시 모델 수렴까지 일주일 이상이 걸리기도 합니다. 이러한 모델들은 좀 더 많은 GPU가 있는 강력한 머신으로 좀 더 빠르게 훈련이 가능합니다. 4개의 GPU가 있는 머신을 사용하하기 위해서는, '--config' 플래그를 'mnist/cloudml-4gpu.yaml'로 변경해주세요. 이 플래그는 소모하는 Cloud credit 역시 4배로 늘어난다는 점 주의하시기 바랍니다. 227 | 228 | ## Running on Your Own Machine 229 | 230 | ### Requirements 231 | 232 | 시작하기 전 Tensorflow가 준비되어 있어야 합니다. 만약 아직 설치하지 않으셨다면 [tensorflow.org](https://www.tensorflow.org/install/)에 쓰인 설명을 따라주세요. 233 | 234 | 다음 명령어를 통해 Python 2.7 이상, 그리고 Tensorflow 1.0.1 이상이 설치되어 있는지 확인하시기 바랍니다: 235 | 236 | ```sh 237 | python --version 238 | python -c 'import tensorflow as tf; print(tf.__version__)' 239 | ``` 240 | 241 | Downloading files 242 | ``` sh 243 | gsutil cp gs://kmlc_test_train_bucket/mnist/train* . 244 | gsutil cp gs://kmlc_test_train_bucket/mnist/test* . 245 | gsutil cp gs://kmlc_test_train_bucket/mnist/validation* . 246 | ``` 247 | 248 | Training 249 | ```sh 250 | python train.py --train_data_pattern='/path/to/training/files/*' --train_dir=/tmp/mnist_train --model=LogisticModel --start_new_model 251 | ``` 252 | 253 | Validation 254 | ```sh 255 | python eval.py --eval_data_pattern='/path/to/validation/files' --train_dir=/tmp/mnist_train --model=LogisticModel --run_once=True 256 | ``` 257 | 258 | Generating submission 259 | ```sh 260 | python inference.py --output_file=/path/to/predictions.csv --input_data_pattern='/path/to/test/files/*' --train_dir=/tmp/mnist_train 261 | ``` 262 | 263 | ## Overview of Models 264 | 265 | 예제 코드는 Logistic model의 구현체를 담고 있습니다: 266 | 267 | * `[LogisticModel](https://ko.wikipedia.org/wiki/%EB%A1%9C%EC%A7%80%EC%8A%A4%ED%8B%B1_%ED%9A%8C%EA%B7%80)`: 독립 변수의 선형 결합을 이용하여 사건의 발생 가능성을 예측하는데 사용되는 통계 기법입니다. 268 | 269 | ## Overview of Files 270 | 271 | ### Training 272 | * `train.py`: 훈련을 위한 전달인자와 과정을 정의합니다. 변경 가능한 전달인자로는 훈련 데이터셋의 위치, 훈련에 사용될 모델, 배치 사이즈, 로스 함수, 학습 레이트 등이 있습니다. 모델에 따라 get_input_data_tensors()를 변경하여 데이터가 셔플되는 과정을 수정하실 수도 있습니다. 273 | * `losses.py`: 로스 함수를 정의합니다. losses.py에 정의된 어떤 로스 함수도 train.py에서 사용하실 수 있습니다. 274 | * `models.py`: 모델을 정의하기 위한 Base class를 포함하고 있습니다. 275 | * `mnist_models.py`: 인풋에서 관찰해야 하는 특성들을 입력으로 받을 수 있는 모델에 대한 정의가 있습니다. 여러분은 여러분만의 모델 또한 이 곳에 정의하셔야 합니다. train.py를 호출할 떄 인자로 --model=YourModelName 을 전달하여 여러분의 모델을 부를 수 있습니다. 276 | * `export_model.py`: 배치 예측 작업에서 쓰일 모델을 추출하기 위한 클래스를 제공합니다. 277 | * `readers.py`: 데이터에 대한 정의와 인풋 데이터가 어떤 식으로 준비되어야 하는지를 표기하고 있습니다. prepare_serialized_examples()를 변경함으로써 인풋 파일을 프리프로세스 할 수도 있습니다. 예를 들면 데이터를 리사이즈하거나 랜덤 노이즈를 섞을 수 있습니다. 278 | 279 | ### Evaluation 280 | * `eval.py`: 모델을 평가하기 위한 기본적인 스크립트입니다. 모델이 훈련되고 나면 --train_dir=/path/to/model 과 --model=YourModelName 인자와 함께 eval.py를 호출하여 train_dir 경로 안에 있는 모델을 불러올 수 있습니다. 보통은 이 파일을 변경할 필요는 없습니다. 281 | * `eval_util.py`: 모든 종류의 평가 메트릭들을 계산하는 클래스를 제공합니다. 282 | * `average_precision_calculator.py`: Average precision을 계산하는 함수들을 제공합니다. 283 | * `mean_average_precision_calculator.py`: Mean average precision을 계산하는 함수들을 제공합니다. 284 | 285 | ### Inference 286 | * `inference.py`: inference.py를 호출하여 테스트 데이터에 대한 예측 레이블들을 생성하세요. Supervised learning 문제에서 채점은 Predicted label의 정확도에 기반하여 진행됩니다. 287 | 288 | ### Misc 289 | * `README.md`: 이 문서입니다. 290 | * `utils.py`: 일반적인 유틸리티 함수입니다. 291 | * `convert_prediction_from_json_to_csv.py`: 배치 prediction의 JSON 출력 파일을 CSV로 변경합니다. 292 | 293 | ## TODO for participants 294 | * mnist_models.py 안에 여러분만의 모델을 생성하세요. 295 | * losses.py 안에 로스 함수를 생성하세요. 296 | * 필요하다면, readers.py 안에 입력 프리프로세싱을 추가해주세요 297 | * train.py 파일 안에서 batch size나 learning rate와 같은 parameter들을 조절하고, 필요하다면 training procedure도 수정해주세요. 298 | * 여러분의 모델 이름과 로스 함수를 이용해서 train.py를 호출하여 모델을 훈련하세요. 299 | * eval.py를 호출하여 검증 데이터 위에서 훈련된 모델의 성능을 검증하세요. 300 | * inference.py를 통하여 훈련된 모델을 테스트 데이터에 적용하고, 결과 예측 레이블 정보를 채점을 위해 [Kaggle](https://inclass.kaggle.com/c/mnist-tutorial-machine-learning-challenge)에 제출하세요. 301 | * 본 시나리오는 MLC에 참가하기 위한 가장 쉬운 시나리오 중 하나이며, 여러분만의 솔루션과 더 나은 성능을 위해서라면 어떤 파일을 수정하셔도 무방합니다. 302 | 303 | ## Etc 304 | * [Tensorflow MNIST Tutorial](https://www.tensorflow.org/get_started/mnist/beginners) 305 | 306 | ## About This Project 307 | This project is meant help people quickly get started working with the 308 | [mnist](https://inclass.kaggle.com/c/mnist-tutorial-machine-learning-challenge) dataset. 309 | This is not an official Google product. 310 | -------------------------------------------------------------------------------- /mnist/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS-IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | -------------------------------------------------------------------------------- /mnist/average_precision_calculator.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS-IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Calculate or keep track of the interpolated average precision. 16 | 17 | It provides an interface for calculating interpolated average precision for an 18 | entire list or the top-n ranked items. For the definition of the 19 | (non-)interpolated average precision: 20 | http://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf 21 | 22 | Example usages: 23 | 1) Use it as a static function call to directly calculate average precision for 24 | a short ranked list in the memory. 25 | 26 | ``` 27 | import random 28 | 29 | p = np.array([random.random() for _ in xrange(10)]) 30 | a = np.array([random.choice([0, 1]) for _ in xrange(10)]) 31 | 32 | ap = average_precision_calculator.AveragePrecisionCalculator.ap(p, a) 33 | ``` 34 | 35 | 2) Use it as an object for long ranked list that cannot be stored in memory or 36 | the case where partial predictions can be observed at a time (Tensorflow 37 | predictions). In this case, we first call the function accumulate many times 38 | to process parts of the ranked list. After processing all the parts, we call 39 | peek_interpolated_ap_at_n. 40 | ``` 41 | p1 = np.array([random.random() for _ in xrange(5)]) 42 | a1 = np.array([random.choice([0, 1]) for _ in xrange(5)]) 43 | p2 = np.array([random.random() for _ in xrange(5)]) 44 | a2 = np.array([random.choice([0, 1]) for _ in xrange(5)]) 45 | 46 | # interpolated average precision at 10 using 1000 break points 47 | calculator = average_precision_calculator.AveragePrecisionCalculator(10) 48 | calculator.accumulate(p1, a1) 49 | calculator.accumulate(p2, a2) 50 | ap3 = calculator.peek_ap_at_n() 51 | ``` 52 | """ 53 | 54 | import heapq 55 | import random 56 | import numbers 57 | 58 | import numpy 59 | 60 | 61 | class AveragePrecisionCalculator(object): 62 | """Calculate the average precision and average precision at n.""" 63 | 64 | def __init__(self, top_n=None): 65 | """Construct an AveragePrecisionCalculator to calculate average precision. 66 | 67 | This class is used to calculate the average precision for a single label. 68 | 69 | Args: 70 | top_n: A positive Integer specifying the average precision at n, or 71 | None to use all provided data points. 72 | 73 | Raises: 74 | ValueError: An error occurred when the top_n is not a positive integer. 75 | """ 76 | if not ((isinstance(top_n, int) and top_n >= 0) or top_n is None): 77 | raise ValueError("top_n must be a positive integer or None.") 78 | 79 | self._top_n = top_n # average precision at n 80 | self._total_positives = 0 # total number of positives have seen 81 | self._heap = [] # max heap of (prediction, actual) 82 | 83 | @property 84 | def heap_size(self): 85 | """Gets the heap size maintained in the class.""" 86 | return len(self._heap) 87 | 88 | @property 89 | def num_accumulated_positives(self): 90 | """Gets the number of positive samples that have been accumulated.""" 91 | return self._total_positives 92 | 93 | def accumulate(self, predictions, actuals, num_positives=None): 94 | """Accumulate the predictions and their ground truth labels. 95 | 96 | After the function call, we may call peek_ap_at_n to actually calculate 97 | the average precision. 98 | Note predictions and actuals must have the same shape. 99 | 100 | Args: 101 | predictions: a list storing the prediction scores. 102 | actuals: a list storing the ground truth labels. Any value 103 | larger than 0 will be treated as positives, otherwise as negatives. 104 | num_positives = If the 'predictions' and 'actuals' inputs aren't complete, 105 | then it's possible some true positives were missed in them. In that case, 106 | you can provide 'num_positives' in order to accurately track recall. 107 | 108 | Raises: 109 | ValueError: An error occurred when the format of the input is not the 110 | numpy 1-D array or the shape of predictions and actuals does not match. 111 | """ 112 | if len(predictions) != len(actuals): 113 | raise ValueError("the shape of predictions and actuals does not match.") 114 | 115 | if not num_positives is None: 116 | if not isinstance(num_positives, numbers.Number) or num_positives < 0: 117 | raise ValueError("'num_positives' was provided but it wan't a nonzero number.") 118 | 119 | if not num_positives is None: 120 | self._total_positives += num_positives 121 | else: 122 | self._total_positives += numpy.size(numpy.where(actuals > 0)) 123 | topk = self._top_n 124 | heap = self._heap 125 | 126 | for i in range(numpy.size(predictions)): 127 | if topk is None or len(heap) < topk: 128 | heapq.heappush(heap, (predictions[i], actuals[i])) 129 | else: 130 | if predictions[i] > heap[0][0]: # heap[0] is the smallest 131 | heapq.heappop(heap) 132 | heapq.heappush(heap, (predictions[i], actuals[i])) 133 | 134 | def clear(self): 135 | """Clear the accumulated predictions.""" 136 | self._heap = [] 137 | self._total_positives = 0 138 | 139 | def peek_ap_at_n(self): 140 | """Peek the non-interpolated average precision at n. 141 | 142 | Returns: 143 | The non-interpolated average precision at n (default 0). 144 | If n is larger than the length of the ranked list, 145 | the average precision will be returned. 146 | """ 147 | if self.heap_size <= 0: 148 | return 0 149 | predlists = numpy.array(list(zip(*self._heap))) 150 | 151 | ap = self.ap_at_n(predlists[0], 152 | predlists[1], 153 | n=self._top_n, 154 | total_num_positives=self._total_positives) 155 | return ap 156 | 157 | @staticmethod 158 | def ap(predictions, actuals): 159 | """Calculate the non-interpolated average precision. 160 | 161 | Args: 162 | predictions: a numpy 1-D array storing the sparse prediction scores. 163 | actuals: a numpy 1-D array storing the ground truth labels. Any value 164 | larger than 0 will be treated as positives, otherwise as negatives. 165 | 166 | Returns: 167 | The non-interpolated average precision at n. 168 | If n is larger than the length of the ranked list, 169 | the average precision will be returned. 170 | 171 | Raises: 172 | ValueError: An error occurred when the format of the input is not the 173 | numpy 1-D array or the shape of predictions and actuals does not match. 174 | """ 175 | return AveragePrecisionCalculator.ap_at_n(predictions, 176 | actuals, 177 | n=None) 178 | 179 | @staticmethod 180 | def ap_at_n(predictions, actuals, n=20, total_num_positives=None): 181 | """Calculate the non-interpolated average precision. 182 | 183 | Args: 184 | predictions: a numpy 1-D array storing the sparse prediction scores. 185 | actuals: a numpy 1-D array storing the ground truth labels. Any value 186 | larger than 0 will be treated as positives, otherwise as negatives. 187 | n: the top n items to be considered in ap@n. 188 | total_num_positives : (optionally) you can specify the number of total 189 | positive 190 | in the list. If specified, it will be used in calculation. 191 | 192 | Returns: 193 | The non-interpolated average precision at n. 194 | If n is larger than the length of the ranked list, 195 | the average precision will be returned. 196 | 197 | Raises: 198 | ValueError: An error occurred when 199 | 1) the format of the input is not the numpy 1-D array; 200 | 2) the shape of predictions and actuals does not match; 201 | 3) the input n is not a positive integer. 202 | """ 203 | if len(predictions) != len(actuals): 204 | raise ValueError("the shape of predictions and actuals does not match.") 205 | 206 | if n is not None: 207 | if not isinstance(n, int) or n <= 0: 208 | raise ValueError("n must be 'None' or a positive integer." 209 | " It was '%s'." % n) 210 | 211 | ap = 0.0 212 | 213 | predictions = numpy.array(predictions) 214 | actuals = numpy.array(actuals) 215 | 216 | # add a shuffler to avoid overestimating the ap 217 | predictions, actuals = AveragePrecisionCalculator._shuffle(predictions, 218 | actuals) 219 | sortidx = sorted( 220 | range(len(predictions)), 221 | key=lambda k: predictions[k], 222 | reverse=True) 223 | 224 | if total_num_positives is None: 225 | numpos = numpy.size(numpy.where(actuals > 0)) 226 | else: 227 | numpos = total_num_positives 228 | 229 | if numpos == 0: 230 | return 0 231 | 232 | if n is not None: 233 | numpos = min(numpos, n) 234 | delta_recall = 1.0 / numpos 235 | poscount = 0.0 236 | 237 | # calculate the ap 238 | r = len(sortidx) 239 | if n is not None: 240 | r = min(r, n) 241 | for i in range(r): 242 | if actuals[sortidx[i]] > 0: 243 | poscount += 1 244 | ap += poscount / (i + 1) * delta_recall 245 | return ap 246 | 247 | @staticmethod 248 | def _shuffle(predictions, actuals): 249 | random.seed(0) 250 | suffidx = random.sample(range(len(predictions)), len(predictions)) 251 | predictions = predictions[suffidx] 252 | actuals = actuals[suffidx] 253 | return predictions, actuals 254 | 255 | @staticmethod 256 | def _zero_one_normalize(predictions, epsilon=1e-7): 257 | """Normalize the predictions to the range between 0.0 and 1.0. 258 | 259 | For some predictions like SVM predictions, we need to normalize them before 260 | calculate the interpolated average precision. The normalization will not 261 | change the rank in the original list and thus won't change the average 262 | precision. 263 | 264 | Args: 265 | predictions: a numpy 1-D array storing the sparse prediction scores. 266 | epsilon: a small constant to avoid denominator being zero. 267 | 268 | Returns: 269 | The normalized prediction. 270 | """ 271 | denominator = numpy.max(predictions) - numpy.min(predictions) 272 | ret = (predictions - numpy.min(predictions)) / numpy.max(denominator, 273 | epsilon) 274 | return ret 275 | -------------------------------------------------------------------------------- /mnist/cloudml-4gpu.yaml: -------------------------------------------------------------------------------- 1 | trainingInput: 2 | scaleTier: CUSTOM 3 | masterType: complex_model_m_gpu 4 | runtimeVersion: "1.2" 5 | -------------------------------------------------------------------------------- /mnist/cloudml-gpu-distributed.yaml: -------------------------------------------------------------------------------- 1 | trainingInput: 2 | runtimeVersion: "1.2" 3 | scaleTier: CUSTOM 4 | masterType: standard_gpu 5 | workerCount: 2 6 | workerType: standard_gpu 7 | parameterServerCount: 2 8 | parameterServerType: standard 9 | -------------------------------------------------------------------------------- /mnist/cloudml-gpu.yaml: -------------------------------------------------------------------------------- 1 | trainingInput: 2 | scaleTier: CUSTOM 3 | masterType: standard_gpu 4 | runtimeVersion: "1.2" 5 | -------------------------------------------------------------------------------- /mnist/convert_prediction_from_json_to_csv.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS-IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Utility to convert the output of batch prediction into a CSV submission. 16 | 17 | It converts the JSON files created by the command 18 | 'gcloud beta ml jobs submit prediction' into a CSV file ready for submission. 19 | """ 20 | 21 | import json 22 | import tensorflow as tf 23 | 24 | from builtins import range 25 | from tensorflow import app 26 | from tensorflow import flags 27 | from tensorflow import gfile 28 | from tensorflow import logging 29 | 30 | 31 | FLAGS = flags.FLAGS 32 | 33 | if __name__ == '__main__': 34 | 35 | flags.DEFINE_string( 36 | "json_prediction_files_pattern", None, 37 | "Pattern specifying the list of JSON files that the command " 38 | "'gcloud beta ml jobs submit prediction' outputs. These files are " 39 | "located in the output path of the prediction command and are prefixed " 40 | "with 'prediction.results'.") 41 | flags.DEFINE_string( 42 | "csv_output_file", None, 43 | "The file to save the predictions converted to the CSV format.") 44 | 45 | 46 | def get_csv_header(): 47 | return "Id,Category\n" 48 | 49 | def to_csv_row(json_data): 50 | 51 | video_id = json_data["video_id"] 52 | 53 | class_indexes = json_data["class_indexes"] 54 | predictions = json_data["predictions"] 55 | 56 | if isinstance(video_id, list): 57 | video_id = video_id[0] 58 | class_indexes = class_indexes[0] 59 | predictions = predictions[0] 60 | 61 | if len(class_indexes) != len(predictions): 62 | raise ValueError( 63 | "The number of indexes (%s) and predictions (%s) must be equal." 64 | % (len(class_indexes), len(predictions))) 65 | 66 | return (video_id.decode('utf-8') + "," + " ".join("%i %f" % 67 | (class_indexes[i], predictions[i]) 68 | for i in range(len(class_indexes))) + "\n") 69 | 70 | def main(unused_argv): 71 | logging.set_verbosity(tf.logging.INFO) 72 | 73 | if not FLAGS.json_prediction_files_pattern: 74 | raise ValueError( 75 | "The flag --json_prediction_files_pattern must be specified.") 76 | 77 | if not FLAGS.csv_output_file: 78 | raise ValueError("The flag --csv_output_file must be specified.") 79 | 80 | logging.info("Looking for prediction files with pattern: %s", 81 | FLAGS.json_prediction_files_pattern) 82 | 83 | file_paths = gfile.Glob(FLAGS.json_prediction_files_pattern) 84 | logging.info("Found files: %s", file_paths) 85 | 86 | logging.info("Writing submission file to: %s", FLAGS.csv_output_file) 87 | with gfile.Open(FLAGS.csv_output_file, "w+") as output_file: 88 | output_file.write(get_csv_header()) 89 | 90 | for file_path in file_paths: 91 | logging.info("processing file: %s", file_path) 92 | 93 | with gfile.Open(file_path) as input_file: 94 | 95 | for line in input_file: 96 | json_data = json.loads(line) 97 | output_file.write(to_csv_row(json_data)) 98 | 99 | output_file.flush() 100 | logging.info("done") 101 | 102 | if __name__ == "__main__": 103 | app.run() 104 | -------------------------------------------------------------------------------- /mnist/eval.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS-IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """Binary for evaluating Tensorflow models on the YouTube-8M dataset.""" 15 | 16 | import time 17 | 18 | import eval_util 19 | import mnist_models 20 | import losses 21 | import readers 22 | import tensorflow as tf 23 | from tensorflow import app 24 | from tensorflow import flags 25 | from tensorflow import gfile 26 | from tensorflow import logging 27 | import utils 28 | 29 | FLAGS = flags.FLAGS 30 | 31 | if __name__ == "__main__": 32 | # Dataset flags. 33 | flags.DEFINE_string("train_dir", "/tmp/mnist_model/", 34 | "The directory to load the model files from. " 35 | "The tensorboard metrics files are also saved to this " 36 | "directory.") 37 | flags.DEFINE_string( 38 | "eval_data_pattern", "", 39 | "File glob defining the evaluation dataset in tensorflow.SequenceExample " 40 | "format. The SequenceExamples are expected to have an 'rgb' byte array " 41 | "sequence feature as well as a 'labels' int64 context feature.") 42 | 43 | flags.DEFINE_string( 44 | "model", "LogisticModel", "Which Model to use: see mnist_models.py") 45 | 46 | flags.DEFINE_integer("batch_size", 512, 47 | "How many examples to process per batch.") 48 | 49 | flags.DEFINE_string("label_loss", "CrossEntropyLoss", 50 | "Loss computed on validation data") 51 | 52 | # Other flags. 53 | flags.DEFINE_integer("num_readers", 8, 54 | "How many threads to use for reading input files.") 55 | 56 | flags.DEFINE_boolean("run_once", False, "Whether to run eval only once.") 57 | 58 | 59 | def find_class_by_name(name, modules): 60 | """Searches the provided modules for the named class and returns it.""" 61 | modules = [getattr(module, name, None) for module in modules] 62 | return next(a for a in modules if a) 63 | 64 | 65 | def get_input_evaluation_tensors(reader, 66 | data_pattern, 67 | batch_size=1024, 68 | num_readers=1): 69 | """Creates the section of the graph which reads the evaluation data. 70 | 71 | Args: 72 | reader: A class which parses the training data. 73 | data_pattern: A 'glob' style path to the data files. 74 | batch_size: How many examples to process at a time. 75 | num_readers: How many I/O threads to use. 76 | 77 | Returns: 78 | A tuple containing the ids, images, and labels 79 | 80 | Raises: 81 | IOError: If no files matching the given pattern were found. 82 | """ 83 | logging.info("Using batch size of " + str(batch_size) + " for evaluation.") 84 | with tf.name_scope("eval_input"): 85 | files = gfile.Glob(data_pattern) 86 | if not files: 87 | raise IOError("Unable to find the evaluation files.") 88 | logging.info("number of evaluation files: " + str(len(files))) 89 | filename_queue = tf.train.string_input_producer( 90 | files, shuffle=False, num_epochs=1) 91 | eval_data = [ 92 | reader.prepare_reader(filename_queue) for _ in range(num_readers) 93 | ] 94 | return tf.train.batch_join( 95 | eval_data, 96 | batch_size=batch_size, 97 | capacity=3 * batch_size, 98 | allow_smaller_final_batch=True, 99 | enqueue_many=True) 100 | 101 | 102 | def build_graph(reader, 103 | model, 104 | eval_data_pattern, 105 | label_loss_fn, 106 | batch_size=1024, 107 | num_readers=1): 108 | """Creates the Tensorflow graph for evaluation. 109 | 110 | Args: 111 | reader: The data file reader. It should inherit from BaseReader. 112 | model: The core model (e.g. logistic or neural net). It should inherit 113 | from BaseModel. 114 | eval_data_pattern: glob path to the evaluation data files. 115 | label_loss_fn: What kind of loss to apply to the model. It should inherit 116 | from BaseLoss. 117 | batch_size: How many examples to process at a time. 118 | num_readers: How many threads to use for I/O operations. 119 | """ 120 | 121 | global_step = tf.Variable(0, trainable=False, name="global_step") 122 | model_input_raw, labels_batch = get_input_evaluation_tensors( # pylint: disable=g-line-too-long 123 | reader, 124 | eval_data_pattern, 125 | batch_size=batch_size) 126 | tf.summary.histogram("model_input_raw", model_input_raw) 127 | 128 | model_input = model_input_raw 129 | 130 | with tf.variable_scope("tower"): 131 | result = model.create_model(model_input, 132 | vocab_size=reader.num_classes, 133 | labels=labels_batch, 134 | is_training=False) 135 | predictions = result["predictions"] 136 | tf.summary.histogram("model_activations", predictions) 137 | if "loss" in result.keys(): 138 | label_loss = result["loss"] 139 | else: 140 | label_loss = label_loss_fn.calculate_loss(predictions, labels_batch) 141 | 142 | tf.add_to_collection("global_step", global_step) 143 | tf.add_to_collection("loss", label_loss) 144 | tf.add_to_collection("predictions", predictions) 145 | tf.add_to_collection("input_batch", model_input) 146 | tf.add_to_collection("labels", tf.cast(labels_batch, tf.float32)) 147 | tf.add_to_collection("summary_op", tf.summary.merge_all()) 148 | 149 | 150 | def evaluation_loop(prediction_batch, label_batch, loss, 151 | summary_op, saver, summary_writer, evl_metrics, 152 | last_global_step_val): 153 | """Run the evaluation loop once. 154 | 155 | Args: 156 | prediction_batch: a tensor of predictions mini-batch. 157 | label_batch: a tensor of label_batch mini-batch. 158 | loss: a tensor of loss for the examples in the mini-batch. 159 | summary_op: a tensor which runs the tensorboard summary operations. 160 | saver: a tensorflow saver to restore the model. 161 | summary_writer: a tensorflow summary_writer 162 | evl_metrics: an EvaluationMetrics object. 163 | last_global_step_val: the global step used in the previous evaluation. 164 | 165 | Returns: 166 | The global_step used in the latest model. 167 | """ 168 | 169 | global_step_val = -1 170 | with tf.Session() as sess: 171 | latest_checkpoint = tf.train.latest_checkpoint(FLAGS.train_dir) 172 | if latest_checkpoint: 173 | logging.info("Loading checkpoint for eval: " + latest_checkpoint) 174 | # Restores from checkpoint 175 | saver.restore(sess, latest_checkpoint) 176 | # Assuming model_checkpoint_path looks something like: 177 | # /my-favorite-path/yt8m_train/model.ckpt-0, extract global_step from it. 178 | global_step_val = latest_checkpoint.split("/")[-1].split("-")[-1] 179 | else: 180 | logging.info("No checkpoint file found.") 181 | return global_step_val 182 | 183 | if global_step_val == last_global_step_val: 184 | logging.info("skip this checkpoint global_step_val=%s " 185 | "(same as the previous one).", global_step_val) 186 | return global_step_val 187 | 188 | sess.run([tf.local_variables_initializer()]) 189 | 190 | # Start the queue runners. 191 | fetches = [prediction_batch, label_batch, loss, summary_op] 192 | coord = tf.train.Coordinator() 193 | try: 194 | threads = [] 195 | for qr in tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS): 196 | threads.extend(qr.create_threads( 197 | sess, coord=coord, daemon=True, 198 | start=True)) 199 | logging.info("enter eval_once loop global_step_val = %s. ", 200 | global_step_val) 201 | 202 | evl_metrics.clear() 203 | 204 | examples_processed = 0 205 | while not coord.should_stop(): 206 | batch_start_time = time.time() 207 | predictions_val, labels_val, loss_val, summary_val = sess.run( 208 | fetches) 209 | seconds_per_batch = time.time() - batch_start_time 210 | example_per_second = labels_val.shape[0] / seconds_per_batch 211 | examples_processed += labels_val.shape[0] 212 | 213 | iteration_info_dict = evl_metrics.accumulate(predictions_val, 214 | labels_val, loss_val) 215 | iteration_info_dict["examples_per_second"] = example_per_second 216 | 217 | iterinfo = utils.AddGlobalStepSummary( 218 | summary_writer, 219 | global_step_val, 220 | iteration_info_dict, 221 | summary_scope="Eval") 222 | logging.info("examples_processed: %d | %s", examples_processed, 223 | iterinfo) 224 | 225 | except tf.errors.OutOfRangeError as e: 226 | logging.info( 227 | "Done with batched inference. Now calculating global performance " 228 | "metrics.") 229 | # calculate the metrics for the entire epoch 230 | epoch_info_dict = evl_metrics.get() 231 | epoch_info_dict["epoch_id"] = global_step_val 232 | 233 | summary_writer.add_summary(summary_val, global_step_val) 234 | epochinfo = utils.AddEpochSummary( 235 | summary_writer, 236 | global_step_val, 237 | epoch_info_dict, 238 | summary_scope="Eval") 239 | logging.info(epochinfo) 240 | evl_metrics.clear() 241 | except Exception as e: # pylint: disable=broad-except 242 | logging.info("Unexpected exception: " + str(e)) 243 | coord.request_stop(e) 244 | 245 | coord.request_stop() 246 | coord.join(threads, stop_grace_period_secs=10) 247 | 248 | return global_step_val 249 | 250 | 251 | def evaluate(): 252 | tf.set_random_seed(0) # for reproducibility 253 | with tf.Graph().as_default(): 254 | reader = readers.MnistReader() 255 | 256 | model = find_class_by_name(FLAGS.model, 257 | [mnist_models])() 258 | label_loss_fn = find_class_by_name(FLAGS.label_loss, [losses])() 259 | 260 | if FLAGS.eval_data_pattern is "": 261 | raise IOError("'eval_data_pattern' was not specified. " + 262 | "Nothing to evaluate.") 263 | 264 | build_graph( 265 | reader=reader, 266 | model=model, 267 | eval_data_pattern=FLAGS.eval_data_pattern, 268 | label_loss_fn=label_loss_fn, 269 | num_readers=FLAGS.num_readers, 270 | batch_size=FLAGS.batch_size) 271 | logging.info("built evaluation graph") 272 | prediction_batch = tf.get_collection("predictions")[0] 273 | label_batch = tf.get_collection("labels")[0] 274 | loss = tf.get_collection("loss")[0] 275 | summary_op = tf.get_collection("summary_op")[0] 276 | 277 | saver = tf.train.Saver(tf.global_variables()) 278 | summary_writer = tf.summary.FileWriter( 279 | FLAGS.train_dir, graph=tf.get_default_graph()) 280 | 281 | evl_metrics = eval_util.EvaluationMetrics(reader.num_classes, 2) 282 | 283 | last_global_step_val = -1 284 | while True: 285 | last_global_step_val = evaluation_loop(prediction_batch, 286 | label_batch, loss, summary_op, 287 | saver, summary_writer, evl_metrics, 288 | last_global_step_val) 289 | if FLAGS.run_once: 290 | break 291 | 292 | 293 | def main(unused_argv): 294 | logging.set_verbosity(tf.logging.INFO) 295 | print("tensorflow version: %s" % tf.__version__) 296 | evaluate() 297 | 298 | 299 | if __name__ == "__main__": 300 | app.run() 301 | -------------------------------------------------------------------------------- /mnist/eval_util.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS-IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Provides functions to help with evaluating models.""" 16 | import datetime 17 | import numpy 18 | 19 | from tensorflow.python.platform import gfile 20 | 21 | import mean_average_precision_calculator as map_calculator 22 | import average_precision_calculator as ap_calculator 23 | 24 | def flatten(l): 25 | """ Merges a list of lists into a single list. """ 26 | return [item for sublist in l for item in sublist] 27 | 28 | def calculate_hit_at_one(predictions, actuals): 29 | """Performs a local (numpy) calculation of the hit at one. 30 | 31 | Args: 32 | predictions: Matrix containing the outputs of the model. 33 | Dimensions are 'batch' x 'num_classes'. 34 | actuals: Matrix containing the ground truth labels. 35 | Dimensions are 'batch' x 'num_classes'. 36 | 37 | Returns: 38 | float: The average hit at one across the entire batch. 39 | """ 40 | top_prediction = numpy.argmax(predictions, 1) 41 | hits = actuals[numpy.arange(actuals.shape[0]), top_prediction] 42 | return numpy.average(hits) 43 | 44 | 45 | def calculate_precision_at_equal_recall_rate(predictions, actuals): 46 | """Performs a local (numpy) calculation of the PERR. 47 | 48 | Args: 49 | predictions: Matrix containing the outputs of the model. 50 | Dimensions are 'batch' x 'num_classes'. 51 | actuals: Matrix containing the ground truth labels. 52 | Dimensions are 'batch' x 'num_classes'. 53 | 54 | Returns: 55 | float: The average precision at equal recall rate across the entire batch. 56 | """ 57 | aggregated_precision = 0.0 58 | num_videos = actuals.shape[0] 59 | for row in numpy.arange(num_videos): 60 | num_labels = int(numpy.sum(actuals[row])) 61 | top_indices = numpy.argpartition(predictions[row], 62 | -num_labels)[-num_labels:] 63 | item_precision = 0.0 64 | for label_index in top_indices: 65 | if predictions[row][label_index] > 0: 66 | item_precision += actuals[row][label_index] 67 | item_precision /= top_indices.size 68 | aggregated_precision += item_precision 69 | aggregated_precision /= num_videos 70 | return aggregated_precision 71 | 72 | def calculate_gap(predictions, actuals, top_k=20): 73 | """Performs a local (numpy) calculation of the global average precision. 74 | 75 | Only the top_k predictions are taken for each of the videos. 76 | 77 | Args: 78 | predictions: Matrix containing the outputs of the model. 79 | Dimensions are 'batch' x 'num_classes'. 80 | actuals: Matrix containing the ground truth labels. 81 | Dimensions are 'batch' x 'num_classes'. 82 | top_k: How many predictions to use per video. 83 | 84 | Returns: 85 | float: The global average precision. 86 | """ 87 | gap_calculator = ap_calculator.AveragePrecisionCalculator() 88 | sparse_predictions, sparse_labels, num_positives = top_k_by_class(predictions, actuals, top_k) 89 | gap_calculator.accumulate(flatten(sparse_predictions), flatten(sparse_labels), sum(num_positives)) 90 | return gap_calculator.peek_ap_at_n() 91 | 92 | 93 | def top_k_by_class(predictions, labels, k=20): 94 | """Extracts the top k predictions for each video, sorted by class. 95 | 96 | Args: 97 | predictions: A numpy matrix containing the outputs of the model. 98 | Dimensions are 'batch' x 'num_classes'. 99 | k: the top k non-zero entries to preserve in each prediction. 100 | 101 | Returns: 102 | A tuple (predictions,labels, true_positives). 'predictions' and 'labels' 103 | are lists of lists of floats. 'true_positives' is a list of scalars. The 104 | length of the lists are equal to the number of classes. The entries in the 105 | predictions variable are probability predictions, and 106 | the corresponding entries in the labels variable are the ground truth for 107 | those predictions. The entries in 'true_positives' are the number of true 108 | positives for each class in the ground truth. 109 | 110 | Raises: 111 | ValueError: An error occurred when the k is not a positive integer. 112 | """ 113 | if k <= 0: 114 | raise ValueError("k must be a positive integer.") 115 | k = min(k, predictions.shape[1]) 116 | num_classes = predictions.shape[1] 117 | prediction_triplets= [] 118 | for video_index in range(predictions.shape[0]): 119 | prediction_triplets.extend(top_k_triplets(predictions[video_index],labels[video_index], k)) 120 | out_predictions = [[] for v in range(num_classes)] 121 | out_labels = [[] for v in range(num_classes)] 122 | for triplet in prediction_triplets: 123 | out_predictions[triplet[0]].append(triplet[1]) 124 | out_labels[triplet[0]].append(triplet[2]) 125 | out_true_positives = [numpy.sum(labels[:,i]) for i in range(num_classes)] 126 | 127 | return out_predictions, out_labels, out_true_positives 128 | 129 | def top_k_triplets(predictions, labels, k=20): 130 | """Get the top_k for a 1-d numpy array. Returns a sparse list of tuples in 131 | (prediction, class) format""" 132 | m = len(predictions) 133 | k = min(k, m) 134 | indices = numpy.argpartition(predictions, -k)[-k:] 135 | return [(index, predictions[index], labels[index]) for index in indices] 136 | 137 | class EvaluationMetrics(object): 138 | """A class to store the evaluation metrics.""" 139 | 140 | def __init__(self, num_class, top_k): 141 | """Construct an EvaluationMetrics object to store the evaluation metrics. 142 | 143 | Args: 144 | num_class: A positive integer specifying the number of classes. 145 | top_k: A positive integer specifying how many predictions are considered per video. 146 | 147 | Raises: 148 | ValueError: An error occurred when MeanAveragePrecisionCalculator cannot 149 | not be constructed. 150 | """ 151 | self.sum_hit_at_one = 0.0 152 | self.sum_perr = 0.0 153 | self.sum_loss = 0.0 154 | self.map_calculator = map_calculator.MeanAveragePrecisionCalculator(num_class) 155 | self.global_ap_calculator = ap_calculator.AveragePrecisionCalculator() 156 | self.top_k = top_k 157 | self.num_examples = 0 158 | 159 | def accumulate(self, predictions, labels, loss): 160 | """Accumulate the metrics calculated locally for this mini-batch. 161 | 162 | Args: 163 | predictions: A numpy matrix containing the outputs of the model. 164 | Dimensions are 'batch' x 'num_classes'. 165 | labels: A numpy matrix containing the ground truth labels. 166 | Dimensions are 'batch' x 'num_classes'. 167 | loss: A numpy array containing the loss for each sample. 168 | 169 | Returns: 170 | dictionary: A dictionary storing the metrics for the mini-batch. 171 | 172 | Raises: 173 | ValueError: An error occurred when the shape of predictions and actuals 174 | does not match. 175 | """ 176 | batch_size = labels.shape[0] 177 | mean_hit_at_one = calculate_hit_at_one(predictions, labels) 178 | mean_perr = calculate_precision_at_equal_recall_rate(predictions, labels) 179 | mean_loss = numpy.mean(loss) 180 | 181 | # Take the top 20 predictions. 182 | sparse_predictions, sparse_labels, num_positives = top_k_by_class(predictions, labels, self.top_k) 183 | self.map_calculator.accumulate(sparse_predictions, sparse_labels, num_positives) 184 | self.global_ap_calculator.accumulate(flatten(sparse_predictions), flatten(sparse_labels), sum(num_positives)) 185 | 186 | self.num_examples += batch_size 187 | self.sum_hit_at_one += mean_hit_at_one * batch_size 188 | self.sum_perr += mean_perr * batch_size 189 | self.sum_loss += mean_loss * batch_size 190 | 191 | return {"hit_at_one": mean_hit_at_one, "perr": mean_perr, "loss": mean_loss} 192 | 193 | def get(self): 194 | """Calculate the evaluation metrics for the whole epoch. 195 | 196 | Raises: 197 | ValueError: If no examples were accumulated. 198 | 199 | Returns: 200 | dictionary: a dictionary storing the evaluation metrics for the epoch. The 201 | dictionary has the fields: avg_hit_at_one, avg_perr, avg_loss, and 202 | aps (default nan). 203 | """ 204 | if self.num_examples <= 0: 205 | raise ValueError("total_sample must be positive.") 206 | avg_hit_at_one = self.sum_hit_at_one / self.num_examples 207 | avg_perr = self.sum_perr / self.num_examples 208 | avg_loss = self.sum_loss / self.num_examples 209 | 210 | aps = self.map_calculator.peek_map_at_n() 211 | gap = self.global_ap_calculator.peek_ap_at_n() 212 | 213 | epoch_info_dict = {} 214 | return {"avg_hit_at_one": avg_hit_at_one, "avg_perr": avg_perr, 215 | "avg_loss": avg_loss, "aps": aps, "gap": gap} 216 | 217 | def clear(self): 218 | """Clear the evaluation metrics and reset the EvaluationMetrics object.""" 219 | self.sum_hit_at_one = 0.0 220 | self.sum_perr = 0.0 221 | self.sum_loss = 0.0 222 | self.map_calculator.clear() 223 | self.global_ap_calculator.clear() 224 | self.num_examples = 0 225 | -------------------------------------------------------------------------------- /mnist/export_model.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS-IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """Utilities to export a model for batch prediction.""" 15 | 16 | import tensorflow as tf 17 | import tensorflow.contrib.slim as slim 18 | 19 | from tensorflow.python.saved_model import builder as saved_model_builder 20 | from tensorflow.python.saved_model import signature_constants 21 | from tensorflow.python.saved_model import signature_def_utils 22 | from tensorflow.python.saved_model import tag_constants 23 | from tensorflow.python.saved_model import utils as saved_model_utils 24 | 25 | _TOP_PREDICTIONS_IN_OUTPUT = 2 26 | 27 | class ModelExporter(object): 28 | 29 | def __init__(self, model, reader): 30 | self.model = model 31 | self.reader = reader 32 | 33 | with tf.Graph().as_default() as graph: 34 | self.inputs, self.outputs = self.build_inputs_and_outputs() 35 | self.graph = graph 36 | self.saver = tf.train.Saver(tf.trainable_variables(), sharded=True) 37 | 38 | def export_model(self, model_dir, global_step_val, last_checkpoint): 39 | """Exports the model so that it can used for batch predictions.""" 40 | 41 | with self.graph.as_default(): 42 | with tf.Session() as session: 43 | session.run(tf.global_variables_initializer()) 44 | self.saver.restore(session, last_checkpoint) 45 | 46 | signature = signature_def_utils.build_signature_def( 47 | inputs=self.inputs, 48 | outputs=self.outputs, 49 | method_name=signature_constants.PREDICT_METHOD_NAME) 50 | 51 | signature_map = {signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: 52 | signature} 53 | 54 | model_builder = saved_model_builder.SavedModelBuilder(model_dir) 55 | model_builder.add_meta_graph_and_variables(session, 56 | tags=[tag_constants.SERVING], 57 | signature_def_map=signature_map, 58 | clear_devices=True) 59 | model_builder.save() 60 | 61 | def build_inputs_and_outputs(self): 62 | serialized_examples = tf.placeholder(tf.string, shape=(None,)) 63 | index_output, predictions_output = ( 64 | self.build_prediction_graph(serialized_examples)) 65 | 66 | inputs = {"example_bytes": 67 | saved_model_utils.build_tensor_info(serialized_examples)} 68 | 69 | outputs = { 70 | "class_indexes": saved_model_utils.build_tensor_info(index_output), 71 | "predictions": saved_model_utils.build_tensor_info(predictions_output)} 72 | 73 | return inputs, outputs 74 | 75 | def build_prediction_graph(self, serialized_examples): 76 | model_input_raw, labels_batch = ( 77 | self.reader.prepare_serialized_examples(serialized_examples)) 78 | 79 | model_input = model_input_raw 80 | 81 | with tf.variable_scope("tower"): 82 | result = self.model.create_model( 83 | model_input, 84 | num_classes=self.reader.num_classes, 85 | labels=labels_batch, 86 | is_training=False) 87 | 88 | for variable in slim.get_model_variables(): 89 | tf.summary.histogram(variable.op.name, variable) 90 | 91 | predictions = result["predictions"] 92 | 93 | prediction, index = tf.nn.top_k(predictions, 1) 94 | return prediction, index 95 | -------------------------------------------------------------------------------- /mnist/inference.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS-IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Binary for generating predictions over a set of videos.""" 16 | 17 | import os 18 | import time 19 | 20 | import numpy 21 | import tensorflow as tf 22 | 23 | from tensorflow import app 24 | from tensorflow import flags 25 | from tensorflow import gfile 26 | from tensorflow import logging 27 | 28 | import eval_util 29 | import losses 30 | import readers 31 | import utils 32 | 33 | FLAGS = flags.FLAGS 34 | 35 | if __name__ == '__main__': 36 | flags.DEFINE_string("train_dir", "/tmp/mnist_model/", 37 | "The directory to load the model files from.") 38 | flags.DEFINE_string("output_file", "", 39 | "The file to save the predictions to.") 40 | flags.DEFINE_string( 41 | "input_data_pattern", "", 42 | "File glob defining the evaluation dataset in tensorflow.SequenceExample " 43 | "format. The SequenceExamples are expected to have an 'rgb' byte array " 44 | "sequence feature as well as a 'labels' int64 context feature.") 45 | 46 | # Other flags. 47 | flags.DEFINE_integer("num_readers", 1, 48 | "How many threads to use for reading input files.") 49 | 50 | def format_lines(predictions): 51 | batch_size = len(predictions) 52 | predictions = numpy.argmax(predictions, 1) 53 | for video_index in range(batch_size): 54 | yield "%d\n" % predictions[video_index] 55 | 56 | 57 | def get_input_data_tensors(reader, data_pattern, batch_size, num_readers=1): 58 | """Creates the section of the graph which reads the input data. 59 | 60 | Args: 61 | reader: A class which parses the input data. 62 | data_pattern: A 'glob' style path to the data files. 63 | batch_size: How many examples to process at a time. 64 | num_readers: How many I/O threads to use. 65 | 66 | Returns: 67 | A tuple containing the image id tensor, image tensor and labels tensor. 68 | 69 | Raises: 70 | IOError: If no files matching the given pattern were found. 71 | """ 72 | with tf.name_scope("input"): 73 | files = gfile.Glob(data_pattern) 74 | if not files: 75 | raise IOError("Unable to find input files. data_pattern='" + 76 | data_pattern + "'") 77 | logging.info("number of input files: " + str(len(files))) 78 | filename_queue = tf.train.string_input_producer( 79 | files, num_epochs=1, shuffle=False) 80 | examples_and_ids = [reader.prepare_reader(filename_queue, 1) 81 | for _ in range(num_readers)] 82 | 83 | image_batch, unused = ( 84 | tf.train.batch_join(examples_and_ids, 85 | batch_size=batch_size, 86 | allow_smaller_final_batch=True, 87 | enqueue_many=True)) 88 | return image_batch 89 | 90 | def inference(reader, train_dir, data_pattern, out_file_location, batch_size): 91 | with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess, gfile.Open(out_file_location, "w+") as out_file: 92 | image_batch = get_input_data_tensors(reader, data_pattern, batch_size) 93 | latest_checkpoint = tf.train.latest_checkpoint(train_dir) 94 | if latest_checkpoint is None: 95 | raise Exception("unable to find a checkpoint at location: %s" % train_dir) 96 | else: 97 | meta_graph_location = latest_checkpoint + ".meta" 98 | logging.info("loading meta-graph: " + meta_graph_location) 99 | saver = tf.train.import_meta_graph(meta_graph_location, clear_devices=True) 100 | logging.info("restoring variables from " + latest_checkpoint) 101 | saver.restore(sess, latest_checkpoint) 102 | input_tensor = tf.get_collection("input_batch_raw")[0] 103 | predictions_tensor = tf.get_collection("predictions")[0] 104 | 105 | # Workaround for num_epochs issue. 106 | def set_up_init_ops(variables): 107 | init_op_list = [] 108 | for variable in list(variables): 109 | if "train_input" in variable.name: 110 | init_op_list.append(tf.assign(variable, 1)) 111 | variables.remove(variable) 112 | init_op_list.append(tf.variables_initializer(variables)) 113 | return init_op_list 114 | 115 | sess.run(set_up_init_ops(tf.get_collection_ref( 116 | tf.GraphKeys.LOCAL_VARIABLES))) 117 | 118 | coord = tf.train.Coordinator() 119 | threads = tf.train.start_queue_runners(sess=sess, coord=coord) 120 | num_examples_processed = 0 121 | start_time = time.time() 122 | out_file.write("Id,Category\n") 123 | 124 | try: 125 | line_id = 1 126 | while not coord.should_stop(): 127 | image_batch_val = sess.run(image_batch) 128 | predictions_val = sess.run(predictions_tensor, feed_dict={input_tensor: image_batch_val}) 129 | now = time.time() 130 | num_examples_processed += len(image_batch_val) 131 | num_classes = predictions_val.shape[1] 132 | logging.info("num examples processed: " + str(num_examples_processed) + " elapsed seconds: " + "{0:.2f}".format(now-start_time)) 133 | for line in format_lines(predictions_val): 134 | out_file.write("%d,%s" % (line_id, line)) 135 | line_id += 1 136 | out_file.flush() 137 | 138 | 139 | except tf.errors.OutOfRangeError: 140 | logging.info('Done with inference. The output file was written to ' + out_file_location) 141 | finally: 142 | coord.request_stop() 143 | 144 | coord.join(threads) 145 | sess.close() 146 | 147 | 148 | def main(unused_argv): 149 | logging.set_verbosity(tf.logging.INFO) 150 | 151 | reader = readers.MnistReader() 152 | 153 | if FLAGS.output_file is "": 154 | raise ValueError("'output_file' was not specified. " 155 | "Unable to continue with inference.") 156 | 157 | if FLAGS.input_data_pattern is "": 158 | raise ValueError("'input_data_pattern' was not specified. " 159 | "Unable to continue with inference.") 160 | 161 | inference(reader, FLAGS.train_dir, FLAGS.input_data_pattern, 162 | FLAGS.output_file, 8192) 163 | 164 | 165 | if __name__ == "__main__": 166 | app.run() 167 | -------------------------------------------------------------------------------- /mnist/losses.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS-IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Provides definitions for non-regularized training or test losses.""" 16 | 17 | import tensorflow as tf 18 | 19 | 20 | class BaseLoss(object): 21 | """Inherit from this class when implementing new losses.""" 22 | 23 | def calculate_loss(self, unused_predictions, unused_labels, **unused_params): 24 | """Calculates the average loss of the examples in a mini-batch. 25 | 26 | Args: 27 | unused_predictions: a 2-d tensor storing the prediction scores, in which 28 | each row represents a sample in the mini-batch and each column 29 | represents a class. 30 | unused_labels: a 2-d tensor storing the labels, which has the same shape 31 | as the unused_predictions. The labels must be in the range of 0 and 1. 32 | unused_params: loss specific parameters. 33 | 34 | Returns: 35 | A scalar loss tensor. 36 | """ 37 | raise NotImplementedError() 38 | 39 | 40 | class CrossEntropyLoss(BaseLoss): 41 | """Calculate the cross entropy loss between the predictions and labels. 42 | """ 43 | 44 | def calculate_loss(self, predictions, labels, **unused_params): 45 | with tf.name_scope("loss_xent"): 46 | epsilon = 10e-6 47 | float_labels = tf.cast(labels, tf.float32) 48 | cross_entropy_loss = float_labels * tf.log(predictions + epsilon) + ( 49 | 1 - float_labels) * tf.log(1 - predictions + epsilon) 50 | cross_entropy_loss = tf.negative(cross_entropy_loss) 51 | return tf.reduce_mean(tf.reduce_sum(cross_entropy_loss, 1)) 52 | 53 | 54 | class HingeLoss(BaseLoss): 55 | """Calculate the hinge loss between the predictions and labels. 56 | 57 | Note the subgradient is used in the backpropagation, and thus the optimization 58 | may converge slower. The predictions trained by the hinge loss are between -1 59 | and +1. 60 | """ 61 | 62 | def calculate_loss(self, predictions, labels, b=1.0, **unused_params): 63 | with tf.name_scope("loss_hinge"): 64 | float_labels = tf.cast(labels, tf.float32) 65 | all_zeros = tf.zeros(tf.shape(float_labels), dtype=tf.float32) 66 | all_ones = tf.ones(tf.shape(float_labels), dtype=tf.float32) 67 | sign_labels = tf.subtract(tf.scalar_mul(2, float_labels), all_ones) 68 | hinge_loss = tf.maximum( 69 | all_zeros, tf.scalar_mul(b, all_ones) - sign_labels * predictions) 70 | return tf.reduce_mean(tf.reduce_sum(hinge_loss, 1)) 71 | 72 | 73 | class SoftmaxLoss(BaseLoss): 74 | """Calculate the softmax loss between the predictions and labels. 75 | 76 | The function calculates the loss in the following way: first we feed the 77 | predictions to the softmax activation function and then we calculate 78 | the minus linear dot product between the logged softmax activations and the 79 | normalized ground truth label. 80 | 81 | It is an extension to the one-hot label. It allows for more than one positive 82 | labels for each sample. 83 | """ 84 | 85 | def calculate_loss(self, predictions, labels, **unused_params): 86 | with tf.name_scope("loss_softmax"): 87 | epsilon = 10e-8 88 | float_labels = tf.cast(labels, tf.float32) 89 | # l1 normalization (labels are no less than 0) 90 | label_rowsum = tf.maximum( 91 | tf.reduce_sum(float_labels, 1, keep_dims=True), 92 | epsilon) 93 | norm_float_labels = tf.div(float_labels, label_rowsum) 94 | softmax_outputs = tf.nn.softmax(predictions) 95 | softmax_loss = tf.negative(tf.reduce_sum( 96 | tf.multiply(norm_float_labels, tf.log(softmax_outputs)), 1)) 97 | return tf.reduce_mean(softmax_loss) 98 | -------------------------------------------------------------------------------- /mnist/mean_average_precision_calculator.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS-IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Calculate the mean average precision. 16 | 17 | It provides an interface for calculating mean average precision 18 | for an entire list or the top-n ranked items. 19 | 20 | Example usages: 21 | We first call the function accumulate many times to process parts of the ranked 22 | list. After processing all the parts, we call peek_map_at_n 23 | to calculate the mean average precision. 24 | 25 | ``` 26 | import random 27 | 28 | p = np.array([[random.random() for _ in xrange(50)] for _ in xrange(1000)]) 29 | a = np.array([[random.choice([0, 1]) for _ in xrange(50)] 30 | for _ in xrange(1000)]) 31 | 32 | # mean average precision for 50 classes. 33 | calculator = mean_average_precision_calculator.MeanAveragePrecisionCalculator( 34 | num_class=50) 35 | calculator.accumulate(p, a) 36 | aps = calculator.peek_map_at_n() 37 | ``` 38 | """ 39 | 40 | import numpy 41 | import average_precision_calculator 42 | 43 | 44 | class MeanAveragePrecisionCalculator(object): 45 | """This class is to calculate mean average precision. 46 | """ 47 | 48 | def __init__(self, num_class): 49 | """Construct a calculator to calculate the (macro) average precision. 50 | 51 | Args: 52 | num_class: A positive Integer specifying the number of classes. 53 | top_n_array: A list of positive integers specifying the top n for each 54 | class. The top n in each class will be used to calculate its average 55 | precision at n. 56 | The size of the array must be num_class. 57 | 58 | Raises: 59 | ValueError: An error occurred when num_class is not a positive integer; 60 | or the top_n_array is not a list of positive integers. 61 | """ 62 | if not isinstance(num_class, int) or num_class <= 1: 63 | raise ValueError("num_class must be a positive integer.") 64 | 65 | self._ap_calculators = [] # member of AveragePrecisionCalculator 66 | self._num_class = num_class # total number of classes 67 | for i in range(num_class): 68 | self._ap_calculators.append( 69 | average_precision_calculator.AveragePrecisionCalculator()) 70 | 71 | def accumulate(self, predictions, actuals, num_positives=None): 72 | """Accumulate the predictions and their ground truth labels. 73 | 74 | Args: 75 | predictions: A list of lists storing the prediction scores. The outer 76 | dimension corresponds to classes. 77 | actuals: A list of lists storing the ground truth labels. The dimensions 78 | should correspond to the predictions input. Any value 79 | larger than 0 will be treated as positives, otherwise as negatives. 80 | num_positives: If provided, it is a list of numbers representing the 81 | number of true positives for each class. If not provided, the number of 82 | true positives will be inferred from the 'actuals' array. 83 | 84 | Raises: 85 | ValueError: An error occurred when the shape of predictions and actuals 86 | does not match. 87 | """ 88 | if not num_positives: 89 | num_positives = [None for i in predictions.shape[1]] 90 | 91 | calculators = self._ap_calculators 92 | for i in range(len(predictions)): 93 | calculators[i].accumulate(predictions[i], actuals[i], num_positives[i]) 94 | 95 | def clear(self): 96 | for calculator in self._ap_calculators: 97 | calculator.clear() 98 | 99 | def is_empty(self): 100 | return ([calculator.heap_size for calculator in self._ap_calculators] == 101 | [0 for _ in range(self._num_class)]) 102 | 103 | def peek_map_at_n(self): 104 | """Peek the non-interpolated mean average precision at n. 105 | 106 | Returns: 107 | An array of non-interpolated average precision at n (default 0) for each 108 | class. 109 | """ 110 | aps = [self._ap_calculators[i].peek_ap_at_n() 111 | for i in range(self._num_class)] 112 | return aps 113 | -------------------------------------------------------------------------------- /mnist/mnist_models.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS-IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Contains model definitions.""" 16 | import math 17 | 18 | import models 19 | import tensorflow as tf 20 | import utils 21 | 22 | import tensorflow.contrib.slim as slim 23 | 24 | class LogisticModel(models.BaseModel): 25 | """Logistic model with L2 regularization.""" 26 | 27 | def create_model(self, model_input, num_classes=10, l2_penalty=1e-8, **unused_params): 28 | """Creates a logistic model. 29 | 30 | Args: 31 | model_input: 'batch' x 'num_features' matrix of input features. 32 | vocab_size: The number of classes in the dataset. 33 | 34 | Returns: 35 | A dictionary with a tensor containing the probability predictions of the 36 | model in the 'predictions' key. The dimensions of the tensor are 37 | batch_size x num_classes.""" 38 | output = slim.fully_connected( 39 | model_input, num_classes, activation_fn=tf.nn.softmax, 40 | weights_regularizer=slim.l2_regularizer(l2_penalty)) 41 | return {"predictions": output} 42 | -------------------------------------------------------------------------------- /mnist/models.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS-IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Contains the base class for models.""" 16 | 17 | class BaseModel(object): 18 | """Inherit from this class when implementing new models.""" 19 | 20 | def create_model(self, unused_model_input, **unused_params): 21 | raise NotImplementedError() 22 | -------------------------------------------------------------------------------- /mnist/readers.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS-IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Provides readers configured for different datasets.""" 16 | 17 | import tensorflow as tf 18 | import utils 19 | import tensorflow.contrib.slim as slim 20 | 21 | from tensorflow import logging 22 | import numpy as np 23 | 24 | def resize_axis(tensor, axis, new_size, fill_value=0): 25 | """Truncates or pads a tensor to new_size on on a given axis. 26 | 27 | Truncate or extend tensor such that tensor.shape[axis] == new_size. If the 28 | size increases, the padding will be performed at the end, using fill_value. 29 | 30 | Args: 31 | tensor: The tensor to be resized. 32 | axis: An integer representing the dimension to be sliced. 33 | new_size: An integer or 0d tensor representing the new value for 34 | tensor.shape[axis]. 35 | fill_value: Value to use to fill any new entries in the tensor. Will be 36 | cast to the type of tensor. 37 | 38 | Returns: 39 | The resized tensor. 40 | """ 41 | tensor = tf.convert_to_tensor(tensor) 42 | shape = tf.unstack(tf.shape(tensor)) 43 | 44 | pad_shape = shape[:] 45 | pad_shape[axis] = tf.maximum(0, new_size - shape[axis]) 46 | 47 | shape[axis] = tf.minimum(shape[axis], new_size) 48 | shape = tf.stack(shape) 49 | 50 | resized = tf.concat([ 51 | tf.slice(tensor, tf.zeros_like(shape), shape), 52 | tf.fill(tf.stack(pad_shape), tf.cast(fill_value, tensor.dtype)) 53 | ], axis) 54 | 55 | # Update shape. 56 | new_shape = tensor.get_shape().as_list() # A copy is being made. 57 | new_shape[axis] = new_size 58 | resized.set_shape(new_shape) 59 | return resized 60 | 61 | class BaseReader(object): 62 | """Inherit from this class when implementing new readers.""" 63 | 64 | def prepare_reader(self, unused_filename_queue): 65 | """Create a thread for generating prediction and label tensors.""" 66 | raise NotImplementedError() 67 | 68 | 69 | class MnistReader(BaseReader): 70 | """Reads TFRecords of pre-aggregated Examples. 71 | """ 72 | 73 | def __init__(self, 74 | num_classes=10): 75 | self.num_classes = num_classes 76 | 77 | def prepare_reader(self, filename_queue, batch_size=1024): 78 | reader = tf.TFRecordReader() 79 | _, serialized_examples = reader.read_up_to(filename_queue, batch_size) 80 | 81 | print("start " + str(serialized_examples.get_shape().ndims)) 82 | tf.add_to_collection("serialized_examples", serialized_examples) 83 | return self.prepare_serialized_examples(serialized_examples) 84 | 85 | def prepare_serialized_examples(self, serialized_examples): 86 | feature_map = { 87 | 'image_raw': tf.FixedLenFeature([784], tf.int64), 88 | 'label': tf.FixedLenFeature([], tf.int64), 89 | } 90 | features = tf.parse_example(serialized_examples, features=feature_map) 91 | 92 | images = tf.cast(features["image_raw"], tf.float32) * (1. / 255) 93 | labels = tf.cast(features['label'], tf.int32) 94 | 95 | def dense_to_one_hot(label_batch, num_classes): 96 | one_hot = tf.map_fn(lambda x : tf.cast(slim.one_hot_encoding(x, num_classes), tf.int32), label_batch) 97 | one_hot = tf.reshape(one_hot, [-1, num_classes]) 98 | return one_hot 99 | 100 | labels = dense_to_one_hot(labels, 10) 101 | return images, labels 102 | -------------------------------------------------------------------------------- /mnist/train.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS-IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | """Binary for training Tensorflow models on the YouTube-8M dataset.""" 15 | 16 | import json 17 | import os 18 | import time 19 | 20 | import eval_util 21 | import export_model 22 | import losses 23 | import mnist_models 24 | import readers 25 | import tensorflow as tf 26 | import tensorflow.contrib.slim as slim 27 | from tensorflow import app 28 | from tensorflow import flags 29 | from tensorflow import gfile 30 | from tensorflow import logging 31 | from tensorflow.python.client import device_lib 32 | import utils 33 | 34 | FLAGS = flags.FLAGS 35 | 36 | if __name__ == "__main__": 37 | # Dataset flags. 38 | flags.DEFINE_string("train_dir", "/tmp/mnist_model/", 39 | "The directory to save the model files in.") 40 | flags.DEFINE_string( 41 | "train_data_pattern", "", 42 | "File glob for the training dataset.") 43 | 44 | flags.DEFINE_string( 45 | "model", "LogisticModel", 46 | "Which architecture to use for the model. Models are defined " 47 | "in models.py.") 48 | flags.DEFINE_bool( 49 | "start_new_model", False, 50 | "If set, this will not resume from a checkpoint and will instead create a" 51 | " new model instance.") 52 | 53 | # Training flags. 54 | flags.DEFINE_integer("batch_size", 512, 55 | "How many examples to process per batch for training.") 56 | flags.DEFINE_string("label_loss", "CrossEntropyLoss", 57 | "Which loss function to use for training the model.") 58 | flags.DEFINE_float( 59 | "regularization_penalty", 1.0, 60 | "How much weight to give to the regularization loss (the label loss has " 61 | "a weight of 1).") 62 | 63 | flags.DEFINE_float("base_learning_rate", 0.01, 64 | "Which learning rate to start with.") 65 | 66 | flags.DEFINE_float("learning_rate_decay", 0.95, 67 | "Learning rate decay factor to be applied every " 68 | "learning_rate_decay_examples.") 69 | 70 | flags.DEFINE_float("learning_rate_decay_examples", 4000000, 71 | "Multiply current learning rate by learning_rate_decay " 72 | "every learning_rate_decay_examples.") 73 | 74 | flags.DEFINE_integer("num_epochs", 100, 75 | "How many passes to make over the dataset before " 76 | "halting training.") 77 | 78 | flags.DEFINE_integer("max_steps", None, 79 | "The maximum number of iterations of the training loop.") 80 | 81 | flags.DEFINE_integer("export_model_steps", 1000, 82 | "The period, in number of steps, with which the model " 83 | "is exported for batch prediction.") 84 | 85 | # Other flags. 86 | flags.DEFINE_integer("num_readers", 1, 87 | "How many threads to use for reading input files.") 88 | 89 | flags.DEFINE_string("optimizer", "AdamOptimizer", 90 | "What optimizer class to use.") 91 | 92 | flags.DEFINE_float("clip_gradient_norm", 1.0, "Norm to clip gradients to.") 93 | 94 | flags.DEFINE_bool( 95 | "log_device_placement", False, 96 | "Whether to write the device on which every op will run into the " 97 | "logs on startup.") 98 | 99 | def validate_class_name(flag_value, category, modules, expected_superclass): 100 | """Checks that the given string matches a class of the expected type. 101 | 102 | Args: 103 | flag_value: A string naming the class to instantiate. 104 | category: A string used further describe the class in error messages 105 | (e.g. 'model', 'reader', 'loss'). 106 | modules: A list of modules to search for the given class. 107 | expected_superclass: A class that the given class should inherit from. 108 | 109 | Raises: 110 | FlagsError: If the given class could not be found or if the first class 111 | found with that name doesn't inherit from the expected superclass. 112 | 113 | Returns: 114 | True if a class was found that matches the given constraints. 115 | """ 116 | candidates = [getattr(module, flag_value, None) for module in modules] 117 | for candidate in candidates: 118 | if not candidate: 119 | continue 120 | if not issubclass(candidate, expected_superclass): 121 | raise flags.FlagsError("%s '%s' doesn't inherit from %s." % 122 | (category, flag_value, 123 | expected_superclass.__name__)) 124 | return True 125 | raise flags.FlagsError("Unable to find %s '%s'." % (category, flag_value)) 126 | 127 | def get_input_data_tensors(reader, 128 | data_pattern, 129 | batch_size=1000, 130 | num_epochs=None, 131 | num_readers=1): 132 | """Creates the section of the graph which reads the training data. 133 | 134 | Args: 135 | reader: A class which parses the training data. 136 | data_pattern: A 'glob' style path to the data files. 137 | batch_size: How many examples to process at a time. 138 | num_epochs: How many passes to make over the training data. Set to 'None' 139 | to run indefinitely. 140 | num_readers: How many I/O threads to use. 141 | 142 | Returns: 143 | A tuple containing the ids tensor, images tensor, labels tensor. 144 | 145 | Raises: 146 | IOError: If no files matching the given pattern were found. 147 | """ 148 | logging.info("Using batch size of " + str(batch_size) + " for training.") 149 | with tf.name_scope("train_input"): 150 | files = gfile.Glob(data_pattern) 151 | if not files: 152 | raise IOError("Unable to find training files. data_pattern='" + 153 | data_pattern + "'.") 154 | logging.info("Number of training files: %s.", str(len(files))) 155 | filename_queue = tf.train.string_input_producer( 156 | files, num_epochs=num_epochs, shuffle=True) 157 | 158 | training_data = [ 159 | reader.prepare_reader(filename_queue, FLAGS.batch_size) for _ in range(num_readers) 160 | ] 161 | 162 | return tf.train.shuffle_batch_join( 163 | training_data, 164 | batch_size=batch_size, 165 | capacity=batch_size * 5, 166 | min_after_dequeue=batch_size, 167 | allow_smaller_final_batch=True, 168 | enqueue_many=True) 169 | 170 | 171 | def find_class_by_name(name, modules): 172 | """Searches the provided modules for the named class and returns it.""" 173 | modules = [getattr(module, name, None) for module in modules] 174 | return next(a for a in modules if a) 175 | 176 | def build_graph(reader, 177 | model, 178 | train_data_pattern, 179 | label_loss_fn=losses.SoftmaxLoss(), 180 | batch_size=1000, 181 | base_learning_rate=0.01, 182 | learning_rate_decay_examples=1000000, 183 | learning_rate_decay=0.95, 184 | optimizer_class=tf.train.AdamOptimizer, 185 | clip_gradient_norm=1.0, 186 | regularization_penalty=1, 187 | num_readers=1, 188 | num_epochs=None): 189 | """Creates the Tensorflow graph. 190 | 191 | This will only be called once in the life of 192 | a training model, because after the graph is created the model will be 193 | restored from a meta graph file rather than being recreated. 194 | 195 | Args: 196 | reader: The data file reader. It should inherit from BaseReader. 197 | model: The core model (e.g. logistic or neural net). It should inherit 198 | from BaseModel. 199 | train_data_pattern: glob path to the training data files. 200 | label_loss_fn: What kind of loss to apply to the model. It should inherit 201 | from BaseLoss. 202 | batch_size: How many examples to process at a time. 203 | base_learning_rate: What learning rate to initialize the optimizer with. 204 | optimizer_class: Which optimization algorithm to use. 205 | clip_gradient_norm: Magnitude of the gradient to clip to. 206 | regularization_penalty: How much weight to give the regularization loss 207 | compared to the label loss. 208 | num_readers: How many threads to use for I/O operations. 209 | num_epochs: How many passes to make over the data. 'None' means an 210 | unlimited number of passes. 211 | """ 212 | 213 | global_step = tf.Variable(0, trainable=False, name="global_step") 214 | 215 | local_device_protos = device_lib.list_local_devices() 216 | gpus = [x.name for x in local_device_protos if x.device_type == 'GPU'] 217 | num_gpus = len(gpus) 218 | 219 | if num_gpus > 0: 220 | logging.info("Using the following GPUs to train: " + str(gpus)) 221 | num_towers = num_gpus 222 | device_string = '/gpu:%d' 223 | else: 224 | logging.info("No GPUs found. Training on CPU.") 225 | num_towers = 1 226 | device_string = '/cpu:%d' 227 | 228 | learning_rate = tf.train.exponential_decay( 229 | base_learning_rate, 230 | global_step * batch_size * num_towers, 231 | learning_rate_decay_examples, 232 | learning_rate_decay, 233 | staircase=True) 234 | tf.summary.scalar('learning_rate', learning_rate) 235 | 236 | optimizer = optimizer_class(learning_rate) 237 | model_input_raw, labels_batch = ( 238 | get_input_data_tensors( 239 | reader, 240 | train_data_pattern, 241 | batch_size=batch_size * num_towers, 242 | num_readers=num_readers, 243 | num_epochs=num_epochs)) 244 | tf.summary.histogram("model/input_raw", model_input_raw) 245 | 246 | model_input = model_input_raw 247 | 248 | tower_inputs = tf.split(model_input, num_towers) 249 | tower_labels = tf.split(labels_batch, num_towers) 250 | tower_gradients = [] 251 | tower_predictions = [] 252 | tower_label_losses = [] 253 | tower_reg_losses = [] 254 | for i in range(num_towers): 255 | # For some reason these 'with' statements can't be combined onto the same 256 | # line. They have to be nested. 257 | with tf.device(device_string % i): 258 | with (tf.variable_scope(("tower"), reuse=True if i > 0 else None)): 259 | with (slim.arg_scope([slim.model_variable, slim.variable], device="/cpu:0" if num_gpus!=1 else "/gpu:0")): 260 | result = model.create_model( 261 | tower_inputs[i], 262 | vocab_size=reader.num_classes, 263 | labels=tower_labels[i]) 264 | for variable in slim.get_model_variables(): 265 | tf.summary.histogram(variable.op.name, variable) 266 | 267 | predictions = result["predictions"] 268 | tower_predictions.append(predictions) 269 | 270 | if "loss" in result.keys(): 271 | label_loss = result["loss"] 272 | else: 273 | label_loss = label_loss_fn.calculate_loss(predictions, tower_labels[i]) 274 | 275 | if "regularization_loss" in result.keys(): 276 | reg_loss = result["regularization_loss"] 277 | else: 278 | reg_loss = tf.constant(0.0) 279 | 280 | reg_losses = tf.losses.get_regularization_losses() 281 | if reg_losses: 282 | reg_loss += tf.add_n(reg_losses) 283 | 284 | tower_reg_losses.append(reg_loss) 285 | 286 | # Adds update_ops (e.g., moving average updates in batch normalization) as 287 | # a dependency to the train_op. 288 | update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) 289 | if "update_ops" in result.keys(): 290 | update_ops += result["update_ops"] 291 | if update_ops: 292 | with tf.control_dependencies(update_ops): 293 | barrier = tf.no_op(name="gradient_barrier") 294 | with tf.control_dependencies([barrier]): 295 | label_loss = tf.identity(label_loss) 296 | 297 | tower_label_losses.append(label_loss) 298 | 299 | # Incorporate the L2 weight penalties etc. 300 | final_loss = regularization_penalty * reg_loss + label_loss 301 | gradients = optimizer.compute_gradients(final_loss, 302 | colocate_gradients_with_ops=False) 303 | tower_gradients.append(gradients) 304 | label_loss = tf.reduce_mean(tf.stack(tower_label_losses)) 305 | tf.summary.scalar("label_loss", label_loss) 306 | if regularization_penalty != 0: 307 | reg_loss = tf.reduce_mean(tf.stack(tower_reg_losses)) 308 | tf.summary.scalar("reg_loss", reg_loss) 309 | merged_gradients = utils.combine_gradients(tower_gradients) 310 | 311 | if clip_gradient_norm > 0: 312 | with tf.name_scope('clip_grads'): 313 | merged_gradients = utils.clip_gradient_norms(merged_gradients, clip_gradient_norm) 314 | 315 | train_op = optimizer.apply_gradients(merged_gradients, global_step=global_step) 316 | 317 | tf.add_to_collection("global_step", global_step) 318 | tf.add_to_collection("loss", label_loss) 319 | tf.add_to_collection("predictions", tf.concat(tower_predictions, 0)) 320 | tf.add_to_collection("input_batch_raw", model_input_raw) 321 | tf.add_to_collection("input_batch", model_input) 322 | tf.add_to_collection("labels", tf.cast(labels_batch, tf.float32)) 323 | tf.add_to_collection("train_op", train_op) 324 | 325 | 326 | class Trainer(object): 327 | """A Trainer to train a Tensorflow graph.""" 328 | 329 | def __init__(self, cluster, task, train_dir, model, reader, model_exporter, 330 | log_device_placement=True, max_steps=None, 331 | export_model_steps=1000): 332 | """"Creates a Trainer. 333 | 334 | Args: 335 | cluster: A tf.train.ClusterSpec if the execution is distributed. 336 | None otherwise. 337 | task: A TaskSpec describing the job type and the task index. 338 | """ 339 | 340 | self.cluster = cluster 341 | self.task = task 342 | self.is_master = (task.type == "master" and task.index == 0) 343 | self.train_dir = train_dir 344 | self.config = tf.ConfigProto( 345 | allow_soft_placement=True,log_device_placement=log_device_placement) 346 | self.model = model 347 | self.reader = reader 348 | self.model_exporter = model_exporter 349 | self.max_steps = max_steps 350 | self.max_steps_reached = False 351 | self.export_model_steps = export_model_steps 352 | self.last_model_export_step = 0 353 | 354 | # if self.is_master and self.task.index > 0: 355 | # raise StandardError("%s: Only one replica of master expected", 356 | # task_as_string(self.task)) 357 | 358 | def run(self, start_new_model=False): 359 | """Performs training on the currently defined Tensorflow graph. 360 | 361 | Returns: 362 | A tuple of the training Hit@1 and the training PERR. 363 | """ 364 | if self.is_master and start_new_model: 365 | self.remove_training_directory(self.train_dir) 366 | 367 | target, device_fn = self.start_server_if_distributed() 368 | 369 | meta_filename = self.get_meta_filename(start_new_model, self.train_dir) 370 | 371 | with tf.Graph().as_default() as graph: 372 | 373 | if meta_filename: 374 | saver = self.recover_model(meta_filename) 375 | 376 | with tf.device(device_fn): 377 | if not meta_filename: 378 | saver = self.build_model(self.model, self.reader) 379 | 380 | global_step = tf.get_collection("global_step")[0] 381 | loss = tf.get_collection("loss")[0] 382 | predictions = tf.get_collection("predictions")[0] 383 | labels = tf.get_collection("labels")[0] 384 | train_op = tf.get_collection("train_op")[0] 385 | init_op = tf.global_variables_initializer() 386 | 387 | sv = tf.train.Supervisor( 388 | graph, 389 | logdir=self.train_dir, 390 | init_op=init_op, 391 | is_chief=self.is_master, 392 | global_step=global_step, 393 | save_model_secs=15 * 60, 394 | save_summaries_secs=120, 395 | saver=saver) 396 | 397 | logging.info("%s: Starting managed session.", task_as_string(self.task)) 398 | with sv.managed_session(target, config=self.config) as sess: 399 | try: 400 | logging.info("%s: Entering training loop.", task_as_string(self.task)) 401 | while (not sv.should_stop()) and (not self.max_steps_reached): 402 | batch_start_time = time.time() 403 | _, global_step_val, loss_val, predictions_val, labels_val = sess.run( 404 | [train_op, global_step, loss, predictions, labels]) 405 | seconds_per_batch = time.time() - batch_start_time 406 | examples_per_second = labels_val.shape[0] / seconds_per_batch 407 | 408 | if self.max_steps and self.max_steps <= global_step_val: 409 | self.max_steps_reached = True 410 | 411 | if self.is_master and global_step_val % 10 == 0 and self.train_dir: 412 | eval_start_time = time.time() 413 | hit_at_one = eval_util.calculate_hit_at_one(predictions_val, labels_val) 414 | perr = eval_util.calculate_precision_at_equal_recall_rate(predictions_val, 415 | labels_val) 416 | gap = eval_util.calculate_gap(predictions_val, labels_val) 417 | eval_end_time = time.time() 418 | eval_time = eval_end_time - eval_start_time 419 | 420 | logging.info("training step " + str(global_step_val) + " | Loss: " + ("%.2f" % loss_val) + 421 | " Examples/sec: " + ("%.2f" % examples_per_second) + " | Hit@1: " + 422 | ("%.2f" % hit_at_one) + " PERR: " + ("%.2f" % perr) + 423 | " GAP: " + ("%.2f" % gap)) 424 | 425 | sv.summary_writer.add_summary( 426 | utils.MakeSummary("model/Training_Hit@1", hit_at_one), 427 | global_step_val) 428 | sv.summary_writer.add_summary( 429 | utils.MakeSummary("model/Training_Perr", perr), global_step_val) 430 | sv.summary_writer.add_summary( 431 | utils.MakeSummary("model/Training_GAP", gap), global_step_val) 432 | sv.summary_writer.add_summary( 433 | utils.MakeSummary("global_step/Examples/Second", 434 | examples_per_second), global_step_val) 435 | sv.summary_writer.flush() 436 | 437 | # Exporting the model every x steps 438 | time_to_export = ((self.last_model_export_step == 0) or 439 | (global_step_val - self.last_model_export_step 440 | >= self.export_model_steps)) 441 | 442 | if self.is_master and time_to_export: 443 | self.export_model(global_step_val, sv.saver, sv.save_path, sess) 444 | self.last_model_export_step = global_step_val 445 | else: 446 | logging.info("training step " + str(global_step_val) + " | Loss: " + 447 | ("%.2f" % loss_val) + " Examples/sec: " + ("%.2f" % examples_per_second)) 448 | except tf.errors.OutOfRangeError: 449 | logging.info("%s: Done training -- epoch limit reached.", 450 | task_as_string(self.task)) 451 | 452 | logging.info("%s: Exited training loop.", task_as_string(self.task)) 453 | sv.Stop() 454 | 455 | def export_model(self, global_step_val, saver, save_path, session): 456 | 457 | # If the model has already been exported at this step, return. 458 | if global_step_val == self.last_model_export_step: 459 | return 460 | 461 | last_checkpoint = saver.save(session, save_path, global_step_val) 462 | 463 | model_dir = "{0}/export/step_{1}".format(self.train_dir, global_step_val) 464 | logging.info("%s: Exporting the model at step %s to %s.", 465 | task_as_string(self.task), global_step_val, model_dir) 466 | 467 | self.model_exporter.export_model( 468 | model_dir=model_dir, 469 | global_step_val=global_step_val, 470 | last_checkpoint=last_checkpoint) 471 | 472 | 473 | def start_server_if_distributed(self): 474 | """Starts a server if the execution is distributed.""" 475 | 476 | if self.cluster: 477 | logging.info("%s: Starting trainer within cluster %s.", 478 | task_as_string(self.task), self.cluster.as_dict()) 479 | server = start_server(self.cluster, self.task) 480 | target = server.target 481 | device_fn = tf.train.replica_device_setter( 482 | ps_device="/job:ps", 483 | worker_device="/job:%s/task:%d" % (self.task.type, self.task.index), 484 | cluster=self.cluster) 485 | else: 486 | target = "" 487 | device_fn = "" 488 | return (target, device_fn) 489 | 490 | def remove_training_directory(self, train_dir): 491 | """Removes the training directory.""" 492 | try: 493 | logging.info( 494 | "%s: Removing existing train directory.", 495 | task_as_string(self.task)) 496 | gfile.DeleteRecursively(train_dir) 497 | except: 498 | logging.error( 499 | "%s: Failed to delete directory " + train_dir + 500 | " when starting a new model. Please delete it manually and" + 501 | " try again.", task_as_string(self.task)) 502 | 503 | def get_meta_filename(self, start_new_model, train_dir): 504 | if start_new_model: 505 | logging.info("%s: Flag 'start_new_model' is set. Building a new model.", 506 | task_as_string(self.task)) 507 | return None 508 | 509 | latest_checkpoint = tf.train.latest_checkpoint(train_dir) 510 | if not latest_checkpoint: 511 | logging.info("%s: No checkpoint file found. Building a new model.", 512 | task_as_string(self.task)) 513 | return None 514 | 515 | meta_filename = latest_checkpoint + ".meta" 516 | if not gfile.Exists(meta_filename): 517 | logging.info("%s: No meta graph file found. Building a new model.", 518 | task_as_string(self.task)) 519 | return None 520 | else: 521 | return meta_filename 522 | 523 | def recover_model(self, meta_filename): 524 | logging.info("%s: Restoring from meta graph file %s", 525 | task_as_string(self.task), meta_filename) 526 | return tf.train.import_meta_graph(meta_filename) 527 | 528 | def build_model(self, model, reader): 529 | """Find the model and build the graph.""" 530 | 531 | label_loss_fn = find_class_by_name(FLAGS.label_loss, [losses])() 532 | optimizer_class = find_class_by_name(FLAGS.optimizer, [tf.train]) 533 | 534 | build_graph(reader=reader, 535 | model=model, 536 | optimizer_class=optimizer_class, 537 | clip_gradient_norm=FLAGS.clip_gradient_norm, 538 | train_data_pattern=FLAGS.train_data_pattern, 539 | label_loss_fn=label_loss_fn, 540 | base_learning_rate=FLAGS.base_learning_rate, 541 | learning_rate_decay=FLAGS.learning_rate_decay, 542 | learning_rate_decay_examples=FLAGS.learning_rate_decay_examples, 543 | regularization_penalty=FLAGS.regularization_penalty, 544 | num_readers=FLAGS.num_readers, 545 | batch_size=FLAGS.batch_size, 546 | num_epochs=FLAGS.num_epochs) 547 | 548 | return tf.train.Saver(max_to_keep=0, keep_checkpoint_every_n_hours=0.25) 549 | 550 | 551 | def get_reader(): 552 | reader = readers.MnistReader() 553 | return reader 554 | 555 | 556 | class ParameterServer(object): 557 | """A parameter server to serve variables in a distributed execution.""" 558 | 559 | def __init__(self, cluster, task): 560 | """Creates a ParameterServer. 561 | 562 | Args: 563 | cluster: A tf.train.ClusterSpec if the execution is distributed. 564 | None otherwise. 565 | task: A TaskSpec describing the job type and the task index. 566 | """ 567 | 568 | self.cluster = cluster 569 | self.task = task 570 | 571 | def run(self): 572 | """Starts the parameter server.""" 573 | 574 | logging.info("%s: Starting parameter server within cluster %s.", 575 | task_as_string(self.task), self.cluster.as_dict()) 576 | server = start_server(self.cluster, self.task) 577 | server.join() 578 | 579 | 580 | def start_server(cluster, task): 581 | """Creates a Server. 582 | 583 | Args: 584 | cluster: A tf.train.ClusterSpec if the execution is distributed. 585 | None otherwise. 586 | task: A TaskSpec describing the job type and the task index. 587 | """ 588 | 589 | if not task.type: 590 | raise ValueError("%s: The task type must be specified." % 591 | task_as_string(task)) 592 | if task.index is None: 593 | raise ValueError("%s: The task index must be specified." % 594 | task_as_string(task)) 595 | 596 | # Create and start a server. 597 | return tf.train.Server( 598 | tf.train.ClusterSpec(cluster), 599 | protocol="grpc", 600 | job_name=task.type, 601 | task_index=task.index) 602 | 603 | def task_as_string(task): 604 | return "/job:%s/task:%s" % (task.type, task.index) 605 | 606 | def main(unused_argv): 607 | # Load the environment. 608 | env = json.loads(os.environ.get("TF_CONFIG", "{}")) 609 | 610 | # Load the cluster data from the environment. 611 | cluster_data = env.get("cluster", None) 612 | cluster = tf.train.ClusterSpec(cluster_data) if cluster_data else None 613 | 614 | # Load the task data from the environment. 615 | task_data = env.get("task", None) or {"type": "master", "index": 0} 616 | task = type("TaskSpec", (object,), task_data) 617 | 618 | # Logging the version. 619 | logging.set_verbosity(tf.logging.INFO) 620 | logging.info("%s: Tensorflow version: %s.", 621 | task_as_string(task), tf.__version__) 622 | 623 | # Dispatch to a master, a worker, or a parameter server. 624 | if not cluster or task.type == "master" or task.type == "worker": 625 | model = find_class_by_name(FLAGS.model, 626 | [mnist_models])() 627 | 628 | reader = get_reader() 629 | model_exporter = export_model.ModelExporter( 630 | model=model, 631 | reader=reader) 632 | 633 | Trainer(cluster, task, FLAGS.train_dir, model, reader, model_exporter, 634 | FLAGS.log_device_placement, FLAGS.max_steps, 635 | FLAGS.export_model_steps).run(start_new_model=FLAGS.start_new_model) 636 | 637 | elif task.type == "ps": 638 | ParameterServer(cluster, task).run() 639 | else: 640 | raise ValueError("%s: Invalid task_type: %s." % 641 | (task_as_string(task), task.type)) 642 | 643 | if __name__ == "__main__": 644 | app.run() 645 | -------------------------------------------------------------------------------- /mnist/utils.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 Google Inc. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS-IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | """Contains a collection of util functions for training and evaluating. 16 | """ 17 | 18 | import numpy 19 | import tensorflow as tf 20 | from tensorflow import logging 21 | 22 | 23 | def Dequantize(feat_vector, max_quantized_value=2, min_quantized_value=-2): 24 | """Dequantize the feature from the byte format to the float format. 25 | 26 | Args: 27 | feat_vector: the input 1-d vector. 28 | max_quantized_value: the maximum of the quantized value. 29 | min_quantized_value: the minimum of the quantized value. 30 | 31 | Returns: 32 | A float vector which has the same shape as feat_vector. 33 | """ 34 | assert max_quantized_value > min_quantized_value 35 | quantized_range = max_quantized_value - min_quantized_value 36 | scalar = quantized_range / 255.0 37 | bias = (quantized_range / 512.0) + min_quantized_value 38 | return feat_vector * scalar + bias 39 | 40 | 41 | def MakeSummary(name, value): 42 | """Creates a tf.Summary proto with the given name and value.""" 43 | summary = tf.Summary() 44 | val = summary.value.add() 45 | val.tag = str(name) 46 | val.simple_value = float(value) 47 | return summary 48 | 49 | 50 | def AddGlobalStepSummary(summary_writer, 51 | global_step_val, 52 | global_step_info_dict, 53 | summary_scope="Eval"): 54 | """Add the global_step summary to the Tensorboard. 55 | 56 | Args: 57 | summary_writer: Tensorflow summary_writer. 58 | global_step_val: a int value of the global step. 59 | global_step_info_dict: a dictionary of the evaluation metrics calculated for 60 | a mini-batch. 61 | summary_scope: Train or Eval. 62 | 63 | Returns: 64 | A string of this global_step summary 65 | """ 66 | this_hit_at_one = global_step_info_dict["hit_at_one"] 67 | this_perr = global_step_info_dict["perr"] 68 | this_loss = global_step_info_dict["loss"] 69 | examples_per_second = global_step_info_dict.get("examples_per_second", -1) 70 | 71 | summary_writer.add_summary( 72 | MakeSummary("GlobalStep/" + summary_scope + "_Hit@1", this_hit_at_one), 73 | global_step_val) 74 | summary_writer.add_summary( 75 | MakeSummary("GlobalStep/" + summary_scope + "_Perr", this_perr), 76 | global_step_val) 77 | summary_writer.add_summary( 78 | MakeSummary("GlobalStep/" + summary_scope + "_Loss", this_loss), 79 | global_step_val) 80 | 81 | if examples_per_second != -1: 82 | summary_writer.add_summary( 83 | MakeSummary("GlobalStep/" + summary_scope + "_Example_Second", 84 | examples_per_second), global_step_val) 85 | 86 | summary_writer.flush() 87 | info = ("global_step {0} | Batch Hit@1: {1:.3f} | Batch PERR: {2:.3f} | Batch Loss: {3:.3f} " 88 | "| Examples_per_sec: {4:.3f}").format( 89 | global_step_val, this_hit_at_one, this_perr, this_loss, 90 | examples_per_second) 91 | return info 92 | 93 | 94 | def AddEpochSummary(summary_writer, 95 | global_step_val, 96 | epoch_info_dict, 97 | summary_scope="Eval"): 98 | """Add the epoch summary to the Tensorboard. 99 | 100 | Args: 101 | summary_writer: Tensorflow summary_writer. 102 | global_step_val: a int value of the global step. 103 | epoch_info_dict: a dictionary of the evaluation metrics calculated for the 104 | whole epoch. 105 | summary_scope: Train or Eval. 106 | 107 | Returns: 108 | A string of this global_step summary 109 | """ 110 | epoch_id = epoch_info_dict["epoch_id"] 111 | avg_hit_at_one = epoch_info_dict["avg_hit_at_one"] 112 | avg_perr = epoch_info_dict["avg_perr"] 113 | avg_loss = epoch_info_dict["avg_loss"] 114 | aps = epoch_info_dict["aps"] 115 | gap = epoch_info_dict["gap"] 116 | mean_ap = numpy.mean(aps) 117 | 118 | summary_writer.add_summary( 119 | MakeSummary("Epoch/" + summary_scope + "_Avg_Hit@1", avg_hit_at_one), 120 | global_step_val) 121 | summary_writer.add_summary( 122 | MakeSummary("Epoch/" + summary_scope + "_Avg_Perr", avg_perr), 123 | global_step_val) 124 | summary_writer.add_summary( 125 | MakeSummary("Epoch/" + summary_scope + "_Avg_Loss", avg_loss), 126 | global_step_val) 127 | summary_writer.add_summary( 128 | MakeSummary("Epoch/" + summary_scope + "_MAP", mean_ap), 129 | global_step_val) 130 | summary_writer.add_summary( 131 | MakeSummary("Epoch/" + summary_scope + "_GAP", gap), 132 | global_step_val) 133 | summary_writer.flush() 134 | 135 | info = ("epoch/eval number {0} | Avg_Hit@1: {1:.3f} | Avg_PERR: {2:.3f} " 136 | "| MAP: {3:.3f} | GAP: {4:.3f} | Avg_Loss: {5:3f}").format( 137 | epoch_id, avg_hit_at_one, avg_perr, mean_ap, gap, avg_loss) 138 | return info 139 | 140 | def GetListOfFeatureNamesAndSizes(feature_names, feature_sizes): 141 | """Extract the list of feature names and the dimensionality of each feature 142 | from string of comma separated values. 143 | 144 | Args: 145 | feature_names: string containing comma separated list of feature names 146 | feature_sizes: string containing comma separated list of feature sizes 147 | 148 | Returns: 149 | List of the feature names and list of the dimensionality of each feature. 150 | Elements in the first/second list are strings/integers. 151 | """ 152 | list_of_feature_names = [ 153 | feature_names.strip() for feature_names in feature_names.split(',')] 154 | list_of_feature_sizes = [ 155 | int(feature_sizes) for feature_sizes in feature_sizes.split(',')] 156 | if len(list_of_feature_names) != len(list_of_feature_sizes): 157 | logging.error("length of the feature names (=" + 158 | str(len(list_of_feature_names)) + ") != length of feature " 159 | "sizes (=" + str(len(list_of_feature_sizes)) + ")") 160 | 161 | return list_of_feature_names, list_of_feature_sizes 162 | 163 | def clip_gradient_norms(gradients_to_variables, max_norm): 164 | """Clips the gradients by the given value. 165 | 166 | Args: 167 | gradients_to_variables: A list of gradient to variable pairs (tuples). 168 | max_norm: the maximum norm value. 169 | 170 | Returns: 171 | A list of clipped gradient to variable pairs. 172 | """ 173 | clipped_grads_and_vars = [] 174 | for grad, var in gradients_to_variables: 175 | if grad is not None: 176 | if isinstance(grad, tf.IndexedSlices): 177 | tmp = tf.clip_by_norm(grad.values, max_norm) 178 | grad = tf.IndexedSlices(tmp, grad.indices, grad.dense_shape) 179 | else: 180 | grad = tf.clip_by_norm(grad, max_norm) 181 | clipped_grads_and_vars.append((grad, var)) 182 | return clipped_grads_and_vars 183 | 184 | def combine_gradients(tower_grads): 185 | """Calculate the combined gradient for each shared variable across all towers. 186 | 187 | Note that this function provides a synchronization point across all towers. 188 | 189 | Args: 190 | tower_grads: List of lists of (gradient, variable) tuples. The outer list 191 | is over individual gradients. The inner list is over the gradient 192 | calculation for each tower. 193 | Returns: 194 | List of pairs of (gradient, variable) where the gradient has been summed 195 | across all towers. 196 | """ 197 | filtered_grads = [[x for x in grad_list if x[0] is not None] for grad_list in tower_grads] 198 | final_grads = [] 199 | for i in xrange(len(filtered_grads[0])): 200 | grads = [filtered_grads[t][i] for t in xrange(len(filtered_grads))] 201 | grad = tf.stack([x[0] for x in grads], 0) 202 | grad = tf.reduce_sum(grad, 0) 203 | final_grads.append((grad, filtered_grads[0][i][1],)) 204 | 205 | return final_grads 206 | --------------------------------------------------------------------------------