├── README.md
└── mnist
    ├── README.en.md
    ├── README.md
    ├── __init__.py
    ├── average_precision_calculator.py
    ├── cloudml-4gpu.yaml
    ├── cloudml-gpu-distributed.yaml
    ├── cloudml-gpu.yaml
    ├── convert_prediction_from_json_to_csv.py
    ├── eval.py
    ├── eval_util.py
    ├── export_model.py
    ├── inference.py
    ├── losses.py
    ├── mean_average_precision_calculator.py
    ├── mnist_models.py
    ├── models.py
    ├── readers.py
    ├── train.py
    └── utils.py


/README.md:
--------------------------------------------------------------------------------
1 | # Tutorial repository for Machine Learning Challenge 2017
2 | 
3 | This page contains the tutorial project for Machine Learning Challenge 2017.
4 | 


--------------------------------------------------------------------------------
/mnist/README.en.md:
--------------------------------------------------------------------------------
  1 | # MNIST Tensorflow Starter Code
  2 | 
  3 | 한국어 버전 설명은 다음 링크를 따라가주세요: [Link](README.md)
  4 | 
  5 | This repo contains starter code for training and evaluating machine learning
  6 | models over the mnist dataset.
  7 | 
  8 | The code gives an end-to-end working example for reading the dataset, training a
  9 | TensorFlow model, and evaluating the performance of the model. Out of the box,
 10 | you can train several [model architectures](#overview-of-models) over features.
 11 | The code can easily be extended to train your own custom-defined models.
 12 | 
 13 | It is possible to train and evaluate on mnist in two ways: on Google Cloud
 14 | or on your own machine. This README provides instructions for both.
 15 | 
 16 | ## Table of Contents
 17 | * [Running on Google's Cloud Machine Learning Platform](#running-on-googles-cloud-machine-learning-platform)
 18 |    * [Requirements](#requirements)
 19 |    * [Testing Locally](#testing-locally)
 20 |    * [Training on the Cloud](#training-on-features)
 21 |    * [Evaluation and Inference](#evaluation-and-inference)
 22 |    * [Inference Using Batch Prediction](#inference-using-batch-prediction)
 23 |    * [Accessing Files on Google Cloud](#accessing-files-on-google-cloud)
 24 |    * [Using Larger Machine Types](#using-larger-machine-types)
 25 | * [Running on Your Own Machine](#running-on-your-own-machine)
 26 |    * [Requirements](#requirements-1)
 27 | * [Overview of Models](#overview-of-models)
 28 | * [Overview of Files](#overview-of-files)
 29 |    * [Training](#training)
 30 |    * [Evaluation](#evaluation)
 31 |    * [Inference](#inference)
 32 |    * [Misc](#misc)
 33 | * [TODO for participants](#todo-for-participants)
 34 | * [Etc](#etc)
 35 | * [About This Project](#about-this-project)
 36 | 
 37 | ## Running on Google's Cloud Machine Learning Platform
 38 | 
 39 | ### Requirements
 40 | 
 41 | This option requires you to have an appropriately configured Google Cloud
 42 | Platform account. To create and configure your account, please make sure you
 43 | follow the instructions [here](https://cloud.google.com/ml/docs/how-tos/getting-set-up).
 44 | 
 45 | Please also verify that you have Python 2.7+ and Tensorflow 1.0.1 or higher
 46 | installed by running the following commands:
 47 | 
 48 | ```sh
 49 | python --version
 50 | python -c 'import tensorflow as tf; print(tf.__version__)'
 51 | ```
 52 | 
 53 | ### Testing Locally
 54 | All gcloud commands should be done from the directory *immediately above* the
 55 | source code. You should be able to see the source code directory if you
 56 | run 'ls'.
 57 | 
 58 | As you are developing your own models, you will want to test them
 59 | quickly to flush out simple problems without having to submit them to the cloud.
 60 | You can use the `gcloud beta ml local` set of commands for that.
 61 | Here is an example command line:
 62 | 
 63 | ```sh
 64 | gcloud ml-engine local train \
 65 | --package-path=mnist --module-name=mnist.train -- \
 66 | --train_data_pattern='gs://kmlc_test_train_bucket/mnist/train.tfrecords' \
 67 | --train_dir=/tmp/kmlc_mnist_train --model=LogisticModel --start_new_model
 68 | ```
 69 | 
 70 | You might want to download the training data files to the current directory.
 71 | 
 72 | ```sh
 73 | gsutil cp gs://kmlc_test_train_bucket/mnist/train.tfrecords .
 74 | ```
 75 | 
 76 | Once you download the files, you can point the job to them using the
 77 | 'train_data_pattern' argument (i.e. instead of pointing to the "gs://..."
 78 | files, you point to the local files).
 79 | 
 80 | Once your model is working locally, you can scale up on the Cloud
 81 | which is described below.
 82 | 
 83 | ### Training on the Cloud
 84 | 
 85 | You'll use Google Cloud to access the training and test files. You'll also be given free credits to try out Google Cloud. Below are some step-by-step tutorials to set up and get data, and submit training/testing jobs to Google Cloud ML.
 86 | 
 87 | #### Set up your Google Cloud project
 88 | 
 89 | 1. Create a new Cloud Platform project. This is where your project lives.
 90 |    - Click Create Project and follow instructions.
 91 |    - Enable billing for your project. This links a billing method to your project. For a new account, you will already have $300 in trial credits within your default billing account.
 92 |    - [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,dataflow,compute_component,logging,storage_component,storage_api,bigquery) but ignore adding Credentials. This enables the set of Cloud APIs that are needed for Cloud ML functionality such as Cloud Logging to get your training logs. Other APIs include: Cloud Machine Learning, Dataflow, Compute Engine, Cloud Storage, Cloud Storage JSON, and BigQuery APIs.
 93 |    - You may have to select the newly-created project.
 94 |    - Expect this to take some time.
 95 |    - After the APIs are enabled, do not “Go to Credentials”
 96 | 2. Set up your environment using cloud shell
 97 |    - There are three paths to use Google Cloud for this competition: Cloud shell, local (Mac/Linux), & Docker. To start we recommend the cloud shell to avoid having to set up a local environment.
 98 |    - Before you click the cloud shell button, make sure that you have already selected your newly-created project (in the screenshot example, the project name is My First Project)
 99 |    - You can start a cloud shell by clicking on the cloud shell icon shown in the screenshot below.
100 |      <img src="https://codelabs.developers.google.com/codelabs/cpb100-cloud-sql/img/d91c4415ec90a651.png" />
101 |    - You should run all of the following commands inside of the cloud shell command line.
102 |    - The first step to setting up the environment is to configure the gcloud command-line tool to use your selected project, where [selected-project-id] is your project id, without the enclosing brackets.
103 |      ``` sh
104 |      gcloud config set project [selected-project-id]
105 |      ```
106 |    - Python version should be 2.7+
107 |      ``` sh
108 |      $ python --version
109 |      Python 2.7.9
110 |      ```
111 |    - Install the latest version of TensorFlow (1.2.1) with the following 2 command lines.
112 |      ```sh
113 |      pip download tensorflow
114 |      pip install --user -U tensorflow*.whl
115 |      ```
116 | 3. Verify the Google Cloud SDK Components
117 |    - List the models to verify that the command returns an empty list.
118 |    ```sh
119 |    gcloud ml-engine models list
120 |    ```
121 |    - The command will an empty list, and after you start creating models, you can see them listed by using this command.
122 | 
123 | #### Running training
124 | 
125 | The following commands will train a model on Google Cloud. Following commands are need to be executed on Google Cloud Shell.
126 | 
127 | ```sh
128 | BUCKET_NAME=gs://${USER}_kmlc_mnist_train_bucket
129 | # (One Time) Create a storage bucket to store training logs and checkpoints.
130 | gsutil mb -l us-east1 $BUCKET_NAME
131 | # Submit the training job.
132 | JOB_NAME=kmlc_mnist_train_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \
133 | submit training $JOB_NAME \
134 | --package-path=mnist --module-name=mnist.train \
135 | --staging-bucket=$BUCKET_NAME --region=us-east1 \
136 | --config=mnist/cloudml-gpu.yaml \
137 | -- --train_data_pattern='gs://kmlc_test_train_bucket/mnist/train.tfrecords' \
138 | --model=LogisticModel \
139 | --train_dir=$BUCKET_NAME/kmlc_mnist_train_logistic_model
140 | ```
141 | 
142 | In the 'gsutil' command above, the 'package-path' flag refers to the directory
143 | containing the 'train.py' script and more generally the python package which
144 | should be deployed to the cloud worker. The module-name refers to the specific
145 | python script which should be executed (in this case the train module).
146 | 
147 | It may take several minutes before the job starts running on Google Cloud.
148 | When it starts you will see outputs like the following:
149 | 
150 | ```
151 | training step 270| Hit@1: 0.68 PERR: 0.52 Loss: 638.453
152 | training step 271| Hit@1: 0.66 PERR: 0.49 Loss: 635.537
153 | training step 272| Hit@1: 0.70 PERR: 0.52 Loss: 637.564
154 | ```
155 | 
156 | At this point you can disconnect your console by pressing "ctrl-c". The
157 | model will continue to train indefinitely in the Cloud. Later, you can check
158 | on its progress or halt the job by visiting the
159 | [Google Cloud ML Jobs console](https://console.cloud.google.com/ml/jobs).
160 | 
161 | You can train many jobs at once and use tensorboard to compare their performance
162 | visually.
163 | 
164 | ```sh
165 | tensorboard --logdir=$BUCKET_NAME --port=8080
166 | ```
167 | 
168 | Once tensorboard is running, you can access it at the following url:
169 | [http://localhost:8080](http://localhost:8080).
170 | If you are using Google Cloud Shell, you can instead click the Web Preview button
171 | on the upper left corner of the Cloud Shell window and select "Preview on port 8080".
172 | This will bring up a new browser tab with the Tensorboard view.
173 | 
174 | ### Evaluation and Inference
175 | Here's how to evaluate a model on the validation dataset:
176 | 
177 | ```sh
178 | JOB_TO_EVAL=kmlc_mnist_train_logistic_model
179 | JOB_NAME=kmlc_mnist_eval_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \
180 | submit training $JOB_NAME \
181 | --package-path=mnist --module-name=mnist.eval \
182 | --staging-bucket=$BUCKET_NAME --region=us-east1 \
183 | --config=mnist/cloudml-gpu.yaml \
184 | -- --eval_data_pattern='gs://kmlc_test_train_bucket/mnist/validation.tfrecords' \
185 | --model=LogisticModel \
186 | --train_dir=$BUCKET_NAME/${JOB_TO_EVAL} --run_once=True
187 | ```
188 | 
189 | And here's how to perform inference with a model on the test set:
190 | 
191 | ```sh
192 | JOB_TO_EVAL=kmlc_mnist_train_logistic_model
193 | JOB_NAME=kmlc_mnist_inference_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \
194 | submit training $JOB_NAME \
195 | --package-path=mnist --module-name=mnist.inference \
196 | --staging-bucket=$BUCKET_NAME --region=us-east1 \
197 | --config=mnist/cloudml-gpu.yaml \
198 | -- --input_data_pattern='gs://kmlc_test_train_bucket/mnist/test.tfrecords' \
199 | --train_dir=$BUCKET_NAME/${JOB_TO_EVAL} \
200 | --output_file=$BUCKET_NAME/${JOB_TO_EVAL}/predictions.csv
201 | ```
202 | 
203 | Note the confusing use of 'training' in the above gcloud commands. Despite the
204 | name, the 'training' argument really just offers a cloud hosted
205 | python/tensorflow service. From the point of view of the Cloud Platform, there
206 | is no distinction between our training and inference jobs. The Cloud ML platform
207 | also offers specialized functionality for prediction with
208 | Tensorflow models, but discussing that is beyond the scope of this readme.
209 | 
210 | Once these job starts executing you will see outputs similar to the
211 | following for the evaluation code:
212 | 
213 | ```
214 | examples_processed: 1024 | global_step 447044 | Batch Hit@1: 0.782 | Batch PERR: 0.637 | Batch Loss: 7.821 | Examples_per_sec: 834.658
215 | ```
216 | 
217 | and the following for the inference code:
218 | 
219 | ```
220 | num examples processed: 8192 elapsed seconds: 14.85
221 | ```
222 | 
223 | ### Inference Using Batch Prediction
224 | To perform inference faster, you can also use the Cloud ML batch prediction
225 | service.
226 | 
227 | First, find the directory where the training job exported the model:
228 | 
229 | ```
230 | gsutil list ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export
231 | ```
232 | 
233 | You should see an output similar to this one:
234 | 
235 | ```
236 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/
237 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_1/
238 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_1001/
239 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_2001/
240 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_3001/
241 | ```
242 | 
243 | Select the latest version of the model that was saved. For instance, in our
244 | case, we select the version of the model that was saved at step 3001:
245 | 
246 | ```
247 | EXPORTED_MODEL_DIR=${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_3001/
248 | ```
249 | 
250 | Start the batch prediction job using the following command:
251 | 
252 | ```
253 | JOB_NAME=kmlc_mnist_batch_predict_$(date +%Y%m%d_%H%M%S); \
254 | gcloud ml-engine jobs submit prediction ${JOB_NAME} --verbosity=debug \
255 | --model-dir=${EXPORTED_MODEL_DIR} --data-format=TF_RECORD \
256 | --input-paths='gs://kmlc_test_train_bucket/mnist/test.tfrecords' \
257 | --output-path=${BUCKET_NAME}/batch_predict/${JOB_NAME}.csv --region=us-east1 \
258 | --runtime-version=1.2 --max-worker-count=10
259 | ```
260 | 
261 | You can check the progress of the job on the
262 | [Google Cloud ML Jobs console](https://console.cloud.google.com/ml/jobs). To
263 | have the job complete faster, you can increase 'max-worker-count' to a
264 | higher value.
265 | 
266 | Once the batch prediction job has completed, turn its output into a submission
267 | in the CVS format by running the following commands:
268 | 
269 | ```
270 | # Copy the output of the batch prediction job to a local directory
271 | mkdir -p /tmp/batch_predict/${JOB_NAME}
272 | gsutil -m cp -r ${BUCKET_NAME}/batch_predict/${JOB_NAME}.csv/* /tmp/batch_predict/${JOB_NAME}.csv
273 | ```
274 | 
275 | Submit the resulting file /tmp/batch_predict/${JOB_NAME}.csv to [Kaggle](https://inclass.kaggle.com/c/mnist-tutorial-machine-learning-challenge).
276 | 
277 | ### Accessing Files on Google Cloud
278 | 
279 | You can browse the storage buckets you created on Google Cloud, for example, to
280 | access the trained models, prediction CSV files, etc. by visiting the
281 | [Google Cloud storage browser](https://console.cloud.google.com/storage/browser).
282 | 
283 | Alternatively, you can use the 'gsutil' command to download the files directly.
284 | For example, to download the output of the inference code from the previous
285 | section to your local machine, run:
286 | 
287 | 
288 | ```
289 | gsutil cp $BUCKET_NAME/${JOB_TO_EVAL}/predictions.csv .
290 | ```
291 | 
292 | 
293 | ### Using Larger Machine Types
294 | Some complex models can take as long as a week to converge when
295 | using only one GPU. You can train these models more quickly by using more
296 | powerful machine types which have additional GPUs. To use a configuration with
297 | 4 GPUs, replace the argument to `--config` with `mnist/cloudml-4gpu.yaml`.
298 | Be careful with this argument as it will also increase the rate you are charged
299 | by a factor of 4 as well.
300 | 
301 | ## Running on Your Own Machine
302 | 
303 | ### Requirements
304 | 
305 | The starter code requires Tensorflow. If you haven't installed it yet, follow
306 | the instructions on [tensorflow.org](https://www.tensorflow.org/install/).
307 | This code has been tested with Tensorflow 1.0.1. Going forward, we will continue
308 | to target the latest released version of Tensorflow.
309 | 
310 | Please verify that you have Python 2.7+ and Tensorflow 1.0.1 or higher
311 | installed by running the following commands:
312 | 
313 | ```sh
314 | python --version
315 | python -c 'import tensorflow as tf; print(tf.__version__)'
316 | ```
317 | 
318 | Downloading files
319 | ``` sh
320 | gsutil cp gs://kmlc_test_train_bucket/mnist/train* .
321 | gsutil cp gs://kmlc_test_train_bucket/mnist/test* .
322 | gsutil cp gs://kmlc_test_train_bucket/mnist/validation* .
323 | ```
324 | 
325 | Training
326 | ```sh
327 | python train.py --train_data_pattern='/path/to/training/files/*' --train_dir=/tmp/mnist_train --model=LogisticModel --start_new_model
328 | ```
329 | 
330 | Validation
331 | ```sh
332 | python eval.py --eval_data_pattern='/path/to/validation/files' --train_dir=/tmp/mnist_train --model=LogisticModel --run_once=True
333 | ```
334 | 
335 | Generating submission
336 | ```sh
337 | python inference.py --output_file=/path/to/predictions.csv --input_data_pattern='/path/to/test/files/*' --train_dir=/tmp/mnist_train
338 | ```
339 | 
340 | ## Overview of Models
341 | 
342 | This sample code contains implementation of the logistic model:
343 | 
344 | *   `LogisticModel`: Linear projection of the output features into the label
345 |                      space, followed by a sigmoid function to convert logit
346 |                      values to probabilities.
347 | 
348 | ## Overview of Files
349 | 
350 | ### Training
351 | *   `train.py`: Defines the parameters and procedures for training. You can modify the parameters such as the location of training dataset, the model to be used for training, the batch size, the loss function to be used, the learning rate, etc. Depending on the model, you may want to modify get_input_data_tensors() on how data is shuffled.
352 | *   `losses.py`: Defines the loss functions. You can call train.py to use any of the loss functions defined in losses.py.
353 | *   `models.py`: Contains the base class for defining a model.
354 | *   `mnist_models.py`: Contains the definition for models that that take the aggregated features as input, and you should add your own models here. You can invoke any model by calling train.py using --model=YourModelName.
355 | *   `export_model.py`: Provides a class to export a model during training for later use in batch prediction.
356 | *   `readers.py`: Contains the definition of the dataset, and describes how input data are prepared. You can preprocess the input files by modifying prepare_serialized_examples(). For example, you may want to resize the data or introduce some random noise.
357 | 
358 | ### Evaluation
359 | *   `eval.py`: The primary script for evaluating models. Once the model is trained, you can call eval.py with the --train_dir=/path/to/model and --model=YourModelName to load your model from the files in train_dir. Most likely you do not need to modify this file.
360 | *   `eval_util.py`: Provides a class that calculates all evaluation metrics.
361 | *   `average_precision_calculator.py`: Functions for calculating average precision.
362 | *   `mean_average_precision_calculator.py`: Functions for calculating mean average precision.
363 | 
364 | ### Inference
365 | *   `inference.py`: Generates an output file containing predictions of the model over a set of data. Call inference.py on the test data to generate a list of predicted labels. For the supervised learning problems, the evaluation is based on the accuracy of the predicted labels.
366 | 
367 | ### Misc
368 | *   `README.md`: This documentation.
369 | *   `utils.py`: Common functions.
370 | *   `convert_prediction_from_json_to_csv.py`: Converts the JSON output of batch prediction into a CSV file for submission.
371 | 
372 | ## TODO for participants
373 | * Create your model in mnist_models.py.
374 | * Create a loss function in losses.py.
375 | * If necessary, add input preprocessing in readers.py.
376 | * Adjust the parameters in train.py, such as the batch size and learning rate, and modify the training procedure if it is necessary.
377 | * Train your model by calling train.py using your model name and loss function.
378 | * Call eval.py to examine the performance of trained model on validation data.
379 | * Call inference.py to obtain the predicted labels using your model, and submit these labels in [Kaggle](https://inclass.kaggle.com/c/mnist-tutorial-machine-learning-challenge) for evaluation.
380 | * In general, you are free to modify any file if it improves the performance of your model.
381 | 
382 | ## Etc
383 | * [Tensorflow MNIST Tutorial](https://www.tensorflow.org/get_started/mnist/beginners)
384 | 
385 | ## About This Project
386 | This project is meant help people quickly get started working with the
387 | [mnist](https://inclass.kaggle.com/c/mnist-tutorial-machine-learning-challenge) dataset.
388 | This is not an official Google product.
389 | 


--------------------------------------------------------------------------------
/mnist/README.md:
--------------------------------------------------------------------------------
  1 | # MNIST Tensorflow Starter Code
  2 | 
  3 | For the English version description, please follow the link: [Link](README.en.md)
  4 | 
  5 | 본 저장소는 MNIST 문제를 해결하기 위한 트레이닝과 머신러닝 모델 평가를 위한 기본적인 코드를 담고 있습니다.
  6 | 
  7 | 본 저장소에 있는 코드는 데이터를 읽고, TensorFlow 모델을 훈련하고, 모델의 성능을 평가하는 일련의 과정에 대한 예제 코드입니다.
  8 | 여러분은 데이터를 이용해서 [model architectures](#overview-of-models)에 주어진 것과 같은 여러 모델들을 훈련할 수 있습니다.
  9 | 또한, 본 코드는 주어진 모델 외에도 여러분의 커스텀 모델에도 쉽게 적용될 수 있도록 디자인 되어 있습니다.
 10 | 
 11 | MNIST를 훈련시키고 평가하는 방법에는 두 가지가 있습니다: Google Cloud 위에서 작업하실 수도 있고, 여러분의 로컬 머신에서 작업하실 수도 있습니다.
 12 | 본 README는 양쪽 방법 모두를 다루고자 합니다.
 13 | 
 14 | ## Table of Contents
 15 | * [Running on Google's Cloud Machine Learning Platform](#running-on-googles-cloud-machine-learning-platform)
 16 |    * [Requirements](#requirements)
 17 |    * [Testing Locally](#testing-locally)
 18 |    * [Training on the Cloud](#training-on-features)
 19 |    * [Evaluation and Inference](#evaluation-and-inference)
 20 |    * [Inference Using Batch Prediction](#inference-using-batch-prediction)
 21 |    * [Accessing Files on Google Cloud](#accessing-files-on-google-cloud)
 22 |    * [Using Larger Machine Types](#using-larger-machine-types)
 23 | * [Running on Your Own Machine](#running-on-your-own-machine)
 24 |    * [Requirements](#requirements-1)
 25 | * [Overview of Models](#overview-of-models)
 26 | * [Overview of Files](#overview-of-files)
 27 |    * [Training](#training)
 28 |    * [Evaluation](#evaluation)
 29 |    * [Inference](#inference)
 30 |    * [Misc](#misc)
 31 | * [TODO for paticiapnts](#todo-for-participants)
 32 | * [Etc](#etc)
 33 | * [About This Project](#about-this-project)
 34 | 
 35 | ## Running on Google's Cloud Machine Learning Platform
 36 | 
 37 | ### Requirements
 38 | 
 39 | 먼저 여러분의 계정이 Google Cloud Platform을 사용할 수 있도록 세팅되어야 합니다.
 40 | 본 [링크](https://cloud.google.com/ml/docs/how-tos/getting-set-up)를 따라서 계정 생성 및 설정을 진행해주세요.
 41 | 
 42 | 또한, 다음 명령들을 실행하여 Python 2.7 이상, 그리고 TensorFlow 1.0.1 이상 버전이 설치되어 있는지 확인하시기 바랍니다:
 43 | 
 44 | ```sh
 45 | python --version
 46 | python -c 'import tensorflow as tf; print(tf.__version__)'
 47 | ```
 48 | 
 49 | ### Testing Locally
 50 | 다음 gcloud 명령들은 모두 *저장소 루트* 경로에서 실행되어야 합니다.
 51 | 현재 디렉토리의 위치는 명령어 'ls'를 통하여 확인하실 수 있습니다.
 52 | 
 53 | 모델 개발 도중, 간단한 이슈들을 확인하기 위해 클라우드에 올릴 필요 없이 바로 로컬에서 테스트 해 보실 수 있습니다.
 54 | `gcloud beta ml local` 와 같은 명령어를 통해 사용하시면 되며, 예제는 다음과 같습니다:
 55 | 
 56 | ```sh
 57 | gcloud ml-engine local train \
 58 | --package-path=mnist --module-name=mnist.train -- \
 59 | --train_data_pattern='gs://kmlc_test_train_bucket/mnist/train.tfrecords' \
 60 | --train_dir=/tmp/kmlc_mnist_train --model=LogisticModel --start_new_model
 61 | ```
 62 | 
 63 | 훈련 데이터를 현 디렉토리에 받는 법은 다음과 같습니다.
 64 | 
 65 | ```sh
 66 | gsutil cp gs://kmlc_test_train_bucket/mnist/train.tfrecords .
 67 | ```
 68 | 
 69 | 파일을 다운로드 받고 나면, `train_data_pattern` 인자를 통하여 job에 해당 파일을 입력으로 사용할 수 있습니다.("gs://..." 경로를 통하여 원격에 있는 파일이 아닌, 로컬 파일을 지정할 수 있습니다.)
 70 | 
 71 | 모델이 로컬에서 동작하는 것을 확인하셨다면, 아래 쓰인 방법을 통하여 클라우드에서도 시도해 보실 차례입니다.
 72 | 
 73 | ### Training on the Cloud
 74 | 
 75 | 이번 항목에서는 훈련 데이터와 테스트 파일들에 접근하기 위해 Google Cloud를 사용할 것입니다. 과거에 해당 계정이서 Google Cloud를 사용하신 적이 없다면 $300 상당의 Credit이 주어지게 됩니다. 다음 스텝들을 따라 가시면 프로젝트 초기화, 데이터 다운로드 및 Google Cloud ML상에서의 훈련 / 테스팅 등을 경험하실 수 있습니다.
 76 | 
 77 | #### Set up your Google Cloud project
 78 | 
 79 | [Link](https://mlchallenge2017.com/references-kr.html)에 쓰인 프로젝트 생성 및 개발환경 설정을 따라주시기 바랍니다.
 80 | 
 81 | #### Running training
 82 | 
 83 | 다음 명령어는 해당 모델을 Google Cloud 위에서 훈련시킬 것입니다. 다음 명령어들은 Google Cloud Shell위에서 실행되어야 합니다.
 84 | 
 85 | ```sh
 86 | BUCKET_NAME=gs://${USER}_kmlc_mnist_train_bucket
 87 | # (One Time) Create a storage bucket to store training logs and checkpoints.
 88 | gsutil mb -l us-east1 $BUCKET_NAME
 89 | # Submit the training job.
 90 | JOB_NAME=kmlc_mnist_train_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \
 91 | submit training $JOB_NAME \
 92 | --package-path=mnist --module-name=mnist.train \
 93 | --staging-bucket=$BUCKET_NAME --region=us-east1 \
 94 | --config=mnist/cloudml-gpu.yaml \
 95 | -- --train_data_pattern='gs://kmlc_test_train_bucket/mnist/train.tfrecords' \
 96 | --model=LogisticModel \
 97 | --train_dir=$BUCKET_NAME/kmlc_mnist_train_logistic_model
 98 | ```
 99 | 
100 | 위 `gsutil` 명령어에서 'package-path' 플래그는 'train.py' 스크립트를 포함하고 있는 경로를 의미하며, 동시에 Cloud worker에 업로드 될 패키지를 의미하기도 합니다. 'module-name'은 실행되어야 할 파이선 스크립트를 지정하는 플래그입니다.(본 예제에서는 train module을 사용하고 있습니다.)
101 | 
102 | Google Cloud에서 업로드 후 job들이 실행되기까지는 잠시 시간이 걸립니다.
103 | 실행이 되고 나면 다음과 같은 메세지들을 보실 수 있습니다:
104 | 
105 | ```
106 | training step 270| Hit@1: 0.68 PERR: 0.52 Loss: 638.453
107 | training step 271| Hit@1: 0.66 PERR: 0.49 Loss: 635.537
108 | training step 272| Hit@1: 0.70 PERR: 0.52 Loss: 637.564
109 | ```
110 | 
111 | 작업 도중 "ctrl-c" 를 눌러 콘솔에서 접속을 끊을 수 있습니다. 모델은 클라우드 상에서 독립적으로 계속 훈련이 진행되며, 해당 job에 대한 진행 상황을 확인하거나 멈추는 등의 작업은 [Google Cloud ML Jobs console](https://console.cloud.google.com/ml/jobs) 를 통해서 하실 수 있습니다.
112 | 
113 | 여러 job들을 동시에 훈련시킬 수도 있으며, tensorboard를 통해서 모델들의 퍼포먼스를 시각화하여 보실 수 있습니다.
114 | 
115 | ```sh
116 | tensorboard --logdir=$BUCKET_NAME --port=8080
117 | ```
118 | 
119 | Tensorboard가 실행되고 나면 다음 명령어를 통해서 Tensorboard를 보실 수 있습니다: [http://localhost:8080](http://localhost:8080)
120 | 만약 Google Cloud Shell에서 실행하셨다면 콘솔 위쪽에 있는 Web Preview 버튼을 누르고 "Preview on port 8080" 메뉴를 통해서 Tensorboard를 보실 수 있습니다.
121 | 
122 | ### Evaluation and Inference
123 | 다음은 모델을 Validation dataset위에서 평가하는 방법입니다:
124 | 
125 | ```sh
126 | JOB_TO_EVAL=kmlc_mnist_train_logistic_model
127 | JOB_NAME=kmlc_mnist_eval_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \
128 | submit training $JOB_NAME \
129 | --package-path=mnist --module-name=mnist.eval \
130 | --staging-bucket=$BUCKET_NAME --region=us-east1 \
131 | --config=mnist/cloudml-gpu.yaml \
132 | -- --eval_data_pattern='gs://kmlc_test_train_bucket/mnist/validation.tfrecords' \
133 | --model=LogisticModel \
134 | --train_dir=$BUCKET_NAME/${JOB_TO_EVAL} --run_once=True
135 | ```
136 | 
137 | 다음은 모델을 Test set위에서 실행하는 방법입니다:
138 | 
139 | ```sh
140 | JOB_TO_EVAL=kmlc_mnist_train_logistic_model
141 | JOB_NAME=kmlc_mnist_inference_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \
142 | submit training $JOB_NAME \
143 | --package-path=mnist --module-name=mnist.inference \
144 | --staging-bucket=$BUCKET_NAME --region=us-east1 \
145 | --config=mnist/cloudml-gpu.yaml \
146 | -- --input_data_pattern='gs://kmlc_test_train_bucket/mnist/test.tfrecords' \
147 | --train_dir=$BUCKET_NAME/${JOB_TO_EVAL} \
148 | --output_file=$BUCKET_NAME/${JOB_TO_EVAL}/predictions.csv
149 | ```
150 | 
151 | 위 gcloud 명령어에서 'training'이라는 부분이 다소 착각의 여지가 있습니다. 이름과는 다르게, 'training'이란 명령어는 클라우드가 호스팅하는 Python/Tensorflow 서비스를 제공하는 일을 합니다. Cloud Platform의 관점에서 보면 training과 inference는 아무 차이가 없는 일이기 때문입니다. Cloud ML Platform은 Tensorflow를 통한 예측을 위한 특별한 기능들을 제공하기는 하나 본 문서에서는 다루지 않겠습니다.
152 | 
153 | Job들이 시작되고 나면 Evaluation 코드에 대해서는 다음과 같은 메세지를 보시게 됩니다:
154 | 
155 | ```
156 | examples_processed: 1024 | global_step 447044 | Batch Hit@1: 0.782 | Batch PERR: 0.637 | Batch Loss: 7.821 | Examples_per_sec: 834.658
157 | ```
158 | 
159 | Inference 코드에 대해서는 다음과 같은 메세지를 보시게 됩니다:
160 | 
161 | ```
162 | num examples processed: 8192 elapsed seconds: 14.85
163 | ```
164 | 
165 | ### Inference Using Batch Prediction
166 | Inference를 좀 더 빠르게 진행하기 위하여 Cloud ML Batch Prediction을 사용하실 수 있습니다.
167 | 
168 | 먼저, Training job이 모델을 내보내는 디렉토리를 찾아주세요:
169 | 
170 | ```
171 | gsutil list ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export
172 | ```
173 | 
174 | 다음과 비슷한 출력을 보시게 될 겁니다:
175 | 
176 | ```
177 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/
178 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_1/
179 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_1001/
180 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_2001/
181 | ${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_3001/
182 | ```
183 | 
184 | 가장 최근에 저장된 모델을 선택하세요. 우리는 3001번째 step에 저장된 모델을 가지고 진행하도록 하겠습니다:
185 | 
186 | ```
187 | EXPORTED_MODEL_DIR=${BUCKET_NAME}/kmlc_mnist_train_logistic_model/export/step_3001/
188 | ```
189 | 
190 | 다음 명령어를 사용하여 Batch Prediction을 시작하세요:
191 | 
192 | ```
193 | JOB_NAME=kmlc_mnist_batch_predict_$(date +%Y%m%d_%H%M%S); \
194 | gcloud ml-engine jobs submit prediction ${JOB_NAME} --verbosity=debug \
195 | --model-dir=${EXPORTED_MODEL_DIR} --data-format=TF_RECORD \
196 | --input-paths='gs://kmlc_test_train_bucket/mnist/test.tfrecords' \
197 | --output-path=${BUCKET_NAME}/batch_predict/${JOB_NAME}.csv --region=us-east1 \
198 | --runtime-version=1.2 --max-worker-count=10
199 | ```
200 | 
201 | 이후 진척 상황은 [Google Cloud ML Jobs console](https://console.cloud.google.com/ml/jobs)에서 확인하실 수 있습니다. Job을 좀 더 빠르게 진행하기를 원하시면  'max-worker-count'를 증가시키시면 됩니다.
202 | 
203 | Batch Prediction Job이 끝나면 다음 명령어를 통해서 결과물을 CSV 형식으로 저장해주세요:
204 | 
205 | ```
206 | # Copy the output of the batch prediction job to a local directory
207 | mkdir -p /tmp/batch_predict/${JOB_NAME}
208 | gsutil -m cp -r ${BUCKET_NAME}/batch_predict/${JOB_NAME}.csv /tmp/batch_predict/${JOB_NAME}.csv
209 | ```
210 | 
211 | /tmp/batch_predict/${JOB_NAME}.csv 를 [Kaggle](https://inclass.kaggle.com/c/mnist-tutorial-machine-learning-challenge)에 제출해주세요.
212 | 
213 | 
214 | ### Accessing Files on Google Cloud
215 | 
216 | [Google Cloud storage browser](https://console.cloud.google.com/storage/browser)을 통해서 이전에 생성한 storage bucket을 직접 조회할 수 있습니다. 해당 버킷에 저장된 Trained model, CSV 파일 등을 직접 조회할 수 있습니다.
217 | 
218 | 다른 방법으로는 'gsutil' 명령어를 통해서 파일을 직접 다운로드 받을 수도 있습니다.
219 | 예를 들어 방금 생성한 Prediction CSV를 로컬 머신으로 다운로드 받고자 한다면 다음 명령어를 실행하시면 됩니다:
220 | 
221 | ```
222 | gsutil cp $BUCKET_NAME/${JOB_TO_EVAL}/predictions.csv .
223 | ```
224 | 
225 | ### Using Larger Machine Types
226 | 몇몇 복잡한 모델들은 단일 GPU로 훈련시 모델 수렴까지 일주일 이상이 걸리기도 합니다. 이러한 모델들은 좀 더 많은 GPU가 있는 강력한 머신으로 좀 더 빠르게 훈련이 가능합니다. 4개의 GPU가 있는 머신을 사용하하기 위해서는, '--config' 플래그를 'mnist/cloudml-4gpu.yaml'로 변경해주세요. 이 플래그는 소모하는 Cloud credit 역시 4배로 늘어난다는 점 주의하시기 바랍니다.
227 | 
228 | ## Running on Your Own Machine
229 | 
230 | ### Requirements
231 | 
232 | 시작하기 전 Tensorflow가 준비되어 있어야 합니다. 만약 아직 설치하지 않으셨다면 [tensorflow.org](https://www.tensorflow.org/install/)에 쓰인 설명을 따라주세요.
233 | 
234 | 다음 명령어를 통해 Python 2.7 이상, 그리고 Tensorflow 1.0.1 이상이 설치되어 있는지 확인하시기 바랍니다:
235 | 
236 | ```sh
237 | python --version
238 | python -c 'import tensorflow as tf; print(tf.__version__)'
239 | ```
240 | 
241 | Downloading files
242 | ``` sh
243 | gsutil cp gs://kmlc_test_train_bucket/mnist/train* .
244 | gsutil cp gs://kmlc_test_train_bucket/mnist/test* .
245 | gsutil cp gs://kmlc_test_train_bucket/mnist/validation* .
246 | ```
247 | 
248 | Training
249 | ```sh
250 | python train.py --train_data_pattern='/path/to/training/files/*' --train_dir=/tmp/mnist_train --model=LogisticModel --start_new_model
251 | ```
252 | 
253 | Validation
254 | ```sh
255 | python eval.py --eval_data_pattern='/path/to/validation/files' --train_dir=/tmp/mnist_train --model=LogisticModel --run_once=True
256 | ```
257 | 
258 | Generating submission
259 | ```sh
260 | python inference.py --output_file=/path/to/predictions.csv --input_data_pattern='/path/to/test/files/*' --train_dir=/tmp/mnist_train
261 | ```
262 | 
263 | ## Overview of Models
264 | 
265 | 예제 코드는 Logistic model의 구현체를 담고 있습니다:
266 | 
267 | *   `[LogisticModel](https://ko.wikipedia.org/wiki/%EB%A1%9C%EC%A7%80%EC%8A%A4%ED%8B%B1_%ED%9A%8C%EA%B7%80)`: 독립 변수의 선형 결합을 이용하여 사건의 발생 가능성을 예측하는데 사용되는 통계 기법입니다.
268 | 
269 | ## Overview of Files
270 | 
271 | ### Training
272 | *   `train.py`: 훈련을 위한 전달인자와 과정을 정의합니다. 변경 가능한 전달인자로는 훈련 데이터셋의 위치, 훈련에 사용될 모델, 배치 사이즈, 로스 함수, 학습 레이트 등이 있습니다. 모델에 따라 get_input_data_tensors()를 변경하여 데이터가 셔플되는 과정을 수정하실 수도 있습니다.
273 | *   `losses.py`: 로스 함수를 정의합니다. losses.py에 정의된 어떤 로스 함수도 train.py에서 사용하실 수 있습니다.
274 | *   `models.py`: 모델을 정의하기 위한 Base class를 포함하고 있습니다.
275 | *   `mnist_models.py`: 인풋에서 관찰해야 하는 특성들을 입력으로 받을 수 있는 모델에 대한 정의가 있습니다. 여러분은 여러분만의 모델 또한 이 곳에 정의하셔야 합니다. train.py를 호출할 떄 인자로 --model=YourModelName 을 전달하여 여러분의 모델을 부를 수 있습니다.
276 | *   `export_model.py`: 배치 예측 작업에서 쓰일 모델을 추출하기 위한 클래스를 제공합니다.
277 | *   `readers.py`: 데이터에 대한 정의와 인풋 데이터가 어떤 식으로 준비되어야 하는지를 표기하고 있습니다. prepare_serialized_examples()를 변경함으로써 인풋 파일을 프리프로세스 할 수도 있습니다. 예를 들면 데이터를 리사이즈하거나 랜덤 노이즈를 섞을 수 있습니다.
278 | 
279 | ### Evaluation
280 | *   `eval.py`: 모델을 평가하기 위한 기본적인 스크립트입니다. 모델이 훈련되고 나면 --train_dir=/path/to/model 과 --model=YourModelName 인자와 함께 eval.py를 호출하여 train_dir 경로 안에 있는 모델을 불러올 수 있습니다. 보통은 이 파일을 변경할 필요는 없습니다.
281 | *   `eval_util.py`: 모든 종류의 평가 메트릭들을 계산하는 클래스를 제공합니다.
282 | *   `average_precision_calculator.py`: Average precision을 계산하는 함수들을 제공합니다.
283 | *   `mean_average_precision_calculator.py`: Mean average precision을 계산하는 함수들을 제공합니다.
284 | 
285 | ### Inference
286 | *   `inference.py`: inference.py를 호출하여 테스트 데이터에 대한 예측 레이블들을 생성하세요. Supervised learning 문제에서 채점은 Predicted label의 정확도에 기반하여 진행됩니다.
287 | 
288 | ### Misc
289 | *   `README.md`: 이 문서입니다.
290 | *   `utils.py`: 일반적인 유틸리티 함수입니다.
291 | *   `convert_prediction_from_json_to_csv.py`: 배치 prediction의 JSON 출력 파일을 CSV로 변경합니다.
292 | 
293 | ## TODO for participants
294 | * mnist_models.py 안에 여러분만의 모델을 생성하세요.
295 | * losses.py 안에 로스 함수를 생성하세요.
296 | * 필요하다면, readers.py 안에 입력 프리프로세싱을 추가해주세요
297 | * train.py 파일 안에서 batch size나 learning rate와 같은 parameter들을 조절하고, 필요하다면 training procedure도 수정해주세요.
298 | * 여러분의 모델 이름과 로스 함수를 이용해서 train.py를 호출하여 모델을 훈련하세요.
299 | * eval.py를 호출하여 검증 데이터 위에서 훈련된 모델의 성능을 검증하세요.
300 | * inference.py를 통하여 훈련된 모델을 테스트 데이터에 적용하고, 결과 예측 레이블 정보를 채점을 위해 [Kaggle](https://inclass.kaggle.com/c/mnist-tutorial-machine-learning-challenge)에 제출하세요.
301 | * 본 시나리오는 MLC에 참가하기 위한 가장 쉬운 시나리오 중 하나이며, 여러분만의 솔루션과 더 나은 성능을 위해서라면 어떤 파일을 수정하셔도 무방합니다.
302 | 
303 | ## Etc
304 | * [Tensorflow MNIST Tutorial](https://www.tensorflow.org/get_started/mnist/beginners)
305 | 
306 | ## About This Project
307 | This project is meant help people quickly get started working with the
308 | [mnist](https://inclass.kaggle.com/c/mnist-tutorial-machine-learning-challenge) dataset.
309 | This is not an official Google product.
310 | 


--------------------------------------------------------------------------------
/mnist/__init__.py:
--------------------------------------------------------------------------------
 1 | # Copyright 2016 Google Inc. All Rights Reserved.
 2 | #
 3 | # Licensed under the Apache License, Version 2.0 (the "License");
 4 | # you may not use this file except in compliance with the License.
 5 | # You may obtain a copy of the License at
 6 | #
 7 | #      http://www.apache.org/licenses/LICENSE-2.0
 8 | #
 9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS-IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 


--------------------------------------------------------------------------------
/mnist/average_precision_calculator.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2016 Google Inc. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #      http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS-IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | """Calculate or keep track of the interpolated average precision.
 16 | 
 17 | It provides an interface for calculating interpolated average precision for an
 18 | entire list or the top-n ranked items. For the definition of the
 19 | (non-)interpolated average precision:
 20 | http://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf
 21 | 
 22 | Example usages:
 23 | 1) Use it as a static function call to directly calculate average precision for
 24 | a short ranked list in the memory.
 25 | 
 26 | ```
 27 | import random
 28 | 
 29 | p = np.array([random.random() for _ in xrange(10)])
 30 | a = np.array([random.choice([0, 1]) for _ in xrange(10)])
 31 | 
 32 | ap = average_precision_calculator.AveragePrecisionCalculator.ap(p, a)
 33 | ```
 34 | 
 35 | 2) Use it as an object for long ranked list that cannot be stored in memory or
 36 | the case where partial predictions can be observed at a time (Tensorflow
 37 | predictions). In this case, we first call the function accumulate many times
 38 | to process parts of the ranked list. After processing all the parts, we call
 39 | peek_interpolated_ap_at_n.
 40 | ```
 41 | p1 = np.array([random.random() for _ in xrange(5)])
 42 | a1 = np.array([random.choice([0, 1]) for _ in xrange(5)])
 43 | p2 = np.array([random.random() for _ in xrange(5)])
 44 | a2 = np.array([random.choice([0, 1]) for _ in xrange(5)])
 45 | 
 46 | # interpolated average precision at 10 using 1000 break points
 47 | calculator = average_precision_calculator.AveragePrecisionCalculator(10)
 48 | calculator.accumulate(p1, a1)
 49 | calculator.accumulate(p2, a2)
 50 | ap3 = calculator.peek_ap_at_n()
 51 | ```
 52 | """
 53 | 
 54 | import heapq
 55 | import random
 56 | import numbers
 57 | 
 58 | import numpy
 59 | 
 60 | 
 61 | class AveragePrecisionCalculator(object):
 62 |   """Calculate the average precision and average precision at n."""
 63 | 
 64 |   def __init__(self, top_n=None):
 65 |     """Construct an AveragePrecisionCalculator to calculate average precision.
 66 | 
 67 |     This class is used to calculate the average precision for a single label.
 68 | 
 69 |     Args:
 70 |       top_n: A positive Integer specifying the average precision at n, or
 71 |         None to use all provided data points.
 72 | 
 73 |     Raises:
 74 |       ValueError: An error occurred when the top_n is not a positive integer.
 75 |     """
 76 |     if not ((isinstance(top_n, int) and top_n >= 0) or top_n is None):
 77 |       raise ValueError("top_n must be a positive integer or None.")
 78 | 
 79 |     self._top_n = top_n  # average precision at n
 80 |     self._total_positives = 0  # total number of positives have seen
 81 |     self._heap = []  # max heap of (prediction, actual)
 82 | 
 83 |   @property
 84 |   def heap_size(self):
 85 |     """Gets the heap size maintained in the class."""
 86 |     return len(self._heap)
 87 | 
 88 |   @property
 89 |   def num_accumulated_positives(self):
 90 |     """Gets the number of positive samples that have been accumulated."""
 91 |     return self._total_positives
 92 | 
 93 |   def accumulate(self, predictions, actuals, num_positives=None):
 94 |     """Accumulate the predictions and their ground truth labels.
 95 | 
 96 |     After the function call, we may call peek_ap_at_n to actually calculate
 97 |     the average precision.
 98 |     Note predictions and actuals must have the same shape.
 99 | 
100 |     Args:
101 |       predictions: a list storing the prediction scores.
102 |       actuals: a list storing the ground truth labels. Any value
103 |       larger than 0 will be treated as positives, otherwise as negatives.
104 |       num_positives = If the 'predictions' and 'actuals' inputs aren't complete,
105 |       then it's possible some true positives were missed in them. In that case,
106 |       you can provide 'num_positives' in order to accurately track recall.
107 | 
108 |     Raises:
109 |       ValueError: An error occurred when the format of the input is not the
110 |       numpy 1-D array or the shape of predictions and actuals does not match.
111 |     """
112 |     if len(predictions) != len(actuals):
113 |       raise ValueError("the shape of predictions and actuals does not match.")
114 | 
115 |     if not num_positives is None:
116 |       if not isinstance(num_positives, numbers.Number) or num_positives < 0:
117 |         raise ValueError("'num_positives' was provided but it wan't a nonzero number.")
118 | 
119 |     if not num_positives is None:
120 |       self._total_positives += num_positives
121 |     else:
122 |       self._total_positives += numpy.size(numpy.where(actuals > 0))
123 |     topk = self._top_n
124 |     heap = self._heap
125 | 
126 |     for i in range(numpy.size(predictions)):
127 |       if topk is None or len(heap) < topk:
128 |         heapq.heappush(heap, (predictions[i], actuals[i]))
129 |       else:
130 |         if predictions[i] > heap[0][0]:  # heap[0] is the smallest
131 |           heapq.heappop(heap)
132 |           heapq.heappush(heap, (predictions[i], actuals[i]))
133 | 
134 |   def clear(self):
135 |     """Clear the accumulated predictions."""
136 |     self._heap = []
137 |     self._total_positives = 0
138 | 
139 |   def peek_ap_at_n(self):
140 |     """Peek the non-interpolated average precision at n.
141 | 
142 |     Returns:
143 |       The non-interpolated average precision at n (default 0).
144 |       If n is larger than the length of the ranked list,
145 |       the average precision will be returned.
146 |     """
147 |     if self.heap_size <= 0:
148 |       return 0
149 |     predlists = numpy.array(list(zip(*self._heap)))
150 | 
151 |     ap = self.ap_at_n(predlists[0],
152 |                       predlists[1],
153 |                       n=self._top_n,
154 |                       total_num_positives=self._total_positives)
155 |     return ap
156 | 
157 |   @staticmethod
158 |   def ap(predictions, actuals):
159 |     """Calculate the non-interpolated average precision.
160 | 
161 |     Args:
162 |       predictions: a numpy 1-D array storing the sparse prediction scores.
163 |       actuals: a numpy 1-D array storing the ground truth labels. Any value
164 |       larger than 0 will be treated as positives, otherwise as negatives.
165 | 
166 |     Returns:
167 |       The non-interpolated average precision at n.
168 |       If n is larger than the length of the ranked list,
169 |       the average precision will be returned.
170 | 
171 |     Raises:
172 |       ValueError: An error occurred when the format of the input is not the
173 |       numpy 1-D array or the shape of predictions and actuals does not match.
174 |     """
175 |     return AveragePrecisionCalculator.ap_at_n(predictions,
176 |                                               actuals,
177 |                                               n=None)
178 | 
179 |   @staticmethod
180 |   def ap_at_n(predictions, actuals, n=20, total_num_positives=None):
181 |     """Calculate the non-interpolated average precision.
182 | 
183 |     Args:
184 |       predictions: a numpy 1-D array storing the sparse prediction scores.
185 |       actuals: a numpy 1-D array storing the ground truth labels. Any value
186 |       larger than 0 will be treated as positives, otherwise as negatives.
187 |       n: the top n items to be considered in ap@n.
188 |       total_num_positives : (optionally) you can specify the number of total
189 |         positive
190 |       in the list. If specified, it will be used in calculation.
191 | 
192 |     Returns:
193 |       The non-interpolated average precision at n.
194 |       If n is larger than the length of the ranked list,
195 |       the average precision will be returned.
196 | 
197 |     Raises:
198 |       ValueError: An error occurred when
199 |       1) the format of the input is not the numpy 1-D array;
200 |       2) the shape of predictions and actuals does not match;
201 |       3) the input n is not a positive integer.
202 |     """
203 |     if len(predictions) != len(actuals):
204 |       raise ValueError("the shape of predictions and actuals does not match.")
205 | 
206 |     if n is not None:
207 |       if not isinstance(n, int) or n <= 0:
208 |         raise ValueError("n must be 'None' or a positive integer."
209 |                          " It was '%s'." % n)
210 | 
211 |     ap = 0.0
212 | 
213 |     predictions = numpy.array(predictions)
214 |     actuals = numpy.array(actuals)
215 | 
216 |     # add a shuffler to avoid overestimating the ap
217 |     predictions, actuals = AveragePrecisionCalculator._shuffle(predictions,
218 |                                                                actuals)
219 |     sortidx = sorted(
220 |         range(len(predictions)),
221 |         key=lambda k: predictions[k],
222 |         reverse=True)
223 | 
224 |     if total_num_positives is None:
225 |       numpos = numpy.size(numpy.where(actuals > 0))
226 |     else:
227 |       numpos = total_num_positives
228 | 
229 |     if numpos == 0:
230 |       return 0
231 | 
232 |     if n is not None:
233 |       numpos = min(numpos, n)
234 |     delta_recall = 1.0 / numpos
235 |     poscount = 0.0
236 | 
237 |     # calculate the ap
238 |     r = len(sortidx)
239 |     if n is not None:
240 |       r = min(r, n)
241 |     for i in range(r):
242 |       if actuals[sortidx[i]] > 0:
243 |         poscount += 1
244 |         ap += poscount / (i + 1) * delta_recall
245 |     return ap
246 | 
247 |   @staticmethod
248 |   def _shuffle(predictions, actuals):
249 |     random.seed(0)
250 |     suffidx = random.sample(range(len(predictions)), len(predictions))
251 |     predictions = predictions[suffidx]
252 |     actuals = actuals[suffidx]
253 |     return predictions, actuals
254 | 
255 |   @staticmethod
256 |   def _zero_one_normalize(predictions, epsilon=1e-7):
257 |     """Normalize the predictions to the range between 0.0 and 1.0.
258 | 
259 |     For some predictions like SVM predictions, we need to normalize them before
260 |     calculate the interpolated average precision. The normalization will not
261 |     change the rank in the original list and thus won't change the average
262 |     precision.
263 | 
264 |     Args:
265 |       predictions: a numpy 1-D array storing the sparse prediction scores.
266 |       epsilon: a small constant to avoid denominator being zero.
267 | 
268 |     Returns:
269 |       The normalized prediction.
270 |     """
271 |     denominator = numpy.max(predictions) - numpy.min(predictions)
272 |     ret = (predictions - numpy.min(predictions)) / numpy.max(denominator,
273 |                                                              epsilon)
274 |     return ret
275 | 


--------------------------------------------------------------------------------
/mnist/cloudml-4gpu.yaml:
--------------------------------------------------------------------------------
1 | trainingInput:
2 |   scaleTier: CUSTOM
3 |   masterType: complex_model_m_gpu
4 |   runtimeVersion: "1.2"
5 | 


--------------------------------------------------------------------------------
/mnist/cloudml-gpu-distributed.yaml:
--------------------------------------------------------------------------------
1 | trainingInput:
2 |   runtimeVersion: "1.2"
3 |   scaleTier: CUSTOM
4 |   masterType: standard_gpu
5 |   workerCount: 2
6 |   workerType: standard_gpu
7 |   parameterServerCount: 2
8 |   parameterServerType: standard
9 | 


--------------------------------------------------------------------------------
/mnist/cloudml-gpu.yaml:
--------------------------------------------------------------------------------
1 | trainingInput:
2 |   scaleTier: CUSTOM
3 |   masterType: standard_gpu
4 |   runtimeVersion: "1.2"
5 | 


--------------------------------------------------------------------------------
/mnist/convert_prediction_from_json_to_csv.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2016 Google Inc. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #      http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS-IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | """Utility to convert the output of batch prediction into a CSV submission.
 16 | 
 17 | It converts the JSON files created by the command
 18 | 'gcloud beta ml jobs submit prediction' into a CSV file ready for submission.
 19 | """
 20 | 
 21 | import json
 22 | import tensorflow as tf
 23 | 
 24 | from builtins import range
 25 | from tensorflow import app
 26 | from tensorflow import flags
 27 | from tensorflow import gfile
 28 | from tensorflow import logging
 29 | 
 30 | 
 31 | FLAGS = flags.FLAGS
 32 | 
 33 | if __name__ == '__main__':
 34 | 
 35 |   flags.DEFINE_string(
 36 |       "json_prediction_files_pattern", None,
 37 |       "Pattern specifying the list of JSON files that the command "
 38 |       "'gcloud beta ml jobs submit prediction' outputs. These files are "
 39 |       "located in the output path of the prediction command and are prefixed "
 40 |       "with 'prediction.results'.")
 41 |   flags.DEFINE_string(
 42 |       "csv_output_file", None,
 43 |       "The file to save the predictions converted to the CSV format.")
 44 | 
 45 | 
 46 | def get_csv_header():
 47 |   return "Id,Category\n"
 48 | 
 49 | def to_csv_row(json_data):
 50 | 
 51 |   video_id = json_data["video_id"]
 52 | 
 53 |   class_indexes = json_data["class_indexes"]
 54 |   predictions = json_data["predictions"]
 55 | 
 56 |   if isinstance(video_id, list):
 57 |     video_id = video_id[0]
 58 |     class_indexes = class_indexes[0]
 59 |     predictions = predictions[0]
 60 | 
 61 |   if len(class_indexes) != len(predictions):
 62 |     raise ValueError(
 63 |         "The number of indexes (%s) and predictions (%s) must be equal." 
 64 |         % (len(class_indexes), len(predictions)))
 65 | 
 66 |   return (video_id.decode('utf-8') + "," + " ".join("%i %f" % 
 67 |       (class_indexes[i], predictions[i]) 
 68 |       for i in range(len(class_indexes))) + "\n")
 69 | 
 70 | def main(unused_argv):
 71 |   logging.set_verbosity(tf.logging.INFO)
 72 | 
 73 |   if not FLAGS.json_prediction_files_pattern:
 74 |     raise ValueError(
 75 |         "The flag --json_prediction_files_pattern must be specified.")
 76 | 
 77 |   if not FLAGS.csv_output_file:
 78 |     raise ValueError("The flag --csv_output_file must be specified.")
 79 | 
 80 |   logging.info("Looking for prediction files with pattern: %s", 
 81 |                FLAGS.json_prediction_files_pattern)
 82 | 
 83 |   file_paths = gfile.Glob(FLAGS.json_prediction_files_pattern)  
 84 |   logging.info("Found files: %s", file_paths)
 85 | 
 86 |   logging.info("Writing submission file to: %s", FLAGS.csv_output_file)
 87 |   with gfile.Open(FLAGS.csv_output_file, "w+") as output_file:
 88 |     output_file.write(get_csv_header())
 89 | 
 90 |     for file_path in file_paths:
 91 |       logging.info("processing file: %s", file_path)
 92 | 
 93 |       with gfile.Open(file_path) as input_file:
 94 | 
 95 |         for line in input_file: 
 96 |           json_data = json.loads(line)
 97 |           output_file.write(to_csv_row(json_data))
 98 | 
 99 |     output_file.flush()
100 |   logging.info("done")
101 | 
102 | if __name__ == "__main__":
103 |   app.run()
104 | 


--------------------------------------------------------------------------------
/mnist/eval.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2016 Google Inc. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #      http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS-IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | """Binary for evaluating Tensorflow models on the YouTube-8M dataset."""
 15 | 
 16 | import time
 17 | 
 18 | import eval_util
 19 | import mnist_models
 20 | import losses
 21 | import readers
 22 | import tensorflow as tf
 23 | from tensorflow import app
 24 | from tensorflow import flags
 25 | from tensorflow import gfile
 26 | from tensorflow import logging
 27 | import utils
 28 | 
 29 | FLAGS = flags.FLAGS
 30 | 
 31 | if __name__ == "__main__":
 32 |   # Dataset flags.
 33 |   flags.DEFINE_string("train_dir", "/tmp/mnist_model/",
 34 |                       "The directory to load the model files from. "
 35 |                       "The tensorboard metrics files are also saved to this "
 36 |                       "directory.")
 37 |   flags.DEFINE_string(
 38 |       "eval_data_pattern", "",
 39 |       "File glob defining the evaluation dataset in tensorflow.SequenceExample "
 40 |       "format. The SequenceExamples are expected to have an 'rgb' byte array "
 41 |       "sequence feature as well as a 'labels' int64 context feature.")
 42 | 
 43 |   flags.DEFINE_string(
 44 |       "model", "LogisticModel", "Which Model to use: see mnist_models.py")
 45 | 
 46 |   flags.DEFINE_integer("batch_size", 512,
 47 |                        "How many examples to process per batch.")
 48 | 
 49 |   flags.DEFINE_string("label_loss", "CrossEntropyLoss",
 50 |                       "Loss computed on validation data")
 51 | 
 52 |   # Other flags.
 53 |   flags.DEFINE_integer("num_readers", 8,
 54 |                        "How many threads to use for reading input files.")
 55 | 
 56 |   flags.DEFINE_boolean("run_once", False, "Whether to run eval only once.")
 57 | 
 58 | 
 59 | def find_class_by_name(name, modules):
 60 |   """Searches the provided modules for the named class and returns it."""
 61 |   modules = [getattr(module, name, None) for module in modules]
 62 |   return next(a for a in modules if a)
 63 | 
 64 | 
 65 | def get_input_evaluation_tensors(reader,
 66 |                                  data_pattern,
 67 |                                  batch_size=1024,
 68 |                                  num_readers=1):
 69 |   """Creates the section of the graph which reads the evaluation data.
 70 | 
 71 |   Args:
 72 |     reader: A class which parses the training data.
 73 |     data_pattern: A 'glob' style path to the data files.
 74 |     batch_size: How many examples to process at a time.
 75 |     num_readers: How many I/O threads to use.
 76 | 
 77 |   Returns:
 78 |     A tuple containing the ids, images, and labels
 79 | 
 80 |   Raises:
 81 |     IOError: If no files matching the given pattern were found.
 82 |   """
 83 |   logging.info("Using batch size of " + str(batch_size) + " for evaluation.")
 84 |   with tf.name_scope("eval_input"):
 85 |     files = gfile.Glob(data_pattern)
 86 |     if not files:
 87 |       raise IOError("Unable to find the evaluation files.")
 88 |     logging.info("number of evaluation files: " + str(len(files)))
 89 |     filename_queue = tf.train.string_input_producer(
 90 |         files, shuffle=False, num_epochs=1)
 91 |     eval_data = [
 92 |         reader.prepare_reader(filename_queue) for _ in range(num_readers)
 93 |     ]
 94 |     return tf.train.batch_join(
 95 |         eval_data,
 96 |         batch_size=batch_size,
 97 |         capacity=3 * batch_size,
 98 |         allow_smaller_final_batch=True,
 99 |         enqueue_many=True)
100 | 
101 | 
102 | def build_graph(reader,
103 |                 model,
104 |                 eval_data_pattern,
105 |                 label_loss_fn,
106 |                 batch_size=1024,
107 |                 num_readers=1):
108 |   """Creates the Tensorflow graph for evaluation.
109 | 
110 |   Args:
111 |     reader: The data file reader. It should inherit from BaseReader.
112 |     model: The core model (e.g. logistic or neural net). It should inherit
113 |            from BaseModel.
114 |     eval_data_pattern: glob path to the evaluation data files.
115 |     label_loss_fn: What kind of loss to apply to the model. It should inherit
116 |                 from BaseLoss.
117 |     batch_size: How many examples to process at a time.
118 |     num_readers: How many threads to use for I/O operations.
119 |   """
120 | 
121 |   global_step = tf.Variable(0, trainable=False, name="global_step")
122 |   model_input_raw, labels_batch = get_input_evaluation_tensors(  # pylint: disable=g-line-too-long
123 |       reader,
124 |       eval_data_pattern,
125 |       batch_size=batch_size)
126 |   tf.summary.histogram("model_input_raw", model_input_raw)
127 | 
128 |   model_input = model_input_raw
129 | 
130 |   with tf.variable_scope("tower"):
131 |     result = model.create_model(model_input,
132 |                                 vocab_size=reader.num_classes,
133 |                                 labels=labels_batch,
134 |                                 is_training=False)
135 |     predictions = result["predictions"]
136 |     tf.summary.histogram("model_activations", predictions)
137 |     if "loss" in result.keys():
138 |       label_loss = result["loss"]
139 |     else:
140 |       label_loss = label_loss_fn.calculate_loss(predictions, labels_batch)
141 | 
142 |   tf.add_to_collection("global_step", global_step)
143 |   tf.add_to_collection("loss", label_loss)
144 |   tf.add_to_collection("predictions", predictions)
145 |   tf.add_to_collection("input_batch", model_input)
146 |   tf.add_to_collection("labels", tf.cast(labels_batch, tf.float32))
147 |   tf.add_to_collection("summary_op", tf.summary.merge_all())
148 | 
149 | 
150 | def evaluation_loop(prediction_batch, label_batch, loss,
151 |                     summary_op, saver, summary_writer, evl_metrics,
152 |                     last_global_step_val):
153 |   """Run the evaluation loop once.
154 | 
155 |   Args:
156 |     prediction_batch: a tensor of predictions mini-batch.
157 |     label_batch: a tensor of label_batch mini-batch.
158 |     loss: a tensor of loss for the examples in the mini-batch.
159 |     summary_op: a tensor which runs the tensorboard summary operations.
160 |     saver: a tensorflow saver to restore the model.
161 |     summary_writer: a tensorflow summary_writer
162 |     evl_metrics: an EvaluationMetrics object.
163 |     last_global_step_val: the global step used in the previous evaluation.
164 | 
165 |   Returns:
166 |     The global_step used in the latest model.
167 |   """
168 | 
169 |   global_step_val = -1
170 |   with tf.Session() as sess:
171 |     latest_checkpoint = tf.train.latest_checkpoint(FLAGS.train_dir)
172 |     if latest_checkpoint:
173 |       logging.info("Loading checkpoint for eval: " + latest_checkpoint)
174 |       # Restores from checkpoint
175 |       saver.restore(sess, latest_checkpoint)
176 |       # Assuming model_checkpoint_path looks something like:
177 |       # /my-favorite-path/yt8m_train/model.ckpt-0, extract global_step from it.
178 |       global_step_val = latest_checkpoint.split("/")[-1].split("-")[-1]
179 |     else:
180 |       logging.info("No checkpoint file found.")
181 |       return global_step_val
182 | 
183 |     if global_step_val == last_global_step_val:
184 |       logging.info("skip this checkpoint global_step_val=%s "
185 |                    "(same as the previous one).", global_step_val)
186 |       return global_step_val
187 | 
188 |     sess.run([tf.local_variables_initializer()])
189 | 
190 |     # Start the queue runners.
191 |     fetches = [prediction_batch, label_batch, loss, summary_op]
192 |     coord = tf.train.Coordinator()
193 |     try:
194 |       threads = []
195 |       for qr in tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS):
196 |         threads.extend(qr.create_threads(
197 |             sess, coord=coord, daemon=True,
198 |             start=True))
199 |       logging.info("enter eval_once loop global_step_val = %s. ",
200 |                    global_step_val)
201 | 
202 |       evl_metrics.clear()
203 | 
204 |       examples_processed = 0
205 |       while not coord.should_stop():
206 |         batch_start_time = time.time()
207 |         predictions_val, labels_val, loss_val, summary_val = sess.run(
208 |             fetches)
209 |         seconds_per_batch = time.time() - batch_start_time
210 |         example_per_second = labels_val.shape[0] / seconds_per_batch
211 |         examples_processed += labels_val.shape[0]
212 | 
213 |         iteration_info_dict = evl_metrics.accumulate(predictions_val,
214 |                                                      labels_val, loss_val)
215 |         iteration_info_dict["examples_per_second"] = example_per_second
216 | 
217 |         iterinfo = utils.AddGlobalStepSummary(
218 |             summary_writer,
219 |             global_step_val,
220 |             iteration_info_dict,
221 |             summary_scope="Eval")
222 |         logging.info("examples_processed: %d | %s", examples_processed,
223 |                      iterinfo)
224 | 
225 |     except tf.errors.OutOfRangeError as e:
226 |       logging.info(
227 |           "Done with batched inference. Now calculating global performance "
228 |           "metrics.")
229 |       # calculate the metrics for the entire epoch
230 |       epoch_info_dict = evl_metrics.get()
231 |       epoch_info_dict["epoch_id"] = global_step_val
232 | 
233 |       summary_writer.add_summary(summary_val, global_step_val)
234 |       epochinfo = utils.AddEpochSummary(
235 |           summary_writer,
236 |           global_step_val,
237 |           epoch_info_dict,
238 |           summary_scope="Eval")
239 |       logging.info(epochinfo)
240 |       evl_metrics.clear()
241 |     except Exception as e:  # pylint: disable=broad-except
242 |       logging.info("Unexpected exception: " + str(e))
243 |       coord.request_stop(e)
244 | 
245 |     coord.request_stop()
246 |     coord.join(threads, stop_grace_period_secs=10)
247 | 
248 |     return global_step_val
249 | 
250 | 
251 | def evaluate():
252 |   tf.set_random_seed(0)  # for reproducibility
253 |   with tf.Graph().as_default():
254 |     reader = readers.MnistReader()
255 | 
256 |     model = find_class_by_name(FLAGS.model,
257 |         [mnist_models])()
258 |     label_loss_fn = find_class_by_name(FLAGS.label_loss, [losses])()
259 | 
260 |     if FLAGS.eval_data_pattern is "":
261 |       raise IOError("'eval_data_pattern' was not specified. " +
262 |                      "Nothing to evaluate.")
263 | 
264 |     build_graph(
265 |         reader=reader,
266 |         model=model,
267 |         eval_data_pattern=FLAGS.eval_data_pattern,
268 |         label_loss_fn=label_loss_fn,
269 |         num_readers=FLAGS.num_readers,
270 |         batch_size=FLAGS.batch_size)
271 |     logging.info("built evaluation graph")
272 |     prediction_batch = tf.get_collection("predictions")[0]
273 |     label_batch = tf.get_collection("labels")[0]
274 |     loss = tf.get_collection("loss")[0]
275 |     summary_op = tf.get_collection("summary_op")[0]
276 | 
277 |     saver = tf.train.Saver(tf.global_variables())
278 |     summary_writer = tf.summary.FileWriter(
279 |         FLAGS.train_dir, graph=tf.get_default_graph())
280 | 
281 |     evl_metrics = eval_util.EvaluationMetrics(reader.num_classes, 2)
282 | 
283 |     last_global_step_val = -1
284 |     while True:
285 |       last_global_step_val = evaluation_loop(prediction_batch,
286 |                                              label_batch, loss, summary_op,
287 |                                              saver, summary_writer, evl_metrics,
288 |                                              last_global_step_val)
289 |       if FLAGS.run_once:
290 |         break
291 | 
292 | 
293 | def main(unused_argv):
294 |   logging.set_verbosity(tf.logging.INFO)
295 |   print("tensorflow version: %s" % tf.__version__)
296 |   evaluate()
297 | 
298 | 
299 | if __name__ == "__main__":
300 |   app.run()
301 | 


--------------------------------------------------------------------------------
/mnist/eval_util.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2016 Google Inc. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #      http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS-IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | """Provides functions to help with evaluating models."""
 16 | import datetime
 17 | import numpy
 18 | 
 19 | from tensorflow.python.platform import gfile
 20 | 
 21 | import mean_average_precision_calculator as map_calculator
 22 | import average_precision_calculator as ap_calculator
 23 | 
 24 | def flatten(l):
 25 |   """ Merges a list of lists into a single list. """
 26 |   return [item for sublist in l for item in sublist]
 27 | 
 28 | def calculate_hit_at_one(predictions, actuals):
 29 |   """Performs a local (numpy) calculation of the hit at one.
 30 | 
 31 |   Args:
 32 |     predictions: Matrix containing the outputs of the model.
 33 |       Dimensions are 'batch' x 'num_classes'.
 34 |     actuals: Matrix containing the ground truth labels.
 35 |       Dimensions are 'batch' x 'num_classes'.
 36 | 
 37 |   Returns:
 38 |     float: The average hit at one across the entire batch.
 39 |   """
 40 |   top_prediction = numpy.argmax(predictions, 1)
 41 |   hits = actuals[numpy.arange(actuals.shape[0]), top_prediction]
 42 |   return numpy.average(hits)
 43 | 
 44 | 
 45 | def calculate_precision_at_equal_recall_rate(predictions, actuals):
 46 |   """Performs a local (numpy) calculation of the PERR.
 47 | 
 48 |   Args:
 49 |     predictions: Matrix containing the outputs of the model.
 50 |       Dimensions are 'batch' x 'num_classes'.
 51 |     actuals: Matrix containing the ground truth labels.
 52 |       Dimensions are 'batch' x 'num_classes'.
 53 | 
 54 |   Returns:
 55 |     float: The average precision at equal recall rate across the entire batch.
 56 |   """
 57 |   aggregated_precision = 0.0
 58 |   num_videos = actuals.shape[0]
 59 |   for row in numpy.arange(num_videos):
 60 |     num_labels = int(numpy.sum(actuals[row]))
 61 |     top_indices = numpy.argpartition(predictions[row],
 62 |                                      -num_labels)[-num_labels:]
 63 |     item_precision = 0.0
 64 |     for label_index in top_indices:
 65 |       if predictions[row][label_index] > 0:
 66 |         item_precision += actuals[row][label_index]
 67 |     item_precision /= top_indices.size
 68 |     aggregated_precision += item_precision
 69 |   aggregated_precision /= num_videos
 70 |   return aggregated_precision
 71 | 
 72 | def calculate_gap(predictions, actuals, top_k=20):
 73 |   """Performs a local (numpy) calculation of the global average precision.
 74 | 
 75 |   Only the top_k predictions are taken for each of the videos.
 76 | 
 77 |   Args:
 78 |     predictions: Matrix containing the outputs of the model.
 79 |       Dimensions are 'batch' x 'num_classes'.
 80 |     actuals: Matrix containing the ground truth labels.
 81 |       Dimensions are 'batch' x 'num_classes'.
 82 |     top_k: How many predictions to use per video.
 83 | 
 84 |   Returns:
 85 |     float: The global average precision.
 86 |   """
 87 |   gap_calculator = ap_calculator.AveragePrecisionCalculator()
 88 |   sparse_predictions, sparse_labels, num_positives = top_k_by_class(predictions, actuals, top_k)
 89 |   gap_calculator.accumulate(flatten(sparse_predictions), flatten(sparse_labels), sum(num_positives))
 90 |   return gap_calculator.peek_ap_at_n()
 91 | 
 92 | 
 93 | def top_k_by_class(predictions, labels, k=20):
 94 |   """Extracts the top k predictions for each video, sorted by class.
 95 | 
 96 |   Args:
 97 |     predictions: A numpy matrix containing the outputs of the model.
 98 |       Dimensions are 'batch' x 'num_classes'.
 99 |     k: the top k non-zero entries to preserve in each prediction.
100 | 
101 |   Returns:
102 |     A tuple (predictions,labels, true_positives). 'predictions' and 'labels'
103 |     are lists of lists of floats. 'true_positives' is a list of scalars. The
104 |     length of the lists are equal to the number of classes. The entries in the
105 |     predictions variable are probability predictions, and
106 |     the corresponding entries in the labels variable are the ground truth for
107 |     those predictions. The entries in 'true_positives' are the number of true
108 |     positives for each class in the ground truth.
109 | 
110 |   Raises:
111 |     ValueError: An error occurred when the k is not a positive integer.
112 |   """
113 |   if k <= 0:
114 |     raise ValueError("k must be a positive integer.")
115 |   k = min(k, predictions.shape[1])
116 |   num_classes = predictions.shape[1]
117 |   prediction_triplets= []
118 |   for video_index in range(predictions.shape[0]):
119 |     prediction_triplets.extend(top_k_triplets(predictions[video_index],labels[video_index], k))
120 |   out_predictions = [[] for v in range(num_classes)]
121 |   out_labels = [[] for v in range(num_classes)]
122 |   for triplet in prediction_triplets:
123 |     out_predictions[triplet[0]].append(triplet[1])
124 |     out_labels[triplet[0]].append(triplet[2])
125 |   out_true_positives = [numpy.sum(labels[:,i]) for i in range(num_classes)]
126 | 
127 |   return out_predictions, out_labels, out_true_positives
128 | 
129 | def top_k_triplets(predictions, labels, k=20):
130 |   """Get the top_k for a 1-d numpy array. Returns a sparse list of tuples in
131 |   (prediction, class) format"""
132 |   m = len(predictions)
133 |   k = min(k, m)
134 |   indices = numpy.argpartition(predictions, -k)[-k:]
135 |   return [(index, predictions[index], labels[index]) for index in indices]
136 | 
137 | class EvaluationMetrics(object):
138 |   """A class to store the evaluation metrics."""
139 | 
140 |   def __init__(self, num_class, top_k):
141 |     """Construct an EvaluationMetrics object to store the evaluation metrics.
142 | 
143 |     Args:
144 |       num_class: A positive integer specifying the number of classes.
145 |       top_k: A positive integer specifying how many predictions are considered per video.
146 | 
147 |     Raises:
148 |       ValueError: An error occurred when MeanAveragePrecisionCalculator cannot
149 |         not be constructed.
150 |     """
151 |     self.sum_hit_at_one = 0.0
152 |     self.sum_perr = 0.0
153 |     self.sum_loss = 0.0
154 |     self.map_calculator = map_calculator.MeanAveragePrecisionCalculator(num_class)
155 |     self.global_ap_calculator = ap_calculator.AveragePrecisionCalculator()
156 |     self.top_k = top_k
157 |     self.num_examples = 0
158 | 
159 |   def accumulate(self, predictions, labels, loss):
160 |     """Accumulate the metrics calculated locally for this mini-batch.
161 | 
162 |     Args:
163 |       predictions: A numpy matrix containing the outputs of the model.
164 |         Dimensions are 'batch' x 'num_classes'.
165 |       labels: A numpy matrix containing the ground truth labels.
166 |         Dimensions are 'batch' x 'num_classes'.
167 |       loss: A numpy array containing the loss for each sample.
168 | 
169 |     Returns:
170 |       dictionary: A dictionary storing the metrics for the mini-batch.
171 | 
172 |     Raises:
173 |       ValueError: An error occurred when the shape of predictions and actuals
174 |         does not match.
175 |     """
176 |     batch_size = labels.shape[0]
177 |     mean_hit_at_one = calculate_hit_at_one(predictions, labels)
178 |     mean_perr = calculate_precision_at_equal_recall_rate(predictions, labels)
179 |     mean_loss = numpy.mean(loss)
180 | 
181 |     # Take the top 20 predictions.
182 |     sparse_predictions, sparse_labels, num_positives = top_k_by_class(predictions, labels, self.top_k)
183 |     self.map_calculator.accumulate(sparse_predictions, sparse_labels, num_positives)
184 |     self.global_ap_calculator.accumulate(flatten(sparse_predictions), flatten(sparse_labels), sum(num_positives))
185 | 
186 |     self.num_examples += batch_size
187 |     self.sum_hit_at_one += mean_hit_at_one * batch_size
188 |     self.sum_perr += mean_perr * batch_size
189 |     self.sum_loss += mean_loss * batch_size
190 | 
191 |     return {"hit_at_one": mean_hit_at_one, "perr": mean_perr, "loss": mean_loss}
192 | 
193 |   def get(self):
194 |     """Calculate the evaluation metrics for the whole epoch.
195 | 
196 |     Raises:
197 |       ValueError: If no examples were accumulated.
198 | 
199 |     Returns:
200 |       dictionary: a dictionary storing the evaluation metrics for the epoch. The
201 |         dictionary has the fields: avg_hit_at_one, avg_perr, avg_loss, and
202 |         aps (default nan).
203 |     """
204 |     if self.num_examples <= 0:
205 |       raise ValueError("total_sample must be positive.")
206 |     avg_hit_at_one = self.sum_hit_at_one / self.num_examples
207 |     avg_perr = self.sum_perr / self.num_examples
208 |     avg_loss = self.sum_loss / self.num_examples
209 | 
210 |     aps = self.map_calculator.peek_map_at_n()
211 |     gap = self.global_ap_calculator.peek_ap_at_n()
212 | 
213 |     epoch_info_dict = {}
214 |     return {"avg_hit_at_one": avg_hit_at_one, "avg_perr": avg_perr,
215 |             "avg_loss": avg_loss, "aps": aps, "gap": gap}
216 | 
217 |   def clear(self):
218 |     """Clear the evaluation metrics and reset the EvaluationMetrics object."""
219 |     self.sum_hit_at_one = 0.0
220 |     self.sum_perr = 0.0
221 |     self.sum_loss = 0.0
222 |     self.map_calculator.clear()
223 |     self.global_ap_calculator.clear()
224 |     self.num_examples = 0
225 | 


--------------------------------------------------------------------------------
/mnist/export_model.py:
--------------------------------------------------------------------------------
 1 | # Copyright 2016 Google Inc. All Rights Reserved.
 2 | #
 3 | # Licensed under the Apache License, Version 2.0 (the "License");
 4 | # you may not use this file except in compliance with the License.
 5 | # You may obtain a copy of the License at
 6 | #
 7 | #      http://www.apache.org/licenses/LICENSE-2.0
 8 | #
 9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS-IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | """Utilities to export a model for batch prediction."""
15 | 
16 | import tensorflow as tf
17 | import tensorflow.contrib.slim as slim
18 | 
19 | from tensorflow.python.saved_model import builder as saved_model_builder
20 | from tensorflow.python.saved_model import signature_constants
21 | from tensorflow.python.saved_model import signature_def_utils
22 | from tensorflow.python.saved_model import tag_constants
23 | from tensorflow.python.saved_model import utils as saved_model_utils
24 | 
25 | _TOP_PREDICTIONS_IN_OUTPUT = 2
26 | 
27 | class ModelExporter(object):
28 | 
29 |   def __init__(self, model, reader):
30 |     self.model = model
31 |     self.reader = reader
32 | 
33 |     with tf.Graph().as_default() as graph:
34 |       self.inputs, self.outputs = self.build_inputs_and_outputs()
35 |       self.graph = graph
36 |       self.saver = tf.train.Saver(tf.trainable_variables(), sharded=True)
37 | 
38 |   def export_model(self, model_dir, global_step_val, last_checkpoint):
39 |     """Exports the model so that it can used for batch predictions."""
40 | 
41 |     with self.graph.as_default():
42 |       with tf.Session() as session:
43 |         session.run(tf.global_variables_initializer())
44 |         self.saver.restore(session, last_checkpoint)
45 | 
46 |         signature = signature_def_utils.build_signature_def(
47 |             inputs=self.inputs,
48 |             outputs=self.outputs,
49 |             method_name=signature_constants.PREDICT_METHOD_NAME)
50 | 
51 |         signature_map = {signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
52 |                          signature}
53 | 
54 |         model_builder = saved_model_builder.SavedModelBuilder(model_dir)
55 |         model_builder.add_meta_graph_and_variables(session,
56 |             tags=[tag_constants.SERVING],
57 |             signature_def_map=signature_map,
58 |             clear_devices=True)
59 |         model_builder.save()
60 | 
61 |   def build_inputs_and_outputs(self):
62 |     serialized_examples = tf.placeholder(tf.string, shape=(None,))
63 |     index_output, predictions_output = (
64 |         self.build_prediction_graph(serialized_examples))
65 | 
66 |     inputs = {"example_bytes":
67 |               saved_model_utils.build_tensor_info(serialized_examples)}
68 | 
69 |     outputs = {
70 |         "class_indexes": saved_model_utils.build_tensor_info(index_output),
71 |         "predictions": saved_model_utils.build_tensor_info(predictions_output)}
72 | 
73 |     return inputs, outputs
74 | 
75 |   def build_prediction_graph(self, serialized_examples):
76 |     model_input_raw, labels_batch = (
77 |         self.reader.prepare_serialized_examples(serialized_examples))
78 | 
79 |     model_input = model_input_raw
80 | 
81 |     with tf.variable_scope("tower"):
82 |       result = self.model.create_model(
83 |           model_input,
84 |           num_classes=self.reader.num_classes,
85 |           labels=labels_batch,
86 |           is_training=False)
87 | 
88 |       for variable in slim.get_model_variables():
89 |         tf.summary.histogram(variable.op.name, variable)
90 | 
91 |       predictions = result["predictions"]
92 | 
93 |       prediction, index = tf.nn.top_k(predictions, 1)
94 |     return prediction, index
95 | 


--------------------------------------------------------------------------------
/mnist/inference.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2016 Google Inc. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #      http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS-IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | """Binary for generating predictions over a set of videos."""
 16 | 
 17 | import os
 18 | import time
 19 | 
 20 | import numpy
 21 | import tensorflow as tf
 22 | 
 23 | from tensorflow import app
 24 | from tensorflow import flags
 25 | from tensorflow import gfile
 26 | from tensorflow import logging
 27 | 
 28 | import eval_util
 29 | import losses
 30 | import readers
 31 | import utils
 32 | 
 33 | FLAGS = flags.FLAGS
 34 | 
 35 | if __name__ == '__main__':
 36 |   flags.DEFINE_string("train_dir", "/tmp/mnist_model/",
 37 |                       "The directory to load the model files from.")
 38 |   flags.DEFINE_string("output_file", "",
 39 |                       "The file to save the predictions to.")
 40 |   flags.DEFINE_string(
 41 |       "input_data_pattern", "",
 42 |       "File glob defining the evaluation dataset in tensorflow.SequenceExample "
 43 |       "format. The SequenceExamples are expected to have an 'rgb' byte array "
 44 |       "sequence feature as well as a 'labels' int64 context feature.")
 45 | 
 46 |   # Other flags.
 47 |   flags.DEFINE_integer("num_readers", 1,
 48 |                        "How many threads to use for reading input files.")
 49 | 
 50 | def format_lines(predictions):
 51 |   batch_size = len(predictions)
 52 |   predictions = numpy.argmax(predictions, 1)
 53 |   for video_index in range(batch_size):
 54 |     yield "%d\n" % predictions[video_index]
 55 | 
 56 | 
 57 | def get_input_data_tensors(reader, data_pattern, batch_size, num_readers=1):
 58 |   """Creates the section of the graph which reads the input data.
 59 | 
 60 |   Args:
 61 |     reader: A class which parses the input data.
 62 |     data_pattern: A 'glob' style path to the data files.
 63 |     batch_size: How many examples to process at a time.
 64 |     num_readers: How many I/O threads to use.
 65 | 
 66 |   Returns:
 67 |     A tuple containing the image id tensor, image tensor and labels tensor.
 68 | 
 69 |   Raises:
 70 |     IOError: If no files matching the given pattern were found.
 71 |   """
 72 |   with tf.name_scope("input"):
 73 |     files = gfile.Glob(data_pattern)
 74 |     if not files:
 75 |       raise IOError("Unable to find input files. data_pattern='" +
 76 |                     data_pattern + "'")
 77 |     logging.info("number of input files: " + str(len(files)))
 78 |     filename_queue = tf.train.string_input_producer(
 79 |         files, num_epochs=1, shuffle=False)
 80 |     examples_and_ids = [reader.prepare_reader(filename_queue, 1)
 81 |                            for _ in range(num_readers)]
 82 | 
 83 |     image_batch, unused = (
 84 |         tf.train.batch_join(examples_and_ids,
 85 |                             batch_size=batch_size,
 86 |                             allow_smaller_final_batch=True,
 87 |                             enqueue_many=True))
 88 |     return image_batch
 89 | 
 90 | def inference(reader, train_dir, data_pattern, out_file_location, batch_size):
 91 |   with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess, gfile.Open(out_file_location, "w+") as out_file:
 92 |     image_batch = get_input_data_tensors(reader, data_pattern, batch_size)
 93 |     latest_checkpoint = tf.train.latest_checkpoint(train_dir)
 94 |     if latest_checkpoint is None:
 95 |       raise Exception("unable to find a checkpoint at location: %s" % train_dir)
 96 |     else:
 97 |       meta_graph_location = latest_checkpoint + ".meta"
 98 |       logging.info("loading meta-graph: " + meta_graph_location)
 99 |     saver = tf.train.import_meta_graph(meta_graph_location, clear_devices=True)
100 |     logging.info("restoring variables from " + latest_checkpoint)
101 |     saver.restore(sess, latest_checkpoint)
102 |     input_tensor = tf.get_collection("input_batch_raw")[0]
103 |     predictions_tensor = tf.get_collection("predictions")[0]
104 | 
105 |     # Workaround for num_epochs issue.
106 |     def set_up_init_ops(variables):
107 |       init_op_list = []
108 |       for variable in list(variables):
109 |         if "train_input" in variable.name:
110 |           init_op_list.append(tf.assign(variable, 1))
111 |           variables.remove(variable)
112 |       init_op_list.append(tf.variables_initializer(variables))
113 |       return init_op_list
114 | 
115 |     sess.run(set_up_init_ops(tf.get_collection_ref(
116 |         tf.GraphKeys.LOCAL_VARIABLES)))
117 | 
118 |     coord = tf.train.Coordinator()
119 |     threads = tf.train.start_queue_runners(sess=sess, coord=coord)
120 |     num_examples_processed = 0
121 |     start_time = time.time()
122 |     out_file.write("Id,Category\n")
123 | 
124 |     try:
125 |       line_id = 1
126 |       while not coord.should_stop():
127 |           image_batch_val = sess.run(image_batch)
128 |           predictions_val = sess.run(predictions_tensor, feed_dict={input_tensor: image_batch_val})
129 |           now = time.time()
130 |           num_examples_processed += len(image_batch_val)
131 |           num_classes = predictions_val.shape[1]
132 |           logging.info("num examples processed: " + str(num_examples_processed) + " elapsed seconds: " + "{0:.2f}".format(now-start_time))
133 |           for line in format_lines(predictions_val):
134 |             out_file.write("%d,%s" % (line_id, line))
135 |             line_id += 1
136 |           out_file.flush()
137 | 
138 | 
139 |     except tf.errors.OutOfRangeError:
140 |         logging.info('Done with inference. The output file was written to ' + out_file_location)
141 |     finally:
142 |         coord.request_stop()
143 | 
144 |     coord.join(threads)
145 |     sess.close()
146 | 
147 | 
148 | def main(unused_argv):
149 |   logging.set_verbosity(tf.logging.INFO)
150 | 
151 |   reader = readers.MnistReader()
152 | 
153 |   if FLAGS.output_file is "":
154 |     raise ValueError("'output_file' was not specified. "
155 |       "Unable to continue with inference.")
156 | 
157 |   if FLAGS.input_data_pattern is "":
158 |     raise ValueError("'input_data_pattern' was not specified. "
159 |       "Unable to continue with inference.")
160 | 
161 |   inference(reader, FLAGS.train_dir, FLAGS.input_data_pattern,
162 |             FLAGS.output_file, 8192)
163 | 
164 | 
165 | if __name__ == "__main__":
166 |   app.run()
167 | 


--------------------------------------------------------------------------------
/mnist/losses.py:
--------------------------------------------------------------------------------
 1 | # Copyright 2016 Google Inc. All Rights Reserved.
 2 | #
 3 | # Licensed under the Apache License, Version 2.0 (the "License");
 4 | # you may not use this file except in compliance with the License.
 5 | # You may obtain a copy of the License at
 6 | #
 7 | #      http://www.apache.org/licenses/LICENSE-2.0
 8 | #
 9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS-IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 
15 | """Provides definitions for non-regularized training or test losses."""
16 | 
17 | import tensorflow as tf
18 | 
19 | 
20 | class BaseLoss(object):
21 |   """Inherit from this class when implementing new losses."""
22 | 
23 |   def calculate_loss(self, unused_predictions, unused_labels, **unused_params):
24 |     """Calculates the average loss of the examples in a mini-batch.
25 | 
26 |      Args:
27 |       unused_predictions: a 2-d tensor storing the prediction scores, in which
28 |         each row represents a sample in the mini-batch and each column
29 |         represents a class.
30 |       unused_labels: a 2-d tensor storing the labels, which has the same shape
31 |         as the unused_predictions. The labels must be in the range of 0 and 1.
32 |       unused_params: loss specific parameters.
33 | 
34 |     Returns:
35 |       A scalar loss tensor.
36 |     """
37 |     raise NotImplementedError()
38 | 
39 | 
40 | class CrossEntropyLoss(BaseLoss):
41 |   """Calculate the cross entropy loss between the predictions and labels.
42 |   """
43 | 
44 |   def calculate_loss(self, predictions, labels, **unused_params):
45 |     with tf.name_scope("loss_xent"):
46 |       epsilon = 10e-6
47 |       float_labels = tf.cast(labels, tf.float32)
48 |       cross_entropy_loss = float_labels * tf.log(predictions + epsilon) + (
49 |           1 - float_labels) * tf.log(1 - predictions + epsilon)
50 |       cross_entropy_loss = tf.negative(cross_entropy_loss)
51 |       return tf.reduce_mean(tf.reduce_sum(cross_entropy_loss, 1))
52 | 
53 | 
54 | class HingeLoss(BaseLoss):
55 |   """Calculate the hinge loss between the predictions and labels.
56 | 
57 |   Note the subgradient is used in the backpropagation, and thus the optimization
58 |   may converge slower. The predictions trained by the hinge loss are between -1
59 |   and +1.
60 |   """
61 | 
62 |   def calculate_loss(self, predictions, labels, b=1.0, **unused_params):
63 |     with tf.name_scope("loss_hinge"):
64 |       float_labels = tf.cast(labels, tf.float32)
65 |       all_zeros = tf.zeros(tf.shape(float_labels), dtype=tf.float32)
66 |       all_ones = tf.ones(tf.shape(float_labels), dtype=tf.float32)
67 |       sign_labels = tf.subtract(tf.scalar_mul(2, float_labels), all_ones)
68 |       hinge_loss = tf.maximum(
69 |           all_zeros, tf.scalar_mul(b, all_ones) - sign_labels * predictions)
70 |       return tf.reduce_mean(tf.reduce_sum(hinge_loss, 1))
71 | 
72 | 
73 | class SoftmaxLoss(BaseLoss):
74 |   """Calculate the softmax loss between the predictions and labels.
75 | 
76 |   The function calculates the loss in the following way: first we feed the
77 |   predictions to the softmax activation function and then we calculate
78 |   the minus linear dot product between the logged softmax activations and the
79 |   normalized ground truth label.
80 | 
81 |   It is an extension to the one-hot label. It allows for more than one positive
82 |   labels for each sample.
83 |   """
84 | 
85 |   def calculate_loss(self, predictions, labels, **unused_params):
86 |     with tf.name_scope("loss_softmax"):
87 |       epsilon = 10e-8
88 |       float_labels = tf.cast(labels, tf.float32)
89 |       # l1 normalization (labels are no less than 0)
90 |       label_rowsum = tf.maximum(
91 |           tf.reduce_sum(float_labels, 1, keep_dims=True),
92 |           epsilon)
93 |       norm_float_labels = tf.div(float_labels, label_rowsum)
94 |       softmax_outputs = tf.nn.softmax(predictions)
95 |       softmax_loss = tf.negative(tf.reduce_sum(
96 |           tf.multiply(norm_float_labels, tf.log(softmax_outputs)), 1))
97 |     return tf.reduce_mean(softmax_loss)
98 | 


--------------------------------------------------------------------------------
/mnist/mean_average_precision_calculator.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2016 Google Inc. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #      http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS-IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | """Calculate the mean average precision.
 16 | 
 17 | It provides an interface for calculating mean average precision
 18 | for an entire list or the top-n ranked items.
 19 | 
 20 | Example usages:
 21 | We first call the function accumulate many times to process parts of the ranked
 22 | list. After processing all the parts, we call peek_map_at_n
 23 | to calculate the mean average precision.
 24 | 
 25 | ```
 26 | import random
 27 | 
 28 | p = np.array([[random.random() for _ in xrange(50)] for _ in xrange(1000)])
 29 | a = np.array([[random.choice([0, 1]) for _ in xrange(50)]
 30 |      for _ in xrange(1000)])
 31 | 
 32 | # mean average precision for 50 classes.
 33 | calculator = mean_average_precision_calculator.MeanAveragePrecisionCalculator(
 34 |             num_class=50)
 35 | calculator.accumulate(p, a)
 36 | aps = calculator.peek_map_at_n()
 37 | ```
 38 | """
 39 | 
 40 | import numpy
 41 | import average_precision_calculator
 42 | 
 43 | 
 44 | class MeanAveragePrecisionCalculator(object):
 45 |   """This class is to calculate mean average precision.
 46 |   """
 47 | 
 48 |   def __init__(self, num_class):
 49 |     """Construct a calculator to calculate the (macro) average precision.
 50 | 
 51 |     Args:
 52 |       num_class: A positive Integer specifying the number of classes.
 53 |       top_n_array: A list of positive integers specifying the top n for each
 54 |       class. The top n in each class will be used to calculate its average
 55 |       precision at n.
 56 |       The size of the array must be num_class.
 57 | 
 58 |     Raises:
 59 |       ValueError: An error occurred when num_class is not a positive integer;
 60 |       or the top_n_array is not a list of positive integers.
 61 |     """
 62 |     if not isinstance(num_class, int) or num_class <= 1:
 63 |       raise ValueError("num_class must be a positive integer.")
 64 | 
 65 |     self._ap_calculators = []  # member of AveragePrecisionCalculator
 66 |     self._num_class = num_class  # total number of classes
 67 |     for i in range(num_class):
 68 |       self._ap_calculators.append(
 69 |           average_precision_calculator.AveragePrecisionCalculator())
 70 | 
 71 |   def accumulate(self, predictions, actuals, num_positives=None):
 72 |     """Accumulate the predictions and their ground truth labels.
 73 | 
 74 |     Args:
 75 |       predictions: A list of lists storing the prediction scores. The outer
 76 |       dimension corresponds to classes.
 77 |       actuals: A list of lists storing the ground truth labels. The dimensions
 78 |       should correspond to the predictions input. Any value
 79 |       larger than 0 will be treated as positives, otherwise as negatives.
 80 |       num_positives: If provided, it is a list of numbers representing the
 81 |       number of true positives for each class. If not provided, the number of
 82 |       true positives will be inferred from the 'actuals' array.
 83 | 
 84 |     Raises:
 85 |       ValueError: An error occurred when the shape of predictions and actuals
 86 |       does not match.
 87 |     """
 88 |     if not num_positives:
 89 |       num_positives = [None for i in predictions.shape[1]]
 90 | 
 91 |     calculators = self._ap_calculators
 92 |     for i in range(len(predictions)):
 93 |       calculators[i].accumulate(predictions[i], actuals[i], num_positives[i])
 94 | 
 95 |   def clear(self):
 96 |     for calculator in self._ap_calculators:
 97 |       calculator.clear()
 98 | 
 99 |   def is_empty(self):
100 |     return ([calculator.heap_size for calculator in self._ap_calculators] ==
101 |             [0 for _ in range(self._num_class)])
102 | 
103 |   def peek_map_at_n(self):
104 |     """Peek the non-interpolated mean average precision at n.
105 | 
106 |     Returns:
107 |       An array of non-interpolated average precision at n (default 0) for each
108 |       class.
109 |     """
110 |     aps = [self._ap_calculators[i].peek_ap_at_n()
111 |            for i in range(self._num_class)]
112 |     return aps
113 | 


--------------------------------------------------------------------------------
/mnist/mnist_models.py:
--------------------------------------------------------------------------------
 1 | # Copyright 2016 Google Inc. All Rights Reserved.
 2 | #
 3 | # Licensed under the Apache License, Version 2.0 (the "License");
 4 | # you may not use this file except in compliance with the License.
 5 | # You may obtain a copy of the License at
 6 | #
 7 | #      http://www.apache.org/licenses/LICENSE-2.0
 8 | #
 9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS-IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 
15 | """Contains model definitions."""
16 | import math
17 | 
18 | import models
19 | import tensorflow as tf
20 | import utils
21 | 
22 | import tensorflow.contrib.slim as slim
23 | 
24 | class LogisticModel(models.BaseModel):
25 |   """Logistic model with L2 regularization."""
26 | 
27 |   def create_model(self, model_input, num_classes=10, l2_penalty=1e-8, **unused_params):
28 |     """Creates a logistic model.
29 | 
30 |     Args:
31 |       model_input: 'batch' x 'num_features' matrix of input features.
32 |       vocab_size: The number of classes in the dataset.
33 | 
34 |     Returns:
35 |       A dictionary with a tensor containing the probability predictions of the
36 |       model in the 'predictions' key. The dimensions of the tensor are
37 |       batch_size x num_classes."""
38 |     output = slim.fully_connected(
39 |         model_input, num_classes, activation_fn=tf.nn.softmax,
40 |         weights_regularizer=slim.l2_regularizer(l2_penalty))
41 |     return {"predictions": output}
42 | 


--------------------------------------------------------------------------------
/mnist/models.py:
--------------------------------------------------------------------------------
 1 | # Copyright 2016 Google Inc. All Rights Reserved.
 2 | #
 3 | # Licensed under the Apache License, Version 2.0 (the "License");
 4 | # you may not use this file except in compliance with the License.
 5 | # You may obtain a copy of the License at
 6 | #
 7 | #      http://www.apache.org/licenses/LICENSE-2.0
 8 | #
 9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS-IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | 
15 | """Contains the base class for models."""
16 | 
17 | class BaseModel(object):
18 |   """Inherit from this class when implementing new models."""
19 | 
20 |   def create_model(self, unused_model_input, **unused_params):
21 |     raise NotImplementedError()
22 | 


--------------------------------------------------------------------------------
/mnist/readers.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2016 Google Inc. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #      http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS-IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | """Provides readers configured for different datasets."""
 16 | 
 17 | import tensorflow as tf
 18 | import utils
 19 | import tensorflow.contrib.slim as slim
 20 | 
 21 | from tensorflow import logging
 22 | import numpy as np
 23 | 
 24 | def resize_axis(tensor, axis, new_size, fill_value=0):
 25 |   """Truncates or pads a tensor to new_size on on a given axis.
 26 | 
 27 |   Truncate or extend tensor such that tensor.shape[axis] == new_size. If the
 28 |   size increases, the padding will be performed at the end, using fill_value.
 29 | 
 30 |   Args:
 31 |     tensor: The tensor to be resized.
 32 |     axis: An integer representing the dimension to be sliced.
 33 |     new_size: An integer or 0d tensor representing the new value for
 34 |       tensor.shape[axis].
 35 |     fill_value: Value to use to fill any new entries in the tensor. Will be
 36 |       cast to the type of tensor.
 37 | 
 38 |   Returns:
 39 |     The resized tensor.
 40 |   """
 41 |   tensor = tf.convert_to_tensor(tensor)
 42 |   shape = tf.unstack(tf.shape(tensor))
 43 | 
 44 |   pad_shape = shape[:]
 45 |   pad_shape[axis] = tf.maximum(0, new_size - shape[axis])
 46 | 
 47 |   shape[axis] = tf.minimum(shape[axis], new_size)
 48 |   shape = tf.stack(shape)
 49 | 
 50 |   resized = tf.concat([
 51 |       tf.slice(tensor, tf.zeros_like(shape), shape),
 52 |       tf.fill(tf.stack(pad_shape), tf.cast(fill_value, tensor.dtype))
 53 |   ], axis)
 54 | 
 55 |   # Update shape.
 56 |   new_shape = tensor.get_shape().as_list()  # A copy is being made.
 57 |   new_shape[axis] = new_size
 58 |   resized.set_shape(new_shape)
 59 |   return resized
 60 | 
 61 | class BaseReader(object):
 62 |   """Inherit from this class when implementing new readers."""
 63 | 
 64 |   def prepare_reader(self, unused_filename_queue):
 65 |     """Create a thread for generating prediction and label tensors."""
 66 |     raise NotImplementedError()
 67 | 
 68 | 
 69 | class MnistReader(BaseReader):
 70 |   """Reads TFRecords of pre-aggregated Examples.
 71 |   """
 72 | 
 73 |   def __init__(self,
 74 |                num_classes=10):
 75 |     self.num_classes = num_classes
 76 | 
 77 |   def prepare_reader(self, filename_queue, batch_size=1024):
 78 |     reader = tf.TFRecordReader()
 79 |     _, serialized_examples = reader.read_up_to(filename_queue, batch_size)
 80 | 
 81 |     print("start " + str(serialized_examples.get_shape().ndims))
 82 |     tf.add_to_collection("serialized_examples", serialized_examples)
 83 |     return self.prepare_serialized_examples(serialized_examples)
 84 | 
 85 |   def prepare_serialized_examples(self, serialized_examples):
 86 |     feature_map = {
 87 |         'image_raw': tf.FixedLenFeature([784], tf.int64),
 88 |         'label': tf.FixedLenFeature([], tf.int64),
 89 |     }
 90 |     features = tf.parse_example(serialized_examples, features=feature_map)
 91 | 
 92 |     images = tf.cast(features["image_raw"], tf.float32) * (1. / 255)
 93 |     labels = tf.cast(features['label'], tf.int32)
 94 | 
 95 |     def dense_to_one_hot(label_batch, num_classes):
 96 |       one_hot = tf.map_fn(lambda x : tf.cast(slim.one_hot_encoding(x, num_classes), tf.int32), label_batch)
 97 |       one_hot = tf.reshape(one_hot, [-1, num_classes])
 98 |       return one_hot
 99 | 
100 |     labels = dense_to_one_hot(labels, 10)
101 |     return images, labels
102 | 


--------------------------------------------------------------------------------
/mnist/train.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2016 Google Inc. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #      http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS-IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | """Binary for training Tensorflow models on the YouTube-8M dataset."""
 15 | 
 16 | import json
 17 | import os
 18 | import time
 19 | 
 20 | import eval_util
 21 | import export_model
 22 | import losses
 23 | import mnist_models
 24 | import readers
 25 | import tensorflow as tf
 26 | import tensorflow.contrib.slim as slim
 27 | from tensorflow import app
 28 | from tensorflow import flags
 29 | from tensorflow import gfile
 30 | from tensorflow import logging
 31 | from tensorflow.python.client import device_lib
 32 | import utils
 33 | 
 34 | FLAGS = flags.FLAGS
 35 | 
 36 | if __name__ == "__main__":
 37 |   # Dataset flags.
 38 |   flags.DEFINE_string("train_dir", "/tmp/mnist_model/",
 39 |                       "The directory to save the model files in.")
 40 |   flags.DEFINE_string(
 41 |       "train_data_pattern", "",
 42 |       "File glob for the training dataset.")
 43 | 
 44 |   flags.DEFINE_string(
 45 |       "model", "LogisticModel",
 46 |       "Which architecture to use for the model. Models are defined "
 47 |       "in models.py.")
 48 |   flags.DEFINE_bool(
 49 |       "start_new_model", False,
 50 |       "If set, this will not resume from a checkpoint and will instead create a"
 51 |       " new model instance.")
 52 | 
 53 |   # Training flags.
 54 |   flags.DEFINE_integer("batch_size", 512,
 55 |                        "How many examples to process per batch for training.")
 56 |   flags.DEFINE_string("label_loss", "CrossEntropyLoss",
 57 |                       "Which loss function to use for training the model.")
 58 |   flags.DEFINE_float(
 59 |       "regularization_penalty", 1.0,
 60 |       "How much weight to give to the regularization loss (the label loss has "
 61 |       "a weight of 1).")
 62 | 
 63 |   flags.DEFINE_float("base_learning_rate", 0.01,
 64 |                      "Which learning rate to start with.")
 65 | 
 66 |   flags.DEFINE_float("learning_rate_decay", 0.95,
 67 |                      "Learning rate decay factor to be applied every "
 68 |                      "learning_rate_decay_examples.")
 69 | 
 70 |   flags.DEFINE_float("learning_rate_decay_examples", 4000000,
 71 |                      "Multiply current learning rate by learning_rate_decay "
 72 |                      "every learning_rate_decay_examples.")
 73 | 
 74 |   flags.DEFINE_integer("num_epochs", 100,
 75 |                        "How many passes to make over the dataset before "
 76 |                        "halting training.")
 77 | 
 78 |   flags.DEFINE_integer("max_steps", None,
 79 |                        "The maximum number of iterations of the training loop.")
 80 | 
 81 |   flags.DEFINE_integer("export_model_steps", 1000,
 82 |                        "The period, in number of steps, with which the model "
 83 |                        "is exported for batch prediction.")
 84 | 
 85 |   # Other flags.
 86 |   flags.DEFINE_integer("num_readers", 1,
 87 |                        "How many threads to use for reading input files.")
 88 | 
 89 |   flags.DEFINE_string("optimizer", "AdamOptimizer",
 90 |                       "What optimizer class to use.")
 91 | 
 92 |   flags.DEFINE_float("clip_gradient_norm", 1.0, "Norm to clip gradients to.")
 93 | 
 94 |   flags.DEFINE_bool(
 95 |       "log_device_placement", False,
 96 |       "Whether to write the device on which every op will run into the "
 97 |       "logs on startup.")
 98 | 
 99 | def validate_class_name(flag_value, category, modules, expected_superclass):
100 |   """Checks that the given string matches a class of the expected type.
101 | 
102 |   Args:
103 |     flag_value: A string naming the class to instantiate.
104 |     category: A string used further describe the class in error messages
105 |               (e.g. 'model', 'reader', 'loss').
106 |     modules: A list of modules to search for the given class.
107 |     expected_superclass: A class that the given class should inherit from.
108 | 
109 |   Raises:
110 |     FlagsError: If the given class could not be found or if the first class
111 |     found with that name doesn't inherit from the expected superclass.
112 | 
113 |   Returns:
114 |     True if a class was found that matches the given constraints.
115 |   """
116 |   candidates = [getattr(module, flag_value, None) for module in modules]
117 |   for candidate in candidates:
118 |     if not candidate:
119 |       continue
120 |     if not issubclass(candidate, expected_superclass):
121 |       raise flags.FlagsError("%s '%s' doesn't inherit from %s." %
122 |                              (category, flag_value,
123 |                               expected_superclass.__name__))
124 |     return True
125 |   raise flags.FlagsError("Unable to find %s '%s'." % (category, flag_value))
126 | 
127 | def get_input_data_tensors(reader,
128 |                            data_pattern,
129 |                            batch_size=1000,
130 |                            num_epochs=None,
131 |                            num_readers=1):
132 |   """Creates the section of the graph which reads the training data.
133 | 
134 |   Args:
135 |     reader: A class which parses the training data.
136 |     data_pattern: A 'glob' style path to the data files.
137 |     batch_size: How many examples to process at a time.
138 |     num_epochs: How many passes to make over the training data. Set to 'None'
139 |                 to run indefinitely.
140 |     num_readers: How many I/O threads to use.
141 | 
142 |   Returns:
143 |     A tuple containing the ids tensor, images tensor, labels tensor.
144 | 
145 |   Raises:
146 |     IOError: If no files matching the given pattern were found.
147 |   """
148 |   logging.info("Using batch size of " + str(batch_size) + " for training.")
149 |   with tf.name_scope("train_input"):
150 |     files = gfile.Glob(data_pattern)
151 |     if not files:
152 |       raise IOError("Unable to find training files. data_pattern='" +
153 |                     data_pattern + "'.")
154 |     logging.info("Number of training files: %s.", str(len(files)))
155 |     filename_queue = tf.train.string_input_producer(
156 |         files, num_epochs=num_epochs, shuffle=True)
157 | 
158 |     training_data = [
159 |         reader.prepare_reader(filename_queue, FLAGS.batch_size) for _ in range(num_readers)
160 |     ]
161 | 
162 |     return tf.train.shuffle_batch_join(
163 |         training_data,
164 |         batch_size=batch_size,
165 |         capacity=batch_size * 5,
166 |         min_after_dequeue=batch_size,
167 |         allow_smaller_final_batch=True,
168 |         enqueue_many=True)
169 | 
170 | 
171 | def find_class_by_name(name, modules):
172 |   """Searches the provided modules for the named class and returns it."""
173 |   modules = [getattr(module, name, None) for module in modules]
174 |   return next(a for a in modules if a)
175 | 
176 | def build_graph(reader,
177 |                 model,
178 |                 train_data_pattern,
179 |                 label_loss_fn=losses.SoftmaxLoss(),
180 |                 batch_size=1000,
181 |                 base_learning_rate=0.01,
182 |                 learning_rate_decay_examples=1000000,
183 |                 learning_rate_decay=0.95,
184 |                 optimizer_class=tf.train.AdamOptimizer,
185 |                 clip_gradient_norm=1.0,
186 |                 regularization_penalty=1,
187 |                 num_readers=1,
188 |                 num_epochs=None):
189 |   """Creates the Tensorflow graph.
190 | 
191 |   This will only be called once in the life of
192 |   a training model, because after the graph is created the model will be
193 |   restored from a meta graph file rather than being recreated.
194 | 
195 |   Args:
196 |     reader: The data file reader. It should inherit from BaseReader.
197 |     model: The core model (e.g. logistic or neural net). It should inherit
198 |            from BaseModel.
199 |     train_data_pattern: glob path to the training data files.
200 |     label_loss_fn: What kind of loss to apply to the model. It should inherit
201 |                 from BaseLoss.
202 |     batch_size: How many examples to process at a time.
203 |     base_learning_rate: What learning rate to initialize the optimizer with.
204 |     optimizer_class: Which optimization algorithm to use.
205 |     clip_gradient_norm: Magnitude of the gradient to clip to.
206 |     regularization_penalty: How much weight to give the regularization loss
207 |                             compared to the label loss.
208 |     num_readers: How many threads to use for I/O operations.
209 |     num_epochs: How many passes to make over the data. 'None' means an
210 |                 unlimited number of passes.
211 |   """
212 | 
213 |   global_step = tf.Variable(0, trainable=False, name="global_step")
214 | 
215 |   local_device_protos = device_lib.list_local_devices()
216 |   gpus = [x.name for x in local_device_protos if x.device_type == 'GPU']
217 |   num_gpus = len(gpus)
218 | 
219 |   if num_gpus > 0:
220 |     logging.info("Using the following GPUs to train: " + str(gpus))
221 |     num_towers = num_gpus
222 |     device_string = '/gpu:%d'
223 |   else:
224 |     logging.info("No GPUs found. Training on CPU.")
225 |     num_towers = 1
226 |     device_string = '/cpu:%d'
227 | 
228 |   learning_rate = tf.train.exponential_decay(
229 |       base_learning_rate,
230 |       global_step * batch_size * num_towers,
231 |       learning_rate_decay_examples,
232 |       learning_rate_decay,
233 |       staircase=True)
234 |   tf.summary.scalar('learning_rate', learning_rate)
235 | 
236 |   optimizer = optimizer_class(learning_rate)
237 |   model_input_raw, labels_batch = (
238 |       get_input_data_tensors(
239 |           reader,
240 |           train_data_pattern,
241 |           batch_size=batch_size * num_towers,
242 |           num_readers=num_readers,
243 |           num_epochs=num_epochs))
244 |   tf.summary.histogram("model/input_raw", model_input_raw)
245 | 
246 |   model_input = model_input_raw
247 | 
248 |   tower_inputs = tf.split(model_input, num_towers)
249 |   tower_labels = tf.split(labels_batch, num_towers)
250 |   tower_gradients = []
251 |   tower_predictions = []
252 |   tower_label_losses = []
253 |   tower_reg_losses = []
254 |   for i in range(num_towers):
255 |     # For some reason these 'with' statements can't be combined onto the same
256 |     # line. They have to be nested.
257 |     with tf.device(device_string % i):
258 |       with (tf.variable_scope(("tower"), reuse=True if i > 0 else None)):
259 |         with (slim.arg_scope([slim.model_variable, slim.variable], device="/cpu:0" if num_gpus!=1 else "/gpu:0")):
260 |           result = model.create_model(
261 |             tower_inputs[i],
262 |             vocab_size=reader.num_classes,
263 |             labels=tower_labels[i])
264 |           for variable in slim.get_model_variables():
265 |             tf.summary.histogram(variable.op.name, variable)
266 | 
267 |           predictions = result["predictions"]
268 |           tower_predictions.append(predictions)
269 | 
270 |           if "loss" in result.keys():
271 |             label_loss = result["loss"]
272 |           else:
273 |             label_loss = label_loss_fn.calculate_loss(predictions, tower_labels[i])
274 | 
275 |           if "regularization_loss" in result.keys():
276 |             reg_loss = result["regularization_loss"]
277 |           else:
278 |             reg_loss = tf.constant(0.0)
279 | 
280 |           reg_losses = tf.losses.get_regularization_losses()
281 |           if reg_losses:
282 |             reg_loss += tf.add_n(reg_losses)
283 | 
284 |           tower_reg_losses.append(reg_loss)
285 | 
286 |           # Adds update_ops (e.g., moving average updates in batch normalization) as
287 |           # a dependency to the train_op.
288 |           update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
289 |           if "update_ops" in result.keys():
290 |             update_ops += result["update_ops"]
291 |           if update_ops:
292 |             with tf.control_dependencies(update_ops):
293 |               barrier = tf.no_op(name="gradient_barrier")
294 |               with tf.control_dependencies([barrier]):
295 |                 label_loss = tf.identity(label_loss)
296 | 
297 |           tower_label_losses.append(label_loss)
298 | 
299 |           # Incorporate the L2 weight penalties etc.
300 |           final_loss = regularization_penalty * reg_loss + label_loss
301 |           gradients = optimizer.compute_gradients(final_loss,
302 |               colocate_gradients_with_ops=False)
303 |           tower_gradients.append(gradients)
304 |   label_loss = tf.reduce_mean(tf.stack(tower_label_losses))
305 |   tf.summary.scalar("label_loss", label_loss)
306 |   if regularization_penalty != 0:
307 |     reg_loss = tf.reduce_mean(tf.stack(tower_reg_losses))
308 |     tf.summary.scalar("reg_loss", reg_loss)
309 |   merged_gradients = utils.combine_gradients(tower_gradients)
310 | 
311 |   if clip_gradient_norm > 0:
312 |     with tf.name_scope('clip_grads'):
313 |       merged_gradients = utils.clip_gradient_norms(merged_gradients, clip_gradient_norm)
314 | 
315 |   train_op = optimizer.apply_gradients(merged_gradients, global_step=global_step)
316 | 
317 |   tf.add_to_collection("global_step", global_step)
318 |   tf.add_to_collection("loss", label_loss)
319 |   tf.add_to_collection("predictions", tf.concat(tower_predictions, 0))
320 |   tf.add_to_collection("input_batch_raw", model_input_raw)
321 |   tf.add_to_collection("input_batch", model_input)
322 |   tf.add_to_collection("labels", tf.cast(labels_batch, tf.float32))
323 |   tf.add_to_collection("train_op", train_op)
324 | 
325 | 
326 | class Trainer(object):
327 |   """A Trainer to train a Tensorflow graph."""
328 | 
329 |   def __init__(self, cluster, task, train_dir, model, reader, model_exporter,
330 |                log_device_placement=True, max_steps=None,
331 |                export_model_steps=1000):
332 |     """"Creates a Trainer.
333 | 
334 |     Args:
335 |       cluster: A tf.train.ClusterSpec if the execution is distributed.
336 |         None otherwise.
337 |       task: A TaskSpec describing the job type and the task index.
338 |     """
339 | 
340 |     self.cluster = cluster
341 |     self.task = task
342 |     self.is_master = (task.type == "master" and task.index == 0)
343 |     self.train_dir = train_dir
344 |     self.config = tf.ConfigProto(
345 |         allow_soft_placement=True,log_device_placement=log_device_placement)
346 |     self.model = model
347 |     self.reader = reader
348 |     self.model_exporter = model_exporter
349 |     self.max_steps = max_steps
350 |     self.max_steps_reached = False
351 |     self.export_model_steps = export_model_steps
352 |     self.last_model_export_step = 0
353 | 
354 | #     if self.is_master and self.task.index > 0:
355 | #       raise StandardError("%s: Only one replica of master expected",
356 | #                           task_as_string(self.task))
357 | 
358 |   def run(self, start_new_model=False):
359 |     """Performs training on the currently defined Tensorflow graph.
360 | 
361 |     Returns:
362 |       A tuple of the training Hit@1 and the training PERR.
363 |     """
364 |     if self.is_master and start_new_model:
365 |       self.remove_training_directory(self.train_dir)
366 | 
367 |     target, device_fn = self.start_server_if_distributed()
368 | 
369 |     meta_filename = self.get_meta_filename(start_new_model, self.train_dir)
370 | 
371 |     with tf.Graph().as_default() as graph:
372 | 
373 |       if meta_filename:
374 |         saver = self.recover_model(meta_filename)
375 | 
376 |       with tf.device(device_fn):
377 |         if not meta_filename:
378 |           saver = self.build_model(self.model, self.reader)
379 | 
380 |         global_step = tf.get_collection("global_step")[0]
381 |         loss = tf.get_collection("loss")[0]
382 |         predictions = tf.get_collection("predictions")[0]
383 |         labels = tf.get_collection("labels")[0]
384 |         train_op = tf.get_collection("train_op")[0]
385 |         init_op = tf.global_variables_initializer()
386 | 
387 |     sv = tf.train.Supervisor(
388 |         graph,
389 |         logdir=self.train_dir,
390 |         init_op=init_op,
391 |         is_chief=self.is_master,
392 |         global_step=global_step,
393 |         save_model_secs=15 * 60,
394 |         save_summaries_secs=120,
395 |         saver=saver)
396 | 
397 |     logging.info("%s: Starting managed session.", task_as_string(self.task))
398 |     with sv.managed_session(target, config=self.config) as sess:
399 |       try:
400 |         logging.info("%s: Entering training loop.", task_as_string(self.task))
401 |         while (not sv.should_stop()) and (not self.max_steps_reached):
402 |           batch_start_time = time.time()
403 |           _, global_step_val, loss_val, predictions_val, labels_val = sess.run(
404 |               [train_op, global_step, loss, predictions, labels])
405 |           seconds_per_batch = time.time() - batch_start_time
406 |           examples_per_second = labels_val.shape[0] / seconds_per_batch
407 | 
408 |           if self.max_steps and self.max_steps <= global_step_val:
409 |             self.max_steps_reached = True
410 | 
411 |           if self.is_master and global_step_val % 10 == 0 and self.train_dir:
412 |             eval_start_time = time.time()
413 |             hit_at_one = eval_util.calculate_hit_at_one(predictions_val, labels_val)
414 |             perr = eval_util.calculate_precision_at_equal_recall_rate(predictions_val,
415 |                                                                       labels_val)
416 |             gap = eval_util.calculate_gap(predictions_val, labels_val)
417 |             eval_end_time = time.time()
418 |             eval_time = eval_end_time - eval_start_time
419 | 
420 |             logging.info("training step " + str(global_step_val) + " | Loss: " + ("%.2f" % loss_val) +
421 |               " Examples/sec: " + ("%.2f" % examples_per_second) + " | Hit@1: " +
422 |               ("%.2f" % hit_at_one) + " PERR: " + ("%.2f" % perr) +
423 |               " GAP: " + ("%.2f" % gap))
424 | 
425 |             sv.summary_writer.add_summary(
426 |                 utils.MakeSummary("model/Training_Hit@1", hit_at_one),
427 |                 global_step_val)
428 |             sv.summary_writer.add_summary(
429 |                 utils.MakeSummary("model/Training_Perr", perr), global_step_val)
430 |             sv.summary_writer.add_summary(
431 |                 utils.MakeSummary("model/Training_GAP", gap), global_step_val)
432 |             sv.summary_writer.add_summary(
433 |                 utils.MakeSummary("global_step/Examples/Second",
434 |                                   examples_per_second), global_step_val)
435 |             sv.summary_writer.flush()
436 | 
437 |             # Exporting the model every x steps
438 |             time_to_export = ((self.last_model_export_step == 0) or
439 |                 (global_step_val - self.last_model_export_step
440 |                  >= self.export_model_steps))
441 | 
442 |             if self.is_master and time_to_export:
443 |               self.export_model(global_step_val, sv.saver, sv.save_path, sess)
444 |               self.last_model_export_step = global_step_val
445 |           else:
446 |             logging.info("training step " + str(global_step_val) + " | Loss: " +
447 |               ("%.2f" % loss_val) + " Examples/sec: " + ("%.2f" % examples_per_second))
448 |       except tf.errors.OutOfRangeError:
449 |         logging.info("%s: Done training -- epoch limit reached.",
450 |                      task_as_string(self.task))
451 | 
452 |     logging.info("%s: Exited training loop.", task_as_string(self.task))
453 |     sv.Stop()
454 | 
455 |   def export_model(self, global_step_val, saver, save_path, session):
456 | 
457 |     # If the model has already been exported at this step, return.
458 |     if global_step_val == self.last_model_export_step:
459 |       return
460 | 
461 |     last_checkpoint = saver.save(session, save_path, global_step_val)
462 | 
463 |     model_dir = "{0}/export/step_{1}".format(self.train_dir, global_step_val)
464 |     logging.info("%s: Exporting the model at step %s to %s.",
465 |                  task_as_string(self.task), global_step_val, model_dir)
466 | 
467 |     self.model_exporter.export_model(
468 |         model_dir=model_dir,
469 |         global_step_val=global_step_val,
470 |         last_checkpoint=last_checkpoint)
471 | 
472 | 
473 |   def start_server_if_distributed(self):
474 |     """Starts a server if the execution is distributed."""
475 | 
476 |     if self.cluster:
477 |       logging.info("%s: Starting trainer within cluster %s.",
478 |                    task_as_string(self.task), self.cluster.as_dict())
479 |       server = start_server(self.cluster, self.task)
480 |       target = server.target
481 |       device_fn = tf.train.replica_device_setter(
482 |           ps_device="/job:ps",
483 |           worker_device="/job:%s/task:%d" % (self.task.type, self.task.index),
484 |           cluster=self.cluster)
485 |     else:
486 |       target = ""
487 |       device_fn = ""
488 |     return (target, device_fn)
489 | 
490 |   def remove_training_directory(self, train_dir):
491 |     """Removes the training directory."""
492 |     try:
493 |       logging.info(
494 |           "%s: Removing existing train directory.",
495 |           task_as_string(self.task))
496 |       gfile.DeleteRecursively(train_dir)
497 |     except:
498 |       logging.error(
499 |           "%s: Failed to delete directory " + train_dir +
500 |           " when starting a new model. Please delete it manually and" +
501 |           " try again.", task_as_string(self.task))
502 | 
503 |   def get_meta_filename(self, start_new_model, train_dir):
504 |     if start_new_model:
505 |       logging.info("%s: Flag 'start_new_model' is set. Building a new model.",
506 |                    task_as_string(self.task))
507 |       return None
508 | 
509 |     latest_checkpoint = tf.train.latest_checkpoint(train_dir)
510 |     if not latest_checkpoint:
511 |       logging.info("%s: No checkpoint file found. Building a new model.",
512 |                    task_as_string(self.task))
513 |       return None
514 | 
515 |     meta_filename = latest_checkpoint + ".meta"
516 |     if not gfile.Exists(meta_filename):
517 |       logging.info("%s: No meta graph file found. Building a new model.",
518 |                      task_as_string(self.task))
519 |       return None
520 |     else:
521 |       return meta_filename
522 | 
523 |   def recover_model(self, meta_filename):
524 |     logging.info("%s: Restoring from meta graph file %s",
525 |                  task_as_string(self.task), meta_filename)
526 |     return tf.train.import_meta_graph(meta_filename)
527 | 
528 |   def build_model(self, model, reader):
529 |     """Find the model and build the graph."""
530 | 
531 |     label_loss_fn = find_class_by_name(FLAGS.label_loss, [losses])()
532 |     optimizer_class = find_class_by_name(FLAGS.optimizer, [tf.train])
533 | 
534 |     build_graph(reader=reader,
535 |                  model=model,
536 |                  optimizer_class=optimizer_class,
537 |                  clip_gradient_norm=FLAGS.clip_gradient_norm,
538 |                  train_data_pattern=FLAGS.train_data_pattern,
539 |                  label_loss_fn=label_loss_fn,
540 |                  base_learning_rate=FLAGS.base_learning_rate,
541 |                  learning_rate_decay=FLAGS.learning_rate_decay,
542 |                  learning_rate_decay_examples=FLAGS.learning_rate_decay_examples,
543 |                  regularization_penalty=FLAGS.regularization_penalty,
544 |                  num_readers=FLAGS.num_readers,
545 |                  batch_size=FLAGS.batch_size,
546 |                  num_epochs=FLAGS.num_epochs)
547 | 
548 |     return tf.train.Saver(max_to_keep=0, keep_checkpoint_every_n_hours=0.25)
549 | 
550 | 
551 | def get_reader():
552 |   reader = readers.MnistReader()
553 |   return reader
554 | 
555 | 
556 | class ParameterServer(object):
557 |   """A parameter server to serve variables in a distributed execution."""
558 | 
559 |   def __init__(self, cluster, task):
560 |     """Creates a ParameterServer.
561 | 
562 |     Args:
563 |       cluster: A tf.train.ClusterSpec if the execution is distributed.
564 |         None otherwise.
565 |       task: A TaskSpec describing the job type and the task index.
566 |     """
567 | 
568 |     self.cluster = cluster
569 |     self.task = task
570 | 
571 |   def run(self):
572 |     """Starts the parameter server."""
573 | 
574 |     logging.info("%s: Starting parameter server within cluster %s.",
575 |                  task_as_string(self.task), self.cluster.as_dict())
576 |     server = start_server(self.cluster, self.task)
577 |     server.join()
578 | 
579 | 
580 | def start_server(cluster, task):
581 |   """Creates a Server.
582 | 
583 |   Args:
584 |     cluster: A tf.train.ClusterSpec if the execution is distributed.
585 |       None otherwise.
586 |     task: A TaskSpec describing the job type and the task index.
587 |   """
588 | 
589 |   if not task.type:
590 |     raise ValueError("%s: The task type must be specified." %
591 |                      task_as_string(task))
592 |   if task.index is None:
593 |     raise ValueError("%s: The task index must be specified." %
594 |                      task_as_string(task))
595 | 
596 |   # Create and start a server.
597 |   return tf.train.Server(
598 |       tf.train.ClusterSpec(cluster),
599 |       protocol="grpc",
600 |       job_name=task.type,
601 |       task_index=task.index)
602 | 
603 | def task_as_string(task):
604 |   return "/job:%s/task:%s" % (task.type, task.index)
605 | 
606 | def main(unused_argv):
607 |   # Load the environment.
608 |   env = json.loads(os.environ.get("TF_CONFIG", "{}"))
609 | 
610 |   # Load the cluster data from the environment.
611 |   cluster_data = env.get("cluster", None)
612 |   cluster = tf.train.ClusterSpec(cluster_data) if cluster_data else None
613 | 
614 |   # Load the task data from the environment.
615 |   task_data = env.get("task", None) or {"type": "master", "index": 0}
616 |   task = type("TaskSpec", (object,), task_data)
617 | 
618 |   # Logging the version.
619 |   logging.set_verbosity(tf.logging.INFO)
620 |   logging.info("%s: Tensorflow version: %s.",
621 |                task_as_string(task), tf.__version__)
622 | 
623 |   # Dispatch to a master, a worker, or a parameter server.
624 |   if not cluster or task.type == "master" or task.type == "worker":
625 |     model = find_class_by_name(FLAGS.model,
626 |                                [mnist_models])()
627 | 
628 |     reader = get_reader()
629 |     model_exporter = export_model.ModelExporter(
630 |         model=model,
631 |         reader=reader)
632 | 
633 |     Trainer(cluster, task, FLAGS.train_dir, model, reader, model_exporter,
634 |             FLAGS.log_device_placement, FLAGS.max_steps,
635 |             FLAGS.export_model_steps).run(start_new_model=FLAGS.start_new_model)
636 | 
637 |   elif task.type == "ps":
638 |     ParameterServer(cluster, task).run()
639 |   else:
640 |     raise ValueError("%s: Invalid task_type: %s." %
641 |                      (task_as_string(task), task.type))
642 | 
643 | if __name__ == "__main__":
644 |   app.run()
645 | 


--------------------------------------------------------------------------------
/mnist/utils.py:
--------------------------------------------------------------------------------
  1 | # Copyright 2016 Google Inc. All Rights Reserved.
  2 | #
  3 | # Licensed under the Apache License, Version 2.0 (the "License");
  4 | # you may not use this file except in compliance with the License.
  5 | # You may obtain a copy of the License at
  6 | #
  7 | #      http://www.apache.org/licenses/LICENSE-2.0
  8 | #
  9 | # Unless required by applicable law or agreed to in writing, software
 10 | # distributed under the License is distributed on an "AS-IS" BASIS,
 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12 | # See the License for the specific language governing permissions and
 13 | # limitations under the License.
 14 | 
 15 | """Contains a collection of util functions for training and evaluating.
 16 | """
 17 | 
 18 | import numpy
 19 | import tensorflow as tf
 20 | from tensorflow import logging
 21 | 
 22 | 
 23 | def Dequantize(feat_vector, max_quantized_value=2, min_quantized_value=-2):
 24 |   """Dequantize the feature from the byte format to the float format.
 25 | 
 26 |   Args:
 27 |     feat_vector: the input 1-d vector.
 28 |     max_quantized_value: the maximum of the quantized value.
 29 |     min_quantized_value: the minimum of the quantized value.
 30 | 
 31 |   Returns:
 32 |     A float vector which has the same shape as feat_vector.
 33 |   """
 34 |   assert max_quantized_value > min_quantized_value
 35 |   quantized_range = max_quantized_value - min_quantized_value
 36 |   scalar = quantized_range / 255.0
 37 |   bias = (quantized_range / 512.0) + min_quantized_value
 38 |   return feat_vector * scalar + bias
 39 | 
 40 | 
 41 | def MakeSummary(name, value):
 42 |   """Creates a tf.Summary proto with the given name and value."""
 43 |   summary = tf.Summary()
 44 |   val = summary.value.add()
 45 |   val.tag = str(name)
 46 |   val.simple_value = float(value)
 47 |   return summary
 48 | 
 49 | 
 50 | def AddGlobalStepSummary(summary_writer,
 51 |                          global_step_val,
 52 |                          global_step_info_dict,
 53 |                          summary_scope="Eval"):
 54 |   """Add the global_step summary to the Tensorboard.
 55 | 
 56 |   Args:
 57 |     summary_writer: Tensorflow summary_writer.
 58 |     global_step_val: a int value of the global step.
 59 |     global_step_info_dict: a dictionary of the evaluation metrics calculated for
 60 |       a mini-batch.
 61 |     summary_scope: Train or Eval.
 62 | 
 63 |   Returns:
 64 |     A string of this global_step summary
 65 |   """
 66 |   this_hit_at_one = global_step_info_dict["hit_at_one"]
 67 |   this_perr = global_step_info_dict["perr"]
 68 |   this_loss = global_step_info_dict["loss"]
 69 |   examples_per_second = global_step_info_dict.get("examples_per_second", -1)
 70 | 
 71 |   summary_writer.add_summary(
 72 |       MakeSummary("GlobalStep/" + summary_scope + "_Hit@1", this_hit_at_one),
 73 |       global_step_val)
 74 |   summary_writer.add_summary(
 75 |       MakeSummary("GlobalStep/" + summary_scope + "_Perr", this_perr),
 76 |       global_step_val)
 77 |   summary_writer.add_summary(
 78 |       MakeSummary("GlobalStep/" + summary_scope + "_Loss", this_loss),
 79 |       global_step_val)
 80 | 
 81 |   if examples_per_second != -1:
 82 |     summary_writer.add_summary(
 83 |         MakeSummary("GlobalStep/" + summary_scope + "_Example_Second",
 84 |                     examples_per_second), global_step_val)
 85 | 
 86 |   summary_writer.flush()
 87 |   info = ("global_step {0} | Batch Hit@1: {1:.3f} | Batch PERR: {2:.3f} | Batch Loss: {3:.3f} "
 88 |           "| Examples_per_sec: {4:.3f}").format(
 89 |               global_step_val, this_hit_at_one, this_perr, this_loss,
 90 |               examples_per_second)
 91 |   return info
 92 | 
 93 | 
 94 | def AddEpochSummary(summary_writer,
 95 |                     global_step_val,
 96 |                     epoch_info_dict,
 97 |                     summary_scope="Eval"):
 98 |   """Add the epoch summary to the Tensorboard.
 99 | 
100 |   Args:
101 |     summary_writer: Tensorflow summary_writer.
102 |     global_step_val: a int value of the global step.
103 |     epoch_info_dict: a dictionary of the evaluation metrics calculated for the
104 |       whole epoch.
105 |     summary_scope: Train or Eval.
106 | 
107 |   Returns:
108 |     A string of this global_step summary
109 |   """
110 |   epoch_id = epoch_info_dict["epoch_id"]
111 |   avg_hit_at_one = epoch_info_dict["avg_hit_at_one"]
112 |   avg_perr = epoch_info_dict["avg_perr"]
113 |   avg_loss = epoch_info_dict["avg_loss"]
114 |   aps = epoch_info_dict["aps"]
115 |   gap = epoch_info_dict["gap"]
116 |   mean_ap = numpy.mean(aps)
117 | 
118 |   summary_writer.add_summary(
119 |       MakeSummary("Epoch/" + summary_scope + "_Avg_Hit@1", avg_hit_at_one),
120 |       global_step_val)
121 |   summary_writer.add_summary(
122 |       MakeSummary("Epoch/" + summary_scope + "_Avg_Perr", avg_perr),
123 |       global_step_val)
124 |   summary_writer.add_summary(
125 |       MakeSummary("Epoch/" + summary_scope + "_Avg_Loss", avg_loss),
126 |       global_step_val)
127 |   summary_writer.add_summary(
128 |       MakeSummary("Epoch/" + summary_scope + "_MAP", mean_ap),
129 |           global_step_val)
130 |   summary_writer.add_summary(
131 |       MakeSummary("Epoch/" + summary_scope + "_GAP", gap),
132 |           global_step_val)
133 |   summary_writer.flush()
134 | 
135 |   info = ("epoch/eval number {0} | Avg_Hit@1: {1:.3f} | Avg_PERR: {2:.3f} "
136 |           "| MAP: {3:.3f} | GAP: {4:.3f} | Avg_Loss: {5:3f}").format(
137 |           epoch_id, avg_hit_at_one, avg_perr, mean_ap, gap, avg_loss)
138 |   return info
139 | 
140 | def GetListOfFeatureNamesAndSizes(feature_names, feature_sizes):
141 |   """Extract the list of feature names and the dimensionality of each feature
142 |      from string of comma separated values.
143 | 
144 |   Args:
145 |     feature_names: string containing comma separated list of feature names
146 |     feature_sizes: string containing comma separated list of feature sizes
147 | 
148 |   Returns:
149 |     List of the feature names and list of the dimensionality of each feature.
150 |     Elements in the first/second list are strings/integers.
151 |   """
152 |   list_of_feature_names = [
153 |       feature_names.strip() for feature_names in feature_names.split(',')]
154 |   list_of_feature_sizes = [
155 |       int(feature_sizes) for feature_sizes in feature_sizes.split(',')]
156 |   if len(list_of_feature_names) != len(list_of_feature_sizes):
157 |     logging.error("length of the feature names (=" +
158 |                   str(len(list_of_feature_names)) + ") != length of feature "
159 |                   "sizes (=" + str(len(list_of_feature_sizes)) + ")")
160 | 
161 |   return list_of_feature_names, list_of_feature_sizes
162 | 
163 | def clip_gradient_norms(gradients_to_variables, max_norm):
164 |   """Clips the gradients by the given value.
165 | 
166 |   Args:
167 |     gradients_to_variables: A list of gradient to variable pairs (tuples).
168 |     max_norm: the maximum norm value.
169 | 
170 |   Returns:
171 |     A list of clipped gradient to variable pairs.
172 |   """
173 |   clipped_grads_and_vars = []
174 |   for grad, var in gradients_to_variables:
175 |     if grad is not None:
176 |       if isinstance(grad, tf.IndexedSlices):
177 |         tmp = tf.clip_by_norm(grad.values, max_norm)
178 |         grad = tf.IndexedSlices(tmp, grad.indices, grad.dense_shape)
179 |       else:
180 |         grad = tf.clip_by_norm(grad, max_norm)
181 |     clipped_grads_and_vars.append((grad, var))
182 |   return clipped_grads_and_vars
183 | 
184 | def combine_gradients(tower_grads):
185 |   """Calculate the combined gradient for each shared variable across all towers.
186 | 
187 |   Note that this function provides a synchronization point across all towers.
188 | 
189 |   Args:
190 |     tower_grads: List of lists of (gradient, variable) tuples. The outer list
191 |       is over individual gradients. The inner list is over the gradient
192 |       calculation for each tower.
193 |   Returns:
194 |      List of pairs of (gradient, variable) where the gradient has been summed
195 |      across all towers.
196 |   """
197 |   filtered_grads = [[x for x in grad_list if x[0] is not None] for grad_list in tower_grads]
198 |   final_grads = []
199 |   for i in xrange(len(filtered_grads[0])):
200 |     grads = [filtered_grads[t][i] for t in xrange(len(filtered_grads))]
201 |     grad = tf.stack([x[0] for x in grads], 0)
202 |     grad = tf.reduce_sum(grad, 0)
203 |     final_grads.append((grad, filtered_grads[0][i][1],))
204 | 
205 |   return final_grads
206 | 


--------------------------------------------------------------------------------