├── puppy.jpg ├── figures ├── loss2.png ├── l2_loss2.png ├── accuracy3.png └── cross_entropy2.png ├── LICENSE-MIT ├── Contributing.md ├── preprocess.py ├── train.py ├── model.py ├── recipe_resnet50_imagenet.md ├── Code-Of-Conduct.md ├── README.md └── inference.ipynb /puppy.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yahoo/ml-reproducibility-guidelines/master/puppy.jpg -------------------------------------------------------------------------------- /figures/loss2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yahoo/ml-reproducibility-guidelines/master/figures/loss2.png -------------------------------------------------------------------------------- /figures/l2_loss2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yahoo/ml-reproducibility-guidelines/master/figures/l2_loss2.png -------------------------------------------------------------------------------- /figures/accuracy3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yahoo/ml-reproducibility-guidelines/master/figures/accuracy3.png -------------------------------------------------------------------------------- /figures/cross_entropy2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yahoo/ml-reproducibility-guidelines/master/figures/cross_entropy2.png -------------------------------------------------------------------------------- /LICENSE-MIT: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright 2018 Oath Inc. 
4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 6 | 7 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /Contributing.md: -------------------------------------------------------------------------------- 1 | # How to contribute 2 | First, thanks for taking the time to contribute to our project! The following information provides a guide for making contributions. 3 | 4 | ## Code of Conduct 5 | 6 | By participating in this project, you agree to abide by the [Oath Code of Conduct](Code-of-Conduct.md). Everyone is welcome to submit a pull request or open an issue to improve the documentation, add improvements, or report bugs. 7 | 8 | ## How to Ask a Question 9 | 10 | If you simply have a question that needs an answer, [create an issue](https://help.github.com/articles/creating-an-issue/), and label it as a question. 
11 | 12 | ## How To Contribute 13 | 14 | ### Report a Bug or Request a Feature 15 | 16 | If you encounter any bugs while using this software, or want to request a new feature or enhancement, feel free to [create an issue](https://help.github.com/articles/creating-an-issue/) to report it. Make sure you add a label to indicate what type of issue it is. 17 | 18 | ### Contribute Code 19 | - The following line must be included in your pull request: 20 | > I confirm that this contribution is made under the terms of the license found in the root directory of this repository's source tree and that I have the authority necessary to make this contribution on behalf of its copyright owner. 21 | 22 | Pull requests are welcome for bug fixes. If you want to implement something new, please [request a feature first](#report-a-bug-or-request-a-feature) so we can discuss it. 23 | 24 | #### Creating a Pull Request 25 | Please follow [best practices](https://github.com/trein/dev-best-practices/wiki/Git-Commit-Best-Practices) for creating git commits. 26 | 27 | When your code is ready to be submitted, you can [submit a pull request](https://help.github.com/articles/creating-a-pull-request/) to begin the code review process. 28 | -------------------------------------------------------------------------------- /preprocess.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Oath Inc. 2 | # Licensed under the terms of the MIT license. Please see LICENSE file in project root for terms. 
3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | 8 | import tensorflow as tf 9 | 10 | _NUM_CHANNELS = 3 11 | _DEFAULT_IMAGE_SIZE = 224 12 | _RESIZE_SIDE_MIN = 256 13 | 14 | 15 | def _parse_function(example_proto, output_height, output_width, is_training=False, 16 | resize_side_min=_RESIZE_SIDE_MIN): 17 | """Parse TF Records and preprocess image and labels 18 | """ 19 | 20 | # parse the TFRecord 21 | feature_map = { 22 | "image/encoded": tf.FixedLenFeature((), tf.string, default_value=""), 23 | "image/format": tf.FixedLenFeature((), tf.string, default_value=""), 24 | "image/class/label": tf.VarLenFeature(tf.int64), 25 | "image/height": tf.FixedLenFeature((), tf.int64, default_value=0), 26 | "image/width": tf.FixedLenFeature((), tf.int64, default_value=0), 27 | "image/object/bbox/xmin": tf.VarLenFeature(tf.float32), 28 | "image/object/bbox/ymin": tf.VarLenFeature(tf.float32), 29 | "image/object/bbox/xmax": tf.VarLenFeature(tf.float32), 30 | "image/object/bbox/ymax": tf.VarLenFeature(tf.float32), 31 | } 32 | features = tf.parse_single_example(example_proto, feature_map) 33 | 34 | # parse bounding box 35 | xmin = tf.expand_dims(features['image/object/bbox/xmin'].values, 0) 36 | ymin = tf.expand_dims(features['image/object/bbox/ymin'].values, 0) 37 | xmax = tf.expand_dims(features['image/object/bbox/xmax'].values, 0) 38 | ymax = tf.expand_dims(features['image/object/bbox/ymax'].values, 0) 39 | 40 | bbox = tf.concat([ymin, xmin, ymax, xmax], 0) 41 | bbox = tf.expand_dims(bbox, 0) 42 | bbox = tf.transpose(bbox, [0, 2, 1]) 43 | 44 | # ImageNet preprocessing 45 | from official.resnet import imagenet_preprocessing 46 | image = imagenet_preprocessing.preprocess_image( 47 | image_buffer=features["image/encoded"], 48 | bbox=bbox, 49 | output_height=output_height, 50 | output_width=output_width, 51 | num_channels=_NUM_CHANNELS, 52 | is_training=is_training) 53 | 54 | # convert labels to dense 55 | 
labels = tf.sparse_tensor_to_dense(features["image/class/label"]) 56 | 57 | return image, labels 58 | 59 | 60 | def train_parse_function(example_proto): 61 | return _parse_function(example_proto, 62 | _DEFAULT_IMAGE_SIZE, 63 | _DEFAULT_IMAGE_SIZE, 64 | is_training=True) 65 | 66 | 67 | def eval_parse_function(example_proto): 68 | return _parse_function(example_proto, 69 | _DEFAULT_IMAGE_SIZE, 70 | _DEFAULT_IMAGE_SIZE, 71 | is_training=False) 72 | 73 | 74 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | # Copyright 2018 Oath Inc. 2 | # Licensed under the terms of the MIT license. Please see LICENSE file in project root for terms. 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | 8 | import os, sys 9 | from glob import glob 10 | 11 | import tensorflow as tf 12 | 13 | tf.logging.set_verbosity(tf.logging.INFO) 14 | 15 | tf.app.flags.DEFINE_string('dataset_directory', '/tmp/', 16 | 'Dataset directory') 17 | tf.app.flags.DEFINE_string('output_directory', '/tmp/', 18 | 'Output data directory') 19 | tf.app.flags.DEFINE_string('models_repository', '/tmp', 20 | 'Path to tensorflow/models repository') 21 | tf.app.flags.DEFINE_integer('num_gpus', 4, 22 | 'Number of GPU for distributed training') 23 | tf.app.flags.DEFINE_integer('batch_size', 256, 24 | 'Total batch size') 25 | 26 | FLAGS = tf.app.flags.FLAGS 27 | 28 | 29 | from model import model_fn 30 | 31 | 32 | def dataset_input_fn(filenames, batch_size, is_training): 33 | """ 34 | """ 35 | from preprocess import train_parse_function, eval_parse_function 36 | 37 | dataset = tf.data.TFRecordDataset(filenames) 38 | parse_function = train_parse_function if is_training else eval_parse_function 39 | dataset = dataset.map(parse_function, num_parallel_calls=32) 40 | dataset = dataset.batch(batch_size) 41 | dataset = 
dataset.prefetch(256) 42 | if is_training is True: 43 | dataset = dataset.repeat() 44 | 45 | return dataset 46 | 47 | 48 | 49 | def main(argv): 50 | """ 51 | """ 52 | 53 | # add path to tensorflow/models repository 54 | sys.path.append(FLAGS.models_repository) 55 | 56 | distribution_strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=FLAGS.num_gpus) 57 | 58 | run_config = tf.estimator.RunConfig( 59 | train_distribute=distribution_strategy) 60 | 61 | # learning rate schedule 62 | lr_boundaries = [500000, 1000000] 63 | lr_values = [0.01, 0.001, 0.0001] 64 | 65 | estimator = tf.estimator.Estimator( 66 | model_fn=model_fn, 67 | model_dir=FLAGS.output_directory, 68 | params={'weight_decay': 1e-4, 69 | 'lr_boundaries': lr_boundaries, 70 | 'lr_values': lr_values}, 71 | config=run_config) 72 | 73 | batch_size_per_gpu = FLAGS.batch_size // FLAGS.num_gpus 74 | 75 | def train_input_fn(): 76 | train_filenames = glob(os.path.join(FLAGS.dataset_directory, "train*")) 77 | return dataset_input_fn(train_filenames, batch_size_per_gpu, True) 78 | 79 | def eval_input_fn(): 80 | eval_filenames = glob(os.path.join(FLAGS.dataset_directory, "validation*")) 81 | return dataset_input_fn(eval_filenames, 50, False) 82 | 83 | train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=2000000) 84 | eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn, steps=2000, 85 | start_delay_secs=600, 86 | throttle_secs=1800) 87 | 88 | tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec) 89 | 90 | if __name__ == "__main__": 91 | tf.app.run() -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | from __future__ import division 3 | from __future__ import print_function 4 | 5 | import tensorflow as tf 6 | 7 | 8 | def model_fn(features, labels, mode, params): 9 | """ 10 | """ 11 | from official.resnet import 
resnet_model 12 | 13 | tf.summary.image('images', features, max_outputs=6) 14 | 15 | features = tf.cast(features, tf.float32) 16 | 17 | model = resnet_model.Model( 18 | resnet_size=50, 19 | bottleneck=True, 20 | num_classes=1000, 21 | num_filters=64, 22 | kernel_size=7, 23 | conv_stride=2, 24 | first_pool_size=3, 25 | first_pool_stride=2, 26 | block_sizes=[3, 4, 6, 3], 27 | block_strides=[1, 2, 2, 2], 28 | final_size=2048, 29 | resnet_version=2, 30 | data_format=None, 31 | dtype=tf.float32) 32 | 33 | logits = model(features, mode == tf.estimator.ModeKeys.TRAIN) 34 | logits = tf.cast(logits, tf.float32) 35 | 36 | predictions = { 37 | 'classes': tf.argmax(logits, axis=1) + 1, 38 | 'probabilities': tf.nn.softmax(logits, name='softmax_tensor') 39 | } 40 | 41 | if mode == tf.estimator.ModeKeys.PREDICT: 42 | return tf.estimator.EstimatorSpec( 43 | mode=mode, 44 | predictions=predictions, 45 | export_outputs={ 46 | 'predict': tf.estimator.export.PredictOutput(predictions) 47 | }) 48 | 49 | # calculate loss 50 | onehot_labels = tf.one_hot(tf.squeeze(labels - 1), 1000) 51 | cross_entropy = tf.losses.softmax_cross_entropy( 52 | onehot_labels=onehot_labels, logits=logits) 53 | 54 | tf.identity(cross_entropy, name='cross_entropy') 55 | tf.summary.scalar('cross_entropy', cross_entropy) 56 | 57 | def exclude_batch_norm(name): 58 | return 'batch_normalization' not in name 59 | loss_filter_fn = exclude_batch_norm 60 | 61 | # add weight decay 62 | l2_loss = params['weight_decay'] * tf.add_n( 63 | [tf.nn.l2_loss(tf.cast(v, tf.float32)) for v in tf.trainable_variables() 64 | if loss_filter_fn(v.name)]) 65 | tf.summary.scalar('l2_loss', l2_loss) 66 | 67 | loss = cross_entropy + l2_loss 68 | 69 | if mode == tf.estimator.ModeKeys.TRAIN: 70 | 71 | global_step = tf.train.get_or_create_global_step() 72 | 73 | learning_rate = tf.train.piecewise_constant(global_step, 74 | params['lr_boundaries'], params['lr_values']) 75 | 76 | tf.identity(learning_rate, name='learning_rate') 77 | 
tf.summary.scalar('learning_rate', learning_rate) 78 | 79 | optimizer = tf.train.MomentumOptimizer( 80 | learning_rate=learning_rate, 81 | momentum=0.9) 82 | 83 | minimize_op = optimizer.minimize(loss, global_step) 84 | 85 | update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) 86 | train_op = tf.group(minimize_op, update_ops) 87 | 88 | eval_metric_ops = None 89 | 90 | else: 91 | 92 | train_op = None 93 | 94 | accuracy = tf.metrics.accuracy(labels, predictions['classes']) 95 | 96 | tf.identity(accuracy[1], name='train_accuracy') 97 | tf.summary.scalar('train_accuracy', accuracy[1]) 98 | 99 | eval_metric_ops = {'accuracy': accuracy} 100 | 101 | 102 | return tf.estimator.EstimatorSpec( 103 | mode=mode, 104 | predictions=predictions, 105 | loss=loss, 106 | train_op=train_op, 107 | eval_metric_ops=eval_metric_ops) -------------------------------------------------------------------------------- /recipe_resnet50_imagenet.md: -------------------------------------------------------------------------------- 1 | # Train ResNet-50 on ImageNet 2 | 3 | We provide below the recipe to train a ResNet-50 [model](https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html) on the ImageNet dataset and achieve 74.05% single crop accuracy on the validation set. This recipe was based on models that were previously trained in [TensorFlow](https://github.com/tensorflow/models/tree/master/official/resnet) and [Caffe](https://github.com/KaimingHe/deep-residual-networks). 4 | 5 | ## Data 6 | 7 | ### Raw data 8 | 9 | #### Description 10 | 11 | The ImageNet dataset contains ~14M images that are classified into ~21K categories from the WordNet hierarchy. The classification task in the ImageNet Large Scale Visual Recognition Competition (ILSVRC) uses a subset of 1000 categories with ~1M associated images for training. Each image has a single label associated, and a bounding box with the object location may be available. 
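The training code in `model.py` subtracts 1 from each label before one-hot encoding and adds 1 to the argmax of the logits, which suggests that the labels stored in the TFRecord files are 1-based (1 to 1000). The sketch below illustrates that convention in plain Python. The helper names are hypothetical and not part of this repository, and the 1-based range is an assumption inferred from `model.py`.

```python
# Minimal sketch of the 1-based label convention implied by model.py.
# All names here are hypothetical; only the offset of 1 comes from the source.
NUM_CLASSES = 1000


def label_to_index(label):
    """Map a 1-based dataset label (1..NUM_CLASSES) to a 0-based logit index."""
    if not 1 <= label <= NUM_CLASSES:
        raise ValueError("label must be in [1, %d], got %d" % (NUM_CLASSES, label))
    return label - 1


def index_to_label(index):
    """Map a 0-based logit index (e.g. an argmax) back to a 1-based label."""
    if not 0 <= index < NUM_CLASSES:
        raise ValueError("index must be in [0, %d), got %d" % (NUM_CLASSES, index))
    return index + 1


def one_hot(label):
    """Build the one-hot target vector for a 1-based label."""
    vector = [0.0] * NUM_CLASSES
    vector[label_to_index(label)] = 1.0
    return vector
```

Keeping the two conversions side by side makes it easier to check that predictions and ground-truth labels are compared in the same index space.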
12 | 13 | #### Path 14 | Information on how to download ImageNet can be found on the dataset [website](http://image-net.org/). You need to sign up for an account in order to access the data. 15 | 16 | ### Data Processing Methods 17 | 18 | #### Description 19 | 20 | Images are downloaded and processed into the TFRecord format. The training data is sharded into 1024 training files with names `train-0????-of-01024` and 128 validation files with names `validation-0????-of-00128`. The classes are randomly distributed within each file. Data processing may take up to half a day. 21 | 22 | #### Code 23 | We follow the data processing methods from the Inception [guide](https://github.com/tensorflow/models/tree/master/research/inception#getting-started). We ran the code using commit `bd56a06d2815e3dc2914fc8ce2fa1ea60e19d324`. 24 | 25 | #### Path 26 | The processed data can be found in HDFS at `hdfs:/projects/ml/data/imagenet_bbox`. Contact the author for permissions and access. 27 | 28 | ## Training 29 | 30 | ### Optimization Methods 31 | 32 | #### Description 33 | We use the ResNet-50 architecture and the softmax loss. 34 | 35 | #### Code 36 | We ran our experiments using git commit `24b236fdb6ca449eb52e8f816e039e0d634015a5`. The code is written using TensorFlow and uses its high level [APIs](https://www.tensorflow.org/guide/estimators). After cloning the repository you can train the model by running 37 | 38 | ``` 39 | python train.py \ 40 | --dataset_directory=/path/to/data \ 41 | --output_directory=/path/to/output \ 42 | --models_repository=/path/to/repo 43 | ``` 44 | 45 | where 46 | 47 | * `dataset_directory` points to the training data. We expect a directory with the training and validation files with patterns `train-0????-of-01024` and `validation-0????-of-00128`. 48 | * `output_directory` is the directory where model checkpoints and logs are written. 
49 | * `models_repository` points to the directory where you have cloned the TensorFlow models [repository](https://github.com/tensorflow/models), which contains the code to build the ResNet architecture. We use version `v1.10.0`. 50 | 51 | You can visualize the training by running TensorBoard `tensorboard --logdir /path/to/outputs`, where `logdir` points to the parent directory of `output_directory`. 52 | 53 | #### Hyperparameters 54 | 55 | We use the following hyperparameters: 56 | 57 | * Optimizer 58 | * Stochastic gradient descent (SGD) with momentum 0.9 59 | * Synchronized SGD across 4 GPUs 60 | * Batch size: 256, i.e. 64 per GPU 61 | * Learning rate starts at 0.1, decayed by a factor of 10 after 150000, 300000, 400000, and 450000 iterations 62 | * Regularization 63 | * Weight decay: 0.0001 64 | * Data augmentation 65 | * crop a distorted bounding box around the object (see [code](preprocess.py\#L59-L66) for details). 66 | * randomly flip the image left to right 67 | * resize the image to 224 x 224 x 3 68 | * subtract mean value in R, G, B channels (see [code](preprocess.py\#L59-L66) for values). 69 | 70 | #### Dynamics 71 | Training takes ~24 hours. The figures below show the evolution of the total, cross-entropy, and l2 losses. 72 | 73 | ![loss](figures/loss2.png) 74 | ![cross_entropy_loss](figures/cross_entropy2.png) 75 | ![l2_loss](figures/l2_loss2.png) 76 | 77 | #### Outputs 78 | 79 | The model checkpoints and TensorFlow logs can be found on HDFS at `hdfs://projects/ml/recipe/resnet50_imagenet`. 80 | 81 | ### Performance metrics 82 | 83 | #### Target metrics 84 | 85 | During evaluation we resize the image such that the minimum edge is 256 pixels (`_RESIZE_SIDE_MIN` in `preprocess.py`), and run inference using the 224 x 224 x 3 center crop. The top-1 single crop accuracy is 74.0% after 500K iterations. 86 | 87 | #### Code 88 | 89 | The accuracy is computed by the training script automatically. It can be read using TensorBoard as shown below. 
90 | 91 | ![accuracy_validation_set](figures/accuracy3.png) 92 | 93 | ## Inference 94 | 95 | #### Code 96 | We provide a [notebook](inference.ipynb) that provides an example of how to run inference on sample images. 97 | 98 | #### Timing information 99 | We measure the model latency using the inference notebook. We used a p3.2xlarge instance on AWS which has an NVIDIA GPU V100. The latency including preprocessing is 11ms with a batch size of 1. Note that the inference code doesn't run on a CPU because the channel ordering is NCHW and the max pooling operation only supports NHWC on a CPU device. We use TensorFlow v1.8. 100 | 101 | #### Serialization 102 | Not applicable. The model was trained for research purposes and is not deployed to production. 103 | 104 | ## Miscellaneous 105 | 106 | * Performance is 2\% below published results. We plan on running the training code from the TensorFlow models repository to determine whether there is a discrepancy in the code or in the training data 107 | * Using 8 GPUs did not increase the throughput. It is worth determining the bottleneck in the data pipeline in order to further parallelize the code and decrease the training time. The Stanford DAWN Deep Learning [Benchmark](https://dawn.cs.stanford.edu/benchmark/) provides relevant pointers regarding state-of-the-art training times for ImageNet. 108 | -------------------------------------------------------------------------------- /Code-Of-Conduct.md: -------------------------------------------------------------------------------- 1 | # Oath Open Source Code of Conduct 2 | 3 | ## Summary 4 | This Code of Conduct is our way to encourage good behavior and discourage bad behavior in our open source community. We invite participation from many people to bring different perspectives to support this project. We pledge to do our part to foster a welcoming and professional environment free of harassment. 
We expect participants to communicate professionally and thoughtfully during their involvement with this project. 5 | 6 | Participants may lose their good standing by engaging in misconduct. For example: insulting, threatening, or conveying unwelcome sexual content. We ask participants who observe conduct issues to report the incident directly to the project's Response Team at opensource-conduct@oath.com. Oath will assign a respondent to address the issue. We may remove harassers from this project. 7 | 8 | This code does not replace the terms of service or acceptable use policies of the websites used to support this project. We acknowledge that participants may be subject to additional conduct terms based on their employment which may govern their online expressions. 9 | 10 | ## Details 11 | This Code of Conduct makes our expectations of participants in this community explicit. 12 | * We forbid harassment and abusive speech within this community. 13 | * We request participants to report misconduct to the project’s Response Team. 14 | * We urge participants to refrain from using discussion forums to play out a fight. 15 | 16 | ### Expected Behaviors 17 | We expect participants in this community to conduct themselves professionally. Since our primary mode of communication is text on an online forum (e.g. issues, pull requests, comments, emails, or chats) devoid of vocal tone, gestures, or other context that is often vital to understanding, it is important that participants are attentive to their interaction style. 18 | 19 | * **Assume positive intent.** We ask community members to assume positive intent on the part of other people’s communications. We may disagree on details, but we expect all suggestions to be supportive of the community goals. 20 | * **Respect participants.** We expect participants will occasionally disagree. Even if we reject an idea, we welcome everyone’s participation. Open Source projects are learning experiences. 
Ask, explore, challenge, and then respectfully assert if you agree or disagree. If your idea is rejected, be more persuasive, not bitter. 21 | * **Welcoming to new members.** New members bring new perspectives. Some may raise questions that have been addressed before. Kindly point them to existing discussions. Everyone is new to every project once. 22 | * **Be kind to beginners.** Beginners use open source projects to get experience. They might not be talented coders yet, and projects should not accept poor quality code. But we were all beginners once, and we need to engage kindly. 23 | * **Consider your impact on others.** Your work will be used by others, and you depend on the work of others. We expect community members to be considerate and balance their self-interest with communal interest. 24 | * **Use words carefully.** We may not understand intent when you say something ironic. Poe’s Law suggests that without an emoticon people will misinterpret sarcasm. We ask community members to communicate plainly. 25 | * **Leave with class.** When you wish to resign from participating in this project for any reason, you are free to fork the code and create a competitive project. Open Source explicitly allows this. Your exit should not be dramatic or bitter. 26 | 27 | ### Unacceptable Behaviors 28 | Participants remain in good standing when they do not engage in misconduct or harassment. To elaborate: 29 | * **Don't be a bigot.** Calling out project members by their identity or background in a negative or insulting manner. This includes, but is not limited to, slurs or insinuations related to protected or suspect classes e.g. race, color, citizenship, national origin, political belief, religion, sexual orientation, gender identity and expression, age, size, culture, ethnicity, genetic features, language, profession, national minority status, mental or physical ability. 30 | * **Don't insult.** Insulting remarks about a person’s lifestyle practices. 
31 | * **Don't dox.** Revealing private information about other participants without explicit permission. 32 | * **Don't intimidate.** Threats of violence or intimidation of any project member. 33 | * **Don't creep.** Unwanted sexual attention or content unsuited for the subject of this project. 34 | * **Don't disrupt.** Sustained disruptions in a discussion. 35 | * **Let us help.** Refusal to assist the Response Team to resolve an issue in the community. 36 | 37 | We do not list all forms of harassment, nor imply some forms of harassment are not worthy of action. Any participant who *feels* harassed or *observes* harassment should report the incident. Victims of harassment should not address grievances in the public forum, as this often intensifies the problem. Report it, and let us address it off-line. 38 | 39 | ### Reporting Issues 40 | If you experience or witness misconduct, or have any other concerns about the conduct of members of this project, please report it by contacting our Response Team at opensource-conduct@oath.com who will handle your report with discretion. Your report should include: 41 | * Your preferred contact information. We cannot process anonymous reports. 42 | * Names (real or usernames) of those involved in the incident. 43 | * Your account of what occurred, and if the incident is ongoing. Please provide links to or transcripts of the publicly available records (e.g. a mailing list archive or a public IRC logger), so that we can review it. 44 | * Any additional information that may be helpful to achieve resolution. 45 | 46 | After filing a report, a representative will contact you directly to review the incident and ask additional questions. If a member of the Oath Response Team is named in an incident report, that member will be recused from handling your incident. If the complaint originates from a member of the Response Team, it will be addressed by a different member of the Response Team. 
We will consider reports to be confidential for the purpose of protecting victims of abuse. 47 | 48 | ### Scope 49 | Oath will assign a Response Team member with admin rights on the project and legal rights on the project copyright. The Response Team is empowered to restrict some privileges to the project as needed. Since this project is governed by an open source license, any participant may fork the code under the terms of the project license. The Response Team’s goal is to preserve the project if possible, and will restrict or remove participation from those who disrupt the project. 50 | 51 | This code does not replace the terms of service or acceptable use policies that are provided by the websites used to support this community. Nor does this code apply to communications or actions that take place outside of the context of this community. Many participants in this project are also subject to codes of conduct based on their employment. This code is a social-contract that informs participants of our social expectations. It is not a terms of service or legal contract. 52 | 53 | ## License and Acknowledgment. 54 | This text is shared under the [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/). This code is based on a study conducted by the [TODO Group](https://todogroup.org/) of many codes used in the open source community. If you have feedback about this code, contact our Response Team at the address listed above. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Training Recipes for Reproducible Machine Learning Models 2 | 3 | > Technical documentation guidelines to improve the reproducibility of machine learning models. 4 | 5 | Did you build a machine learning model and want to make sure you're not the only one who knows how to train it? 
This repository contains guidelines for how to write technical documentation that makes it easier for others to reproduce the results of your model. 6 | 7 | ## Table of Contents 8 | - [Background](#background) 9 | - [Usage](#usage) 10 | - [Data](#data) 11 | - [Training](#training) 12 | - [Contribute](#contribute) 13 | - [License](#license) 14 | 15 | ## Background 16 | 17 | Reproducing the results of a machine learning model is notoriously difficult. This is particularly true in the deep learning community, where high-dimensional, over-parameterized non-convex optimization problems require multiple heuristics to converge to local minima with good performance. Hence, if the optimization or the source of training data is not properly documented, it can take considerable experimentation to achieve the expected results. 18 | 19 | We propose the training recipe, a technical document whose aim is to provide sufficient information for a researcher to train a model and achieve a target performance without requiring any external help. We recommend writing training recipes for models that are deployed to production, models that are published, or models that reproduce results from academic papers. As we are well aware that writing technical documentation can be cumbersome, we provide simple guidelines that make it possible to produce the recipe in a timely fashion. We recommend writing the recipe as a markdown document and reviewing it with a GitHub pull request. 20 | 21 | ## Usage 22 | 23 | Here are the components of the training recipe: 24 | - Data 25 | - Raw Data 26 | - Data Processing Methods 27 | - Training 28 | - Optimization Methods 29 | - Performance Metrics 30 | - Inference 31 | 32 | The next section will provide a checklist of critical items for each of these and will explain the rationale for their importance. 33 | 34 | ### Data 35 | 36 | #### Raw Data 37 | 38 | **Description** 39 | 40 | Describe the data, labels, as well as additional available meta-data. 
It is important to understand the data schema when adding new data samples. Furthermore, while developing machine learning models one may identify biases or peculiar behaviors that can be caused by how the training data was generated. Information about its source makes it possible to verify hypotheses and make necessary adjustments. 41 | 42 | **Path** 43 | 44 | Provide information about how to access the raw data, e.g. a website, a set of Hive tables, a Hadoop Distributed File System (HDFS) or S3 URI, etc... Make sure that you respect the data governance if access is restricted. 45 | 46 | #### Data Processing Methods 47 | 48 | **Description** 49 | 50 | Describe the data processing pipeline. The raw dataset is often in a format that is not appropriate for running the model training script directly. Here are some examples of common data processing methods: 51 | 52 | * The data is image URLs and images are downloaded. 53 | * The data and labels are contained in separate Hive tables which are joined. 54 | * The data is split into training, validation, and test sets. As performance is evaluated on the test set and is the ultimate metric to determine reproducibility, it is critical to have detailed information on how to build the test set. 55 | * The class distribution is highly skewed and the dataset is balanced in order for the classifier to better learn the rare classes. 56 | * For natural language data a dictionary of fixed size is computed and words are mapped to indices. 57 | 58 | **Code** 59 | 60 | Provide the code and instructions to run the data processing script. Include the git commit SHA-1 hash. 61 | 62 | **Path** 63 | 64 | Include a link to the processed data. It makes it possible to train the model without running the preprocessing scripts, which may take a long time. Note that data governance rules such as GDPR may not allow keeping a cache of the processed data, in which case the dataset should be reprocessed whenever we train a new model. 
If you can provide a link and access is restricted, make sure that you respect the applicable data governance rules.
65 |
66 | ### Training
67 |
68 | #### Optimization Methods
69 |
70 | **Description**
71 |
72 | Describe the model architecture and the loss function.
73 |
74 | **Code**
75 |
76 | Provide the code and instructions to run the training script. Include the git commit SHA-1 hash.
77 |
78 | **Hyperparameters**
79 |
80 | Provide details about the optimization hyperparameters, such as:
81 |
82 | * Batch size
83 | * Number of GPUs
84 | * Optimizer information and learning rate schedule, e.g. stochastic gradient descent, momentum, Adam, or RMSProp
85 | * Location of the pre-trained model (if the model is fine-tuned)
86 | * Data augmentation methods
87 |
88 | **Dynamics**
89 | Describe the training dynamics, such as the total training time, the evolution of the loss function on the training and validation sets, or the evolution of other relevant metrics such as accuracy. As training a model to completion may take several days, the dynamics provide a way to get earlier feedback about whether we are "on track" to reproduce the expected performance.
90 |
91 | **Outputs**
92 |
93 | Provide a link to the trained model, e.g. a URL or an HDFS or S3 URI.
94 |
95 | #### Performance Metrics
96 |
97 | **Target metrics**
98 | Provide target metrics, such as top-K accuracy, mean average precision, or BLEU score.
99 |
100 | **Code**
101 |
102 | Provide the code and instructions to run the evaluation script. Include the git commit SHA-1 hash.
103 |
104 | #### Inference
105 |
106 | **Code**
107 |
108 | Provide an example of how to run the model on a new data sample. This demonstrates how to combine the data processing steps and model inference, which is helpful for model deployment. It also makes it possible to explore the model's predictions interactively and get insight into its performance beyond the metrics provided above.
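
As an illustration of what such an example might look like, here is a minimal sketch that chains a preprocessing step and a model call and returns the top-K classes. Everything in it is an assumption made for illustration: `predict_top_k`, `toy_preprocess`, and `toy_model` are hypothetical stand-ins, not functions from this repository or from any library.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities in a numerically stable way."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_top_k(
    raw_sample: str,
    preprocess: Callable[[str], List[float]],
    model: Callable[[List[float]], List[float]],
    class_names: Sequence[str],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Run preprocessing, then the model, and return the k most likely classes."""
    features = preprocess(raw_sample)
    probabilities = softmax(model(features))
    ranked = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    return [(class_names[i], probabilities[i]) for i in ranked[:k]]


# Toy stand-ins (hypothetical): a real recipe would plug in its own
# preprocessing function and trained model here.
def toy_preprocess(text: str) -> List[float]:
    return [float(len(text)), float(text.count("a"))]


def toy_model(features: List[float]) -> List[float]:
    return [features[0] - features[1], features[1], 0.0]


top = predict_top_k("bananas", toy_preprocess, toy_model, ["long", "many_a", "other"], k=2)
for name, probability in top:
    print("%.2f\t%s" % (probability, name))
```

Keeping preprocessing and the model call behind a single function makes it harder for the deployed pipeline to drift away from the one used during evaluation.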
We recommend using Jupyter notebooks to show examples of how to run inference.
109 |
110 | **Timing information**
111 |
112 | Provide timing information along with relevant details such as the hardware (e.g. NVIDIA Tesla V100, Intel Core i7), batch size, and software version. Having access to these numbers makes it easier to plan for capacity and discuss product integrations.
113 |
114 | **Serialization**
115 |
116 | Provide the code and instructions to serialize the model. Include the git commit SHA-1 hash. Note that this is only required for models that are deployed to production in a format that differs from the output of the training script, e.g. ONNX, TFLite.
117 |
118 |
119 | #### Miscellaneous
120 |
121 | This section contains additional relevant information, such as academic papers, an experiment journal with notes detailing the experiments performed to reach the best performance, or a list of action items to improve the model or the code.
122 |
123 | ## Example
124 |
125 | Please refer to the training [recipe](recipe_resnet50_imagenet.md) for an example of how to train a ResNet-50 model on ImageNet.
126 |
127 | ## Contribute
128 |
129 | Please refer to [the contributing.md file](Contributing.md) for information about how to get involved. We welcome issues, questions, and pull requests.
130 |
131 | ## Maintainers
132 | Pierre Garrigues: garp@oath.com
133 |
134 | ## License
135 |
136 | This project is licensed under the terms of the [MIT](LICENSE-MIT) open source license.
137 |
--------------------------------------------------------------------------------
/inference.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
7 |     "# Inference notebook for the ResNet-50 model trained on ImageNet\n",
8 |     "\n",
9 |     "This notebook demonstrates inference with a ResNet-50 model trained on the ImageNet dataset. 
In this notebook you will be able to run inference on some sample images and measure the latency of the model.\n",
10 |     "\n",
11 |     "## First let's set some variables\n",
12 |     "\n",
13 |     "Fill in the values below for the path to your code and model checkpoint, and for the GPU you will be using (specify -1 to use the CPU). You can refer to the training recipe to access a checkpoint if you haven't trained the model yourself."
14 |    ]
15 |   },
16 |   {
17 |    "cell_type": "code",
18 |    "execution_count": 1,
19 |    "metadata": {},
20 |    "outputs": [
21 |     {
22 |      "name": "stderr",
23 |      "output_type": "stream",
24 |      "text": [
25 |       "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
26 |       " from ._conv import register_converters as _register_converters\n"
27 |      ]
28 |     }
29 |    ],
30 |    "source": [
31 |     "# specify the gpu to use\n",
32 |     "import os\n",
33 |     "os.environ['CUDA_VISIBLE_DEVICES'] = '0'\n",
34 |     "\n",
35 |     "# path to tensorflow/models repository\n",
36 |     "import sys\n",
37 |     "sys.path.append(\"/ebs/code/models/\")\n",
38 |     "\n",
39 |     "import tensorflow as tf\n",
40 |     "import numpy as np"
41 |    ]
42 |   },
43 |   {
44 |    "cell_type": "code",
45 |    "execution_count": 2,
46 |    "metadata": {},
47 |    "outputs": [
48 |     {
49 |      "name": "stdout",
50 |      "output_type": "stream",
51 |      "text": [
52 |       "loading checkpoint ./output-v4/model.ckpt-500000\n"
53 |      ]
54 |     }
55 |    ],
56 |    "source": [
57 |     "CHECKPOINT_PATH = \"./output-v4/\"\n",
58 |     "# select checkpoint\n",
59 |     "checkpoint_file = os.path.join(CHECKPOINT_PATH, 'checkpoint')\n",
60 |     "with open(checkpoint_file,'r') as chkf:\n",
61 |     "    last_chkpnt = chkf.readline().split(' ')[1][1:-2]\n",
62 |     "    last_chkpnt = os.path.join(CHECKPOINT_PATH, last_chkpnt)\n",
63 |     "    print(\"loading checkpoint\", last_chkpnt)"
64 |    ]
65 |   },
66 | 
{
67 |    "cell_type": "markdown",
68 |    "metadata": {},
69 |    "source": [
70 |     "## Load the model"
71 |    ]
72 |   },
73 |   {
74 |    "cell_type": "code",
75 |    "execution_count": 3,
76 |    "metadata": {},
77 |    "outputs": [],
78 |    "source": [
79 |     "from official.resnet import resnet_model\n",
80 |     "from preprocess import preprocess_for_eval, _DEFAULT_IMAGE_SIZE, _RESIZE_SIDE_MIN\n",
81 |     "\n",
82 |     "image_data = tf.placeholder(tf.string)\n",
83 |     "\n",
84 |     "features = preprocess_for_eval(image_data, _DEFAULT_IMAGE_SIZE, _DEFAULT_IMAGE_SIZE, _RESIZE_SIDE_MIN)\n",
85 |     "features = tf.expand_dims(features, 0)\n",
86 |     "\n",
87 |     "model = resnet_model.Model(\n",
88 |     "    resnet_size=50,\n",
89 |     "    bottleneck=True,\n",
90 |     "    num_classes=1000,\n",
91 |     "    num_filters=64,\n",
92 |     "    kernel_size=7,\n",
93 |     "    conv_stride=2,\n",
94 |     "    first_pool_size=3,\n",
95 |     "    first_pool_stride=2,\n",
96 |     "    block_sizes=[3, 4, 6, 3],\n",
97 |     "    block_strides=[1, 2, 2, 2],\n",
98 |     "    final_size=2048,\n",
99 |     "    resnet_version=2,\n",
100 |     "    data_format=None,\n",
101 |     "    dtype=tf.float32)\n",
102 |     "\n",
103 |     "logits = model(features, False)\n"
104 |    ]
105 |   },
106 |   {
107 |    "cell_type": "code",
108 |    "execution_count": 4,
109 |    "metadata": {},
110 |    "outputs": [
111 |     {
112 |      "name": "stdout",
113 |      "output_type": "stream",
114 |      "text": [
115 |       "INFO:tensorflow:Restoring parameters from ./output-v4/model.ckpt-500000\n"
116 |      ]
117 |     }
118 |    ],
119 |    "source": [
120 |     "sess = tf.Session()\n",
121 |     "saver = tf.train.Saver()\n",
122 |     "saver.restore(sess, last_chkpnt)"
123 |    ]
124 |   },
125 |   {
126 |    "cell_type": "code",
127 |    "execution_count": 5,
128 |    "metadata": {},
129 |    "outputs": [],
130 |    "source": [
131 |     "# load imagenet information\n",
132 |     "\n",
133 |     "with open(\"/ebs/code/models/research/inception/inception/data/imagenet_lsvrc_2015_synsets.txt\", \"r\") as fh:\n",
134 |     "    synsets = [l.strip() for l in fh]\n",
135 |     " \n",
136 |     "with 
open(\"/ebs/code/models/research/inception/inception/data/imagenet_metadata.txt\", \"r\") as fh:\n",
137 |     "    synset2name = dict([l.strip().split(\"\\t\") for l in fh])\n"
138 |    ]
139 |   },
140 |   {
141 |    "cell_type": "markdown",
142 |    "metadata": {},
143 |    "source": [
144 |     "## Run inference on sample images\n",
145 |     "\n",
146 |     "Modify the filename below to try out the model on some sample images. We will show the top-K predictions of the model."
147 |    ]
148 |   },
149 |   {
150 |    "cell_type": "code",
151 |    "execution_count": 6,
152 |    "metadata": {},
153 |    "outputs": [],
154 |    "source": [
155 |     "IMAGE_FILENAME = \"./puppy.jpg\""
156 |    ]
157 |   },
158 |   {
159 |    "cell_type": "code",
160 |    "execution_count": 7,
161 |    "metadata": {},
162 |    "outputs": [
163 |     {
164 |      "data": {
165 |       "image/jpeg": "\n",
166 |       "text/plain": [
167 |        ""
168 |       ]
169 |      },
170 |      "metadata": {
171 |       "image/jpeg": {
172 |        "width": 500
173 |       }
174 |      },
175 |      "output_type": "display_data"
176 |     },
177 |     {
178 |      "name": "stdout",
179 |      "output_type": "stream",
180 |      "text": [
181 |       "15.06\tLabrador retriever\n",
182 |       "13.18\tChesapeake Bay retriever\n",
183 |       "9.48\tSamoyed, Samoyede\n",
184 |       "9.25\tthatch, thatched roof\n",
185 |       "8.81\tschipperke\n"
186 |      ]
187 |     }
188 |    ],
189 |    "source": [
190 |     "from IPython.display import Image, display\n",
191 |     "display(Image(filename=IMAGE_FILENAME, width=500))\n",
192 |     "\n",
193 |     "with open(IMAGE_FILENAME, \"rb\") as fh:\n",
194 |     "    x = fh.read()\n",
195 |     "r = sess.run([logits], feed_dict={image_data: x})\n",
196 |     "l = r[0].flatten()\n",
197 |     "\n",
198 |     "for i in np.argsort(l)[::-1][:5]:\n",
199 |     "    print('%.2f\\t%s' % (l[i], synset2name[synsets[i+1]]))"
200 |    ]
201 |   },
202 |   {
203 |    "cell_type": "markdown",
204 |    "metadata": {},
205 |    "source": [
206 |     "## Measure the model latency\n",
207 |     "\n",
208 |     "We measure how long it takes to run inference on the model with a batch size of 1. This scenario is close to how the model is deployed in production. 
In order to benchmark performance on the CPU, you can set the CUDA_VISIBLE_DEVICES environment variable to -1.\n",
209 |     "\n",
210 |     "Measurements on an AWS p3.2xlarge machine\n",
211 |     "* GPU: avg time = 11ms (+/- 0ms)"
212 |    ]
213 |   },
214 |   {
215 |    "cell_type": "code",
216 |    "execution_count": 8,
217 |    "metadata": {},
218 |    "outputs": [
219 |     {
220 |      "name": "stdout",
221 |      "output_type": "stream",
222 |      "text": [
223 |       "avg time = 11ms (+/- 0ms)\n"
224 |      ]
225 |     }
226 |    ],
227 |    "source": [
228 |     "from time import time as now\n",
229 |     "\n",
230 |     "num_steps = 20\n",
231 |     "num_steps_burn_in = 5\n",
232 |     "\n",
233 |     "durations = []\n",
234 |     "for i in range(num_steps + num_steps_burn_in):\n",
235 |     "    start_time = now()\n",
236 |     "    _ = sess.run([logits], feed_dict={image_data: x})\n",
237 |     "    duration = now() - start_time\n",
238 |     "    if i >= num_steps_burn_in:\n",
239 |     "        durations.append(duration)\n",
240 |     "durations = np.array(durations)\n",
241 |     "\n",
242 |     "print(\"avg time = %dms (+/- %dms)\" % (1000*durations.mean(), 1000*durations.std()))"
243 |    ]
244 |   },
245 |   {
246 |    "cell_type": "code",
247 |    "execution_count": null,
248 |    "metadata": {},
249 |    "outputs": [],
250 |    "source": []
251 |   }
252 |  ],
253 |  "metadata": {
254 |   "kernelspec": {
255 |    "display_name": "Environment (conda_tensorflow_p36)",
256 |    "language": "python",
257 |    "name": "conda_tensorflow_p36"
258 |   },
259 |   "language_info": {
260 |    "codemirror_mode": {
261 |     "name": "ipython",
262 |     "version": 3
263 |    },
264 |    "file_extension": ".py",
265 |    "mimetype": "text/x-python",
266 |    "name": "python",
267 |    "nbconvert_exporter": "python",
268 |    "pygments_lexer": "ipython3",
269 |    "version": "3.6.6"
270 |   }
271 |  },
272 |  "nbformat": 4,
273 |  "nbformat_minor": 2
274 | }
275 |
--------------------------------------------------------------------------------