├── .gitignore ├── README.md ├── loss.py ├── model_builder.py ├── retinanet.py ├── retinanet_50_train.config ├── retinanet_feature_extractor.py └── train.sh /.gitignore: -------------------------------------------------------------------------------- 1 | # Created by .ignore support plugin (hsz.mobi) 2 | *.pyc 3 | .idea 4 | model1 5 | data -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## RetinaNet TensorFlow version 2 | Unofficial implementation of [RetinaNet](https://arxiv.org/abs/1708.02002) in TensorFlow. **NOTE** this project is written for practice, so please don't hesitate to report an issue if you find something wrong. 3 | 4 | The TF models [object detection api](https://github.com/tensorflow/models/tree/master/research/object_detection) has integrated FPN into its framework, and ssd_resnet50_v1_fpn is effectively RetinaNet. You can dig into ssd_resnet_v1_fpn_feature_extractor in `models` for coding details. 5 | 6 | Since this work was built on the object detection api from the beginning, only the RetinaNet backbone, the loss and a custom retinanet_feature_extractor are kept here in the api's standard format. To make it work, here are the steps: 7 | - Download the tensorflow [models](https://github.com/tensorflow/models) repository and install the object detection api following [these instructions](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/installation.md). 8 | - Add the retinanet feature extractors to `model_builder.py`: 9 | ```python3 10 | from object_detection.models.retinanet_feature_extractor import RetinaNet50FeatureExtractor, RetinaNet101FeatureExtractor 11 | 12 | SSD_FEATURE_EXTRACTOR_CLASS_MAP = { 13 | ... 14 | 'retinanet_50': RetinaNet50FeatureExtractor, 15 | 'retinanet_101': RetinaNet101FeatureExtractor, 16 | } 17 | ``` 18 | - Put `retinanet_feature_extractor.py` and `retinanet.py` under `models`. 19 | - Modify `retinanet_50_train.config` and `train.sh` with your custom settings and data inputs, then run `train.sh` to start training. A minimal registration sanity check is sketched below.
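After registering the extractors, a quick way to confirm that the `retinanet_50` mapping resolves is to parse the pipeline config and build the model directly. This is an illustrative sketch only (not a file in this repo); it assumes TF 1.x and that the object detection api is on your `PYTHONPATH`:

```python3
# Hypothetical sanity check, not part of this repository.
import tensorflow as tf
from google.protobuf import text_format
from object_detection.builders import model_builder
from object_detection.protos import pipeline_pb2

# Parse the training pipeline config shipped with this repo.
pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
with tf.gfile.GFile('retinanet_50_train.config', 'r') as f:
    text_format.Merge(f.read(), pipeline_config)

# 'retinanet_50' should now resolve through SSD_FEATURE_EXTRACTOR_CLASS_MAP.
detection_model = model_builder.build(pipeline_config.model, is_training=True)
print(type(detection_model))  # expect an SSDMetaArch instance
```

If the map entry is missing, `model_builder.build` raises `ValueError: Unknown ssd feature_extractor: retinanet_50`.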
-------------------------------------------------------------------------------- /loss.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | def focal_loss(logits, onehot_labels, weights, alpha=0.25, gamma=2.0): 5 | """ 6 | Compute sigmoid focal loss between logits and onehot labels: focal loss = -alpha_t*(1-pt)^gamma*log(pt) 7 | 8 | Args: 9 | onehot_labels: onehot labels with shape (batch_size, num_anchors, num_classes) 10 | logits: last layer feature output with shape (batch_size, num_anchors, num_classes) 11 | weights: weight tensor returned from target assigner with shape [batch_size, num_anchors] 12 | alpha: The hyperparameter for balancing positive and negative examples, default is 0.25 13 | gamma: The hyperparameter for down-weighting easy examples, default is 2.0 14 | Returns: 15 | a scalar tensor of the total classification focal loss 16 | """ 17 | with tf.name_scope("focal_loss"): 18 | logits = tf.cast(logits, tf.float32) 19 | onehot_labels = tf.cast(onehot_labels, tf.float32) 20 | ce = tf.nn.sigmoid_cross_entropy_with_logits(labels=onehot_labels, logits=logits) 21 | predictions = tf.sigmoid(logits) 22 | predictions_pt = tf.where(tf.equal(onehot_labels, 1), predictions, 1.-predictions) 23 | # alpha_t: alpha for positive anchors, (1 - alpha) for negative anchors 24 | alpha_t = tf.scalar_mul(alpha, tf.ones_like(onehot_labels, dtype=tf.float32)) 25 | alpha_t = tf.where(tf.equal(onehot_labels, 1.0), alpha_t, 1-alpha_t) 26 | weighted_loss = ce * tf.pow(1-predictions_pt, gamma) * alpha_t * tf.expand_dims(weights, axis=2) 27 | return tf.reduce_sum(weighted_loss) 28 | 29 | 30 | def regression_loss(pred_boxes, gt_boxes, weights, delta=1.0): 31 | """ 32 | Regression loss (smooth L1 loss, also known as Huber loss) 33 | 34 | Args: 35 | pred_boxes: [batch_size, num_anchors, 4] 36 | gt_boxes: [batch_size, num_anchors, 4] 37 | weights: Tensor of weights multiplied by loss with shape [batch_size, num_anchors] 38 | delta: delta for smooth L1 loss 39 | Returns: 40 | a box regression loss scalar 41 | """ 42 | loss = tf.reduce_sum(tf.losses.huber_loss(predictions=pred_boxes, 43 | labels=gt_boxes, 44 | delta=delta, 45 | weights=tf.expand_dims(weights, axis=2), 46 | scope='box_loss', 47 | reduction=tf.losses.Reduction.NONE)) 48 | return loss 49 | 50 | 51 | -------------------------------------------------------------------------------- /model_builder.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License.
14 | # ============================================================================== 15 | 16 | """A function to build a DetectionModel from configuration.""" 17 | 18 | import functools 19 | from object_detection.builders import anchor_generator_builder 20 | from object_detection.builders import box_coder_builder 21 | from object_detection.builders import box_predictor_builder 22 | from object_detection.builders import hyperparams_builder 23 | from object_detection.builders import image_resizer_builder 24 | from object_detection.builders import losses_builder 25 | from object_detection.builders import matcher_builder 26 | from object_detection.builders import post_processing_builder 27 | from object_detection.builders import region_similarity_calculator_builder as sim_calc 28 | from object_detection.core import balanced_positive_negative_sampler as sampler 29 | from object_detection.core import post_processing 30 | from object_detection.core import target_assigner 31 | from object_detection.meta_architectures import faster_rcnn_meta_arch 32 | from object_detection.meta_architectures import rfcn_meta_arch 33 | from object_detection.meta_architectures import ssd_meta_arch 34 | from object_detection.meta_architectures import ssd_meta_arch_keras 35 | from object_detection.models import faster_rcnn_inception_resnet_v2_feature_extractor as frcnn_inc_res 36 | from object_detection.models import faster_rcnn_inception_resnet_v2_keras_feature_extractor as frcnn_inc_res_keras 37 | from object_detection.models import faster_rcnn_inception_v2_feature_extractor as frcnn_inc_v2 38 | from object_detection.models import faster_rcnn_nas_feature_extractor as frcnn_nas 39 | from object_detection.models import faster_rcnn_pnas_feature_extractor as frcnn_pnas 40 | from object_detection.models import faster_rcnn_resnet_v1_feature_extractor as frcnn_resnet_v1 41 | from object_detection.models import ssd_resnet_v1_fpn_feature_extractor as ssd_resnet_v1_fpn 42 | from object_detection.models import ssd_resnet_v1_ppn_feature_extractor as ssd_resnet_v1_ppn 43 | from object_detection.models.embedded_ssd_mobilenet_v1_feature_extractor import EmbeddedSSDMobileNetV1FeatureExtractor 44 | from object_detection.models.ssd_inception_v2_feature_extractor import SSDInceptionV2FeatureExtractor 45 | from object_detection.models.ssd_inception_v3_feature_extractor import SSDInceptionV3FeatureExtractor 46 | from object_detection.models.ssd_mobilenet_v1_feature_extractor import SSDMobileNetV1FeatureExtractor 47 | from object_detection.models.ssd_mobilenet_v1_fpn_feature_extractor import SSDMobileNetV1FpnFeatureExtractor 48 | from object_detection.models.ssd_mobilenet_v1_fpn_keras_feature_extractor import SSDMobileNetV1FpnKerasFeatureExtractor 49 | from object_detection.models.ssd_mobilenet_v1_keras_feature_extractor import SSDMobileNetV1KerasFeatureExtractor 50 | from object_detection.models.ssd_mobilenet_v1_ppn_feature_extractor import SSDMobileNetV1PpnFeatureExtractor 51 | from object_detection.models.ssd_mobilenet_v2_feature_extractor import SSDMobileNetV2FeatureExtractor 52 | from object_detection.models.ssd_mobilenet_v2_fpn_feature_extractor import SSDMobileNetV2FpnFeatureExtractor 53 | from object_detection.models.ssd_mobilenet_v2_fpn_keras_feature_extractor import SSDMobileNetV2FpnKerasFeatureExtractor 54 | from object_detection.models.ssd_mobilenet_v2_keras_feature_extractor import SSDMobileNetV2KerasFeatureExtractor 55 | from object_detection.models.ssd_pnasnet_feature_extractor import SSDPNASNetFeatureExtractor 56 | from 
object_detection.models.ssd_antialiased_resnet_v1_fpn_feature_extractor import SSDAntResnet50V1FpnFeatureExtractor 57 | from object_detection.models.retinanet_feature_extractor import RetinaNet50FeatureExtractor, RetinaNet101FeatureExtractor 58 | from object_detection.predictors import rfcn_box_predictor 59 | from object_detection.predictors import rfcn_keras_box_predictor 60 | from object_detection.predictors.heads import mask_head 61 | from object_detection.protos import model_pb2 62 | from object_detection.utils import ops 63 | 64 | # A map of names to SSD feature extractors. 65 | SSD_FEATURE_EXTRACTOR_CLASS_MAP = { 66 | 'ssd_inception_v2': SSDInceptionV2FeatureExtractor, 67 | 'ssd_inception_v3': SSDInceptionV3FeatureExtractor, 68 | 'ssd_mobilenet_v1': SSDMobileNetV1FeatureExtractor, 69 | 'ssd_mobilenet_v1_fpn': SSDMobileNetV1FpnFeatureExtractor, 70 | 'ssd_mobilenet_v1_ppn': SSDMobileNetV1PpnFeatureExtractor, 71 | 'ssd_mobilenet_v2': SSDMobileNetV2FeatureExtractor, 72 | 'ssd_mobilenet_v2_fpn': SSDMobileNetV2FpnFeatureExtractor, 73 | 'ssd_resnet18_v1_fpn': ssd_resnet_v1_fpn.SSDResnet18V1FpnFeatureExtractor, 74 | 'ssd_resnet22_v1_fpn': ssd_resnet_v1_fpn.SSDResnet22V1FpnFeatureExtractor, 75 | 'ssd_resnet50_v1_fpn': ssd_resnet_v1_fpn.SSDResnet50V1FpnFeatureExtractor, 76 | 'ssd_resnet101_v1_fpn': ssd_resnet_v1_fpn.SSDResnet101V1FpnFeatureExtractor, 77 | 'ssd_resnet152_v1_fpn': ssd_resnet_v1_fpn.SSDResnet152V1FpnFeatureExtractor, 78 | 'ssd_resnet50_v1_ppn': ssd_resnet_v1_ppn.SSDResnet50V1PpnFeatureExtractor, 79 | 'ssd_resnet101_v1_ppn': 80 | ssd_resnet_v1_ppn.SSDResnet101V1PpnFeatureExtractor, 81 | 'ssd_resnet152_v1_ppn': 82 | ssd_resnet_v1_ppn.SSDResnet152V1PpnFeatureExtractor, 83 | 'embedded_ssd_mobilenet_v1': EmbeddedSSDMobileNetV1FeatureExtractor, 84 | 'ssd_pnasnet': SSDPNASNetFeatureExtractor, 85 | 'ssd_ant_resnet50_v1_fpn': SSDAntResnet50V1FpnFeatureExtractor, 86 | 'retinanet_50': RetinaNet50FeatureExtractor, 87 | 'retinanet_101': RetinaNet101FeatureExtractor 88 | } 89 | 90 | SSD_KERAS_FEATURE_EXTRACTOR_CLASS_MAP = { 91 | 'ssd_mobilenet_v1_keras': SSDMobileNetV1KerasFeatureExtractor, 92 | 'ssd_mobilenet_v1_fpn_keras': SSDMobileNetV1FpnKerasFeatureExtractor, 93 | 'ssd_mobilenet_v2_keras': SSDMobileNetV2KerasFeatureExtractor, 94 | 'ssd_mobilenet_v2_fpn_keras': SSDMobileNetV2FpnKerasFeatureExtractor, 95 | } 96 | 97 | # A map of names to Faster R-CNN feature extractors. 98 | FASTER_RCNN_FEATURE_EXTRACTOR_CLASS_MAP = { 99 | 'faster_rcnn_nas': 100 | frcnn_nas.FasterRCNNNASFeatureExtractor, 101 | 'faster_rcnn_pnas': 102 | frcnn_pnas.FasterRCNNPNASFeatureExtractor, 103 | 'faster_rcnn_inception_resnet_v2': 104 | frcnn_inc_res.FasterRCNNInceptionResnetV2FeatureExtractor, 105 | 'faster_rcnn_inception_v2': 106 | frcnn_inc_v2.FasterRCNNInceptionV2FeatureExtractor, 107 | 'faster_rcnn_resnet50': 108 | frcnn_resnet_v1.FasterRCNNResnet50FeatureExtractor, 109 | 'faster_rcnn_resnet101': 110 | frcnn_resnet_v1.FasterRCNNResnet101FeatureExtractor, 111 | 'faster_rcnn_resnet152': 112 | frcnn_resnet_v1.FasterRCNNResnet152FeatureExtractor, 113 | } 114 | 115 | FASTER_RCNN_KERAS_FEATURE_EXTRACTOR_CLASS_MAP = { 116 | 'faster_rcnn_inception_resnet_v2_keras': 117 | frcnn_inc_res_keras.FasterRCNNInceptionResnetV2KerasFeatureExtractor, 118 | } 119 | 120 | 121 | def build(model_config, is_training, add_summaries=True): 122 | """Builds a DetectionModel based on the model config. 123 | 124 | Args: 125 | model_config: A model.proto object containing the config for the desired 126 | DetectionModel. 
127 | is_training: True if this model is being built for training purposes. 128 | add_summaries: Whether to add tensorflow summaries in the model graph. 129 | Returns: 130 | DetectionModel based on the config. 131 | 132 | Raises: 133 | ValueError: On invalid meta architecture or model. 134 | """ 135 | if not isinstance(model_config, model_pb2.DetectionModel): 136 | raise ValueError('model_config not of type model_pb2.DetectionModel.') 137 | meta_architecture = model_config.WhichOneof('model') 138 | if meta_architecture == 'ssd': 139 | return _build_ssd_model(model_config.ssd, is_training, add_summaries) 140 | if meta_architecture == 'faster_rcnn': 141 | return _build_faster_rcnn_model(model_config.faster_rcnn, is_training, 142 | add_summaries) 143 | raise ValueError('Unknown meta architecture: {}'.format(meta_architecture)) 144 | 145 | 146 | def _build_ssd_feature_extractor(feature_extractor_config, 147 | is_training, 148 | freeze_batchnorm, 149 | reuse_weights=None): 150 | """Builds a ssd_meta_arch.SSDFeatureExtractor based on config. 151 | 152 | Args: 153 | feature_extractor_config: A SSDFeatureExtractor proto config from ssd.proto. 154 | is_training: True if this feature extractor is being built for training. 155 | freeze_batchnorm: Whether to freeze batch norm parameters during 156 | training or not. When training with a small batch size (e.g. 1), it is 157 | desirable to freeze batch norm update and use pretrained batch norm 158 | params. 159 | reuse_weights: if the feature extractor should reuse weights. 160 | 161 | Returns: 162 | ssd_meta_arch.SSDFeatureExtractor based on config. 163 | 164 | Raises: 165 | ValueError: On invalid feature extractor type. 166 | """ 167 | feature_type = feature_extractor_config.type 168 | is_keras_extractor = feature_type in SSD_KERAS_FEATURE_EXTRACTOR_CLASS_MAP 169 | depth_multiplier = feature_extractor_config.depth_multiplier 170 | min_depth = feature_extractor_config.min_depth 171 | pad_to_multiple = feature_extractor_config.pad_to_multiple 172 | use_explicit_padding = feature_extractor_config.use_explicit_padding 173 | use_depthwise = feature_extractor_config.use_depthwise 174 | use_antialias = feature_extractor_config.use_antialias 175 | # parameters only for student net 176 | if "ssd_mobilenet_v3" in feature_type: 177 | network_version = feature_extractor_config.network_version 178 | min_feature_level = feature_extractor_config.min_feature_level 179 | max_feature_level = feature_extractor_config.max_feature_level 180 | additional_layer_depth = feature_extractor_config.additional_layer_depth 181 | 182 | if is_keras_extractor: 183 | conv_hyperparams = hyperparams_builder.KerasLayerHyperparams( 184 | feature_extractor_config.conv_hyperparams) 185 | else: 186 | conv_hyperparams = hyperparams_builder.build( 187 | feature_extractor_config.conv_hyperparams, is_training) 188 | override_base_feature_extractor_hyperparams = ( 189 | feature_extractor_config.override_base_feature_extractor_hyperparams) 190 | 191 | if (feature_type not in SSD_FEATURE_EXTRACTOR_CLASS_MAP) and ( 192 | not is_keras_extractor): 193 | raise ValueError('Unknown ssd feature_extractor: {}'.format(feature_type)) 194 | 195 | if is_keras_extractor: 196 | feature_extractor_class = SSD_KERAS_FEATURE_EXTRACTOR_CLASS_MAP[feature_type] 197 | else: 198 | feature_extractor_class = SSD_FEATURE_EXTRACTOR_CLASS_MAP[feature_type] 199 | kwargs = { 200 | 'is_training': 201 | is_training, 202 | 'depth_multiplier': 203 | depth_multiplier, 204 | 'min_depth': 205 | min_depth, 206 | 'pad_to_multiple': 207 
| pad_to_multiple, 208 | 'use_explicit_padding': 209 | use_explicit_padding, 210 | 'use_depthwise': 211 | use_depthwise, 212 | 'override_base_feature_extractor_hyperparams': 213 | override_base_feature_extractor_hyperparams, 214 | } 215 | if "ssd_mobilenet_v3" in feature_type: 216 | kwargs.update({'network_version': network_version, 217 | 'min_feature_level': min_feature_level, 218 | 'max_feature_level': max_feature_level, 219 | 'additional_layer_depth': additional_layer_depth, 220 | "use_antialias": use_antialias}) 221 | if feature_extractor_config.HasField('replace_preprocessor_with_placeholder'): 222 | kwargs.update({ 223 | 'replace_preprocessor_with_placeholder': 224 | feature_extractor_config.replace_preprocessor_with_placeholder 225 | }) 226 | 227 | if is_keras_extractor: 228 | kwargs.update({ 229 | 'conv_hyperparams': conv_hyperparams, 230 | 'inplace_batchnorm_update': False, 231 | 'freeze_batchnorm': freeze_batchnorm 232 | }) 233 | else: 234 | kwargs.update({ 235 | 'conv_hyperparams_fn': conv_hyperparams, 236 | 'reuse_weights': reuse_weights, 237 | }) 238 | 239 | if feature_extractor_config.HasField('fpn'): 240 | kwargs.update({ 241 | 'fpn_min_level': 242 | feature_extractor_config.fpn.min_level, 243 | 'fpn_max_level': 244 | feature_extractor_config.fpn.max_level, 245 | 'additional_layer_depth': 246 | feature_extractor_config.fpn.additional_layer_depth, 247 | }) 248 | return feature_extractor_class(**kwargs) 249 | 250 | 251 | def _build_ssd_model(ssd_config, is_training, add_summaries): 252 | """Builds an SSD detection model based on the model config. 253 | 254 | Args: 255 | ssd_config: A ssd.proto object containing the config for the desired 256 | SSDMetaArch. 257 | is_training: True if this model is being built for training purposes. 258 | add_summaries: Whether to add tf summaries in the model. 259 | Returns: 260 | SSDMetaArch based on the config. 261 | 262 | Raises: 263 | ValueError: If ssd_config.type is not recognized (i.e. not registered in 264 | model_class_map). 
265 | """ 266 | num_classes = ssd_config.num_classes 267 | feature_extractor = _build_ssd_feature_extractor( 268 | feature_extractor_config=ssd_config.feature_extractor, 269 | freeze_batchnorm=ssd_config.freeze_batchnorm, 270 | is_training=is_training) 271 | box_coder = box_coder_builder.build(ssd_config.box_coder) 272 | matcher = matcher_builder.build(ssd_config.matcher) 273 | region_similarity_calculator = sim_calc.build( 274 | ssd_config.similarity_calculator) 275 | encode_background_as_zeros = ssd_config.encode_background_as_zeros 276 | negative_class_weight = ssd_config.negative_class_weight 277 | anchor_generator = anchor_generator_builder.build( 278 | ssd_config.anchor_generator) 279 | if feature_extractor.is_keras_model: 280 | ssd_box_predictor = box_predictor_builder.build_keras( 281 | hyperparams_fn=hyperparams_builder.KerasLayerHyperparams, 282 | freeze_batchnorm=ssd_config.freeze_batchnorm, 283 | inplace_batchnorm_update=False, 284 | num_predictions_per_location_list=anchor_generator 285 | .num_anchors_per_location(), 286 | box_predictor_config=ssd_config.box_predictor, 287 | is_training=is_training, 288 | num_classes=num_classes, 289 | add_background_class=ssd_config.add_background_class) 290 | else: 291 | ssd_box_predictor = box_predictor_builder.build( 292 | hyperparams_builder.build, ssd_config.box_predictor, is_training, 293 | num_classes, ssd_config.add_background_class) 294 | image_resizer_fn = image_resizer_builder.build(ssd_config.image_resizer) 295 | non_max_suppression_fn, score_conversion_fn = post_processing_builder.build( 296 | ssd_config.post_processing) 297 | (classification_loss, localization_loss, classification_weight, 298 | localization_weight, hard_example_miner, random_example_sampler, 299 | expected_loss_weights_fn) = losses_builder.build(ssd_config.loss) 300 | normalize_loss_by_num_matches = ssd_config.normalize_loss_by_num_matches 301 | normalize_loc_loss_by_codesize = ssd_config.normalize_loc_loss_by_codesize 302 | 303 | equalization_loss_config = ops.EqualizationLossConfig( 304 | weight=ssd_config.loss.equalization_loss.weight, 305 | exclude_prefixes=ssd_config.loss.equalization_loss.exclude_prefixes) 306 | 307 | target_assigner_instance = target_assigner.TargetAssigner( 308 | region_similarity_calculator, 309 | matcher, 310 | box_coder, 311 | negative_class_weight=negative_class_weight) 312 | kwargs = {} 313 | ssd_meta_arch_fn = ssd_meta_arch.SSDMetaArch 314 | kwargs.update({"feature_extractor": feature_extractor}) 315 | 316 | return ssd_meta_arch_fn( 317 | is_training=is_training, 318 | anchor_generator=anchor_generator, 319 | box_predictor=ssd_box_predictor, 320 | box_coder=box_coder, 321 | encode_background_as_zeros=encode_background_as_zeros, 322 | image_resizer_fn=image_resizer_fn, 323 | non_max_suppression_fn=non_max_suppression_fn, 324 | score_conversion_fn=score_conversion_fn, 325 | classification_loss=classification_loss, 326 | localization_loss=localization_loss, 327 | classification_loss_weight=classification_weight, 328 | localization_loss_weight=localization_weight, 329 | normalize_loss_by_num_matches=normalize_loss_by_num_matches, 330 | hard_example_miner=hard_example_miner, 331 | target_assigner_instance=target_assigner_instance, 332 | add_summaries=add_summaries, 333 | normalize_loc_loss_by_codesize=normalize_loc_loss_by_codesize, 334 | freeze_batchnorm=ssd_config.freeze_batchnorm, 335 | inplace_batchnorm_update=ssd_config.inplace_batchnorm_update, 336 | add_background_class=ssd_config.add_background_class, 337 | 
explicit_background_class=ssd_config.explicit_background_class, 338 | random_example_sampler=random_example_sampler, 339 | expected_loss_weights_fn=expected_loss_weights_fn, 340 | use_confidences_as_targets=ssd_config.use_confidences_as_targets, 341 | implicit_example_weight=ssd_config.implicit_example_weight, 342 | equalization_loss_config=equalization_loss_config, 343 | **kwargs) 344 | 345 | 346 | def _build_faster_rcnn_feature_extractor( 347 | feature_extractor_config, is_training, reuse_weights=None, 348 | inplace_batchnorm_update=False): 349 | """Builds a faster_rcnn_meta_arch.FasterRCNNFeatureExtractor based on config. 350 | 351 | Args: 352 | feature_extractor_config: A FasterRcnnFeatureExtractor proto config from 353 | faster_rcnn.proto. 354 | is_training: True if this feature extractor is being built for training. 355 | reuse_weights: if the feature extractor should reuse weights. 356 | inplace_batchnorm_update: Whether to update batch_norm inplace during 357 | training. This is required for batch norm to work correctly on TPUs. When 358 | this is false, user must add a control dependency on 359 | tf.GraphKeys.UPDATE_OPS for train/loss op in order to update the batch 360 | norm moving average parameters. 361 | 362 | Returns: 363 | faster_rcnn_meta_arch.FasterRCNNFeatureExtractor based on config. 364 | 365 | Raises: 366 | ValueError: On invalid feature extractor type. 367 | """ 368 | if inplace_batchnorm_update: 369 | raise ValueError('inplace batchnorm updates not supported.') 370 | feature_type = feature_extractor_config.type 371 | first_stage_features_stride = ( 372 | feature_extractor_config.first_stage_features_stride) 373 | batch_norm_trainable = feature_extractor_config.batch_norm_trainable 374 | 375 | if feature_type not in FASTER_RCNN_FEATURE_EXTRACTOR_CLASS_MAP: 376 | raise ValueError('Unknown Faster R-CNN feature_extractor: {}'.format( 377 | feature_type)) 378 | feature_extractor_class = FASTER_RCNN_FEATURE_EXTRACTOR_CLASS_MAP[ 379 | feature_type] 380 | return feature_extractor_class( 381 | is_training, first_stage_features_stride, 382 | batch_norm_trainable, reuse_weights=reuse_weights) 383 | 384 | 385 | def _build_faster_rcnn_keras_feature_extractor( 386 | feature_extractor_config, is_training, 387 | inplace_batchnorm_update=False): 388 | """Builds a faster_rcnn_meta_arch.FasterRCNNKerasFeatureExtractor from config. 389 | 390 | Args: 391 | feature_extractor_config: A FasterRcnnFeatureExtractor proto config from 392 | faster_rcnn.proto. 393 | is_training: True if this feature extractor is being built for training. 394 | inplace_batchnorm_update: Whether to update batch_norm inplace during 395 | training. This is required for batch norm to work correctly on TPUs. When 396 | this is false, user must add a control dependency on 397 | tf.GraphKeys.UPDATE_OPS for train/loss op in order to update the batch 398 | norm moving average parameters. 399 | 400 | Returns: 401 | faster_rcnn_meta_arch.FasterRCNNKerasFeatureExtractor based on config. 402 | 403 | Raises: 404 | ValueError: On invalid feature extractor type. 
405 | """ 406 | if inplace_batchnorm_update: 407 | raise ValueError('inplace batchnorm updates not supported.') 408 | feature_type = feature_extractor_config.type 409 | first_stage_features_stride = ( 410 | feature_extractor_config.first_stage_features_stride) 411 | batch_norm_trainable = feature_extractor_config.batch_norm_trainable 412 | 413 | if feature_type not in FASTER_RCNN_KERAS_FEATURE_EXTRACTOR_CLASS_MAP: 414 | raise ValueError('Unknown Faster R-CNN feature_extractor: {}'.format( 415 | feature_type)) 416 | feature_extractor_class = FASTER_RCNN_KERAS_FEATURE_EXTRACTOR_CLASS_MAP[ 417 | feature_type] 418 | return feature_extractor_class( 419 | is_training, first_stage_features_stride, 420 | batch_norm_trainable) 421 | 422 | 423 | def _build_faster_rcnn_model(frcnn_config, is_training, add_summaries): 424 | """Builds a Faster R-CNN or R-FCN detection model based on the model config. 425 | 426 | Builds R-FCN model if the second_stage_box_predictor in the config is of type 427 | `rfcn_box_predictor` else builds a Faster R-CNN model. 428 | 429 | Args: 430 | frcnn_config: A faster_rcnn.proto object containing the config for the 431 | desired FasterRCNNMetaArch or RFCNMetaArch. 432 | is_training: True if this model is being built for training purposes. 433 | add_summaries: Whether to add tf summaries in the model. 434 | 435 | Returns: 436 | FasterRCNNMetaArch based on the config. 437 | 438 | Raises: 439 | ValueError: If frcnn_config.type is not recognized (i.e. not registered in 440 | model_class_map). 441 | """ 442 | num_classes = frcnn_config.num_classes 443 | image_resizer_fn = image_resizer_builder.build(frcnn_config.image_resizer) 444 | 445 | is_keras = (frcnn_config.feature_extractor.type in 446 | FASTER_RCNN_KERAS_FEATURE_EXTRACTOR_CLASS_MAP) 447 | 448 | if is_keras: 449 | feature_extractor = _build_faster_rcnn_keras_feature_extractor( 450 | frcnn_config.feature_extractor, is_training, 451 | inplace_batchnorm_update=frcnn_config.inplace_batchnorm_update) 452 | else: 453 | feature_extractor = _build_faster_rcnn_feature_extractor( 454 | frcnn_config.feature_extractor, is_training, 455 | inplace_batchnorm_update=frcnn_config.inplace_batchnorm_update) 456 | 457 | number_of_stages = frcnn_config.number_of_stages 458 | first_stage_anchor_generator = anchor_generator_builder.build( 459 | frcnn_config.first_stage_anchor_generator) 460 | 461 | first_stage_target_assigner = target_assigner.create_target_assigner( 462 | 'FasterRCNN', 463 | 'proposal', 464 | use_matmul_gather=frcnn_config.use_matmul_gather_in_matcher) 465 | first_stage_atrous_rate = frcnn_config.first_stage_atrous_rate 466 | if is_keras: 467 | first_stage_box_predictor_arg_scope_fn = ( 468 | hyperparams_builder.KerasLayerHyperparams( 469 | frcnn_config.first_stage_box_predictor_conv_hyperparams)) 470 | else: 471 | first_stage_box_predictor_arg_scope_fn = hyperparams_builder.build( 472 | frcnn_config.first_stage_box_predictor_conv_hyperparams, is_training) 473 | first_stage_box_predictor_kernel_size = ( 474 | frcnn_config.first_stage_box_predictor_kernel_size) 475 | first_stage_box_predictor_depth = frcnn_config.first_stage_box_predictor_depth 476 | first_stage_minibatch_size = frcnn_config.first_stage_minibatch_size 477 | use_static_shapes = frcnn_config.use_static_shapes and ( 478 | frcnn_config.use_static_shapes_for_eval or is_training) 479 | first_stage_sampler = sampler.BalancedPositiveNegativeSampler( 480 | positive_fraction=frcnn_config.first_stage_positive_balance_fraction, 481 | 
is_static=(frcnn_config.use_static_balanced_label_sampler and 482 | use_static_shapes)) 483 | first_stage_max_proposals = frcnn_config.first_stage_max_proposals 484 | if (frcnn_config.first_stage_nms_iou_threshold < 0 or 485 | frcnn_config.first_stage_nms_iou_threshold > 1.0): 486 | raise ValueError('iou_threshold not in [0, 1.0].') 487 | if (is_training and frcnn_config.second_stage_batch_size > 488 | first_stage_max_proposals): 489 | raise ValueError('second_stage_batch_size should be no greater than ' 490 | 'first_stage_max_proposals.') 491 | first_stage_non_max_suppression_fn = functools.partial( 492 | post_processing.batch_multiclass_non_max_suppression, 493 | score_thresh=frcnn_config.first_stage_nms_score_threshold, 494 | iou_thresh=frcnn_config.first_stage_nms_iou_threshold, 495 | max_size_per_class=frcnn_config.first_stage_max_proposals, 496 | max_total_size=frcnn_config.first_stage_max_proposals, 497 | use_static_shapes=use_static_shapes) 498 | first_stage_loc_loss_weight = ( 499 | frcnn_config.first_stage_localization_loss_weight) 500 | first_stage_obj_loss_weight = frcnn_config.first_stage_objectness_loss_weight 501 | 502 | initial_crop_size = frcnn_config.initial_crop_size 503 | maxpool_kernel_size = frcnn_config.maxpool_kernel_size 504 | maxpool_stride = frcnn_config.maxpool_stride 505 | 506 | second_stage_target_assigner = target_assigner.create_target_assigner( 507 | 'FasterRCNN', 508 | 'detection', 509 | use_matmul_gather=frcnn_config.use_matmul_gather_in_matcher) 510 | if is_keras: 511 | second_stage_box_predictor = box_predictor_builder.build_keras( 512 | hyperparams_builder.KerasLayerHyperparams, 513 | freeze_batchnorm=False, 514 | inplace_batchnorm_update=False, 515 | num_predictions_per_location_list=[1], 516 | box_predictor_config=frcnn_config.second_stage_box_predictor, 517 | is_training=is_training, 518 | num_classes=num_classes) 519 | else: 520 | second_stage_box_predictor = box_predictor_builder.build( 521 | hyperparams_builder.build, 522 | frcnn_config.second_stage_box_predictor, 523 | is_training=is_training, 524 | num_classes=num_classes) 525 | second_stage_batch_size = frcnn_config.second_stage_batch_size 526 | second_stage_sampler = sampler.BalancedPositiveNegativeSampler( 527 | positive_fraction=frcnn_config.second_stage_balance_fraction, 528 | is_static=(frcnn_config.use_static_balanced_label_sampler and 529 | use_static_shapes)) 530 | (second_stage_non_max_suppression_fn, second_stage_score_conversion_fn 531 | ) = post_processing_builder.build(frcnn_config.second_stage_post_processing) 532 | second_stage_localization_loss_weight = ( 533 | frcnn_config.second_stage_localization_loss_weight) 534 | second_stage_classification_loss = ( 535 | losses_builder.build_faster_rcnn_classification_loss( 536 | frcnn_config.second_stage_classification_loss)) 537 | second_stage_classification_loss_weight = ( 538 | frcnn_config.second_stage_classification_loss_weight) 539 | second_stage_mask_prediction_loss_weight = ( 540 | frcnn_config.second_stage_mask_prediction_loss_weight) 541 | 542 | hard_example_miner = None 543 | if frcnn_config.HasField('hard_example_miner'): 544 | hard_example_miner = losses_builder.build_hard_example_miner( 545 | frcnn_config.hard_example_miner, 546 | second_stage_classification_loss_weight, 547 | second_stage_localization_loss_weight) 548 | 549 | crop_and_resize_fn = ( 550 | ops.matmul_crop_and_resize if frcnn_config.use_matmul_crop_and_resize 551 | else ops.native_crop_and_resize) 552 | clip_anchors_to_image = ( 553 | 
frcnn_config.clip_anchors_to_image) 554 | 555 | common_kwargs = { 556 | 'is_training': is_training, 557 | 'num_classes': num_classes, 558 | 'image_resizer_fn': image_resizer_fn, 559 | 'feature_extractor': feature_extractor, 560 | 'number_of_stages': number_of_stages, 561 | 'first_stage_anchor_generator': first_stage_anchor_generator, 562 | 'first_stage_target_assigner': first_stage_target_assigner, 563 | 'first_stage_atrous_rate': first_stage_atrous_rate, 564 | 'first_stage_box_predictor_arg_scope_fn': 565 | first_stage_box_predictor_arg_scope_fn, 566 | 'first_stage_box_predictor_kernel_size': 567 | first_stage_box_predictor_kernel_size, 568 | 'first_stage_box_predictor_depth': first_stage_box_predictor_depth, 569 | 'first_stage_minibatch_size': first_stage_minibatch_size, 570 | 'first_stage_sampler': first_stage_sampler, 571 | 'first_stage_non_max_suppression_fn': first_stage_non_max_suppression_fn, 572 | 'first_stage_max_proposals': first_stage_max_proposals, 573 | 'first_stage_localization_loss_weight': first_stage_loc_loss_weight, 574 | 'first_stage_objectness_loss_weight': first_stage_obj_loss_weight, 575 | 'second_stage_target_assigner': second_stage_target_assigner, 576 | 'second_stage_batch_size': second_stage_batch_size, 577 | 'second_stage_sampler': second_stage_sampler, 578 | 'second_stage_non_max_suppression_fn': 579 | second_stage_non_max_suppression_fn, 580 | 'second_stage_score_conversion_fn': second_stage_score_conversion_fn, 581 | 'second_stage_localization_loss_weight': 582 | second_stage_localization_loss_weight, 583 | 'second_stage_classification_loss': 584 | second_stage_classification_loss, 585 | 'second_stage_classification_loss_weight': 586 | second_stage_classification_loss_weight, 587 | 'hard_example_miner': hard_example_miner, 588 | 'add_summaries': add_summaries, 589 | 'crop_and_resize_fn': crop_and_resize_fn, 590 | 'clip_anchors_to_image': clip_anchors_to_image, 591 | 'use_static_shapes': use_static_shapes, 592 | 'resize_masks': frcnn_config.resize_masks 593 | } 594 | 595 | if (isinstance(second_stage_box_predictor, 596 | rfcn_box_predictor.RfcnBoxPredictor) or 597 | isinstance(second_stage_box_predictor, 598 | rfcn_keras_box_predictor.RfcnKerasBoxPredictor)): 599 | return rfcn_meta_arch.RFCNMetaArch( 600 | second_stage_rfcn_box_predictor=second_stage_box_predictor, 601 | **common_kwargs) 602 | else: 603 | return faster_rcnn_meta_arch.FasterRCNNMetaArch( 604 | initial_crop_size=initial_crop_size, 605 | maxpool_kernel_size=maxpool_kernel_size, 606 | maxpool_stride=maxpool_stride, 607 | second_stage_mask_rcnn_box_predictor=second_stage_box_predictor, 608 | second_stage_mask_prediction_loss_weight=( 609 | second_stage_mask_prediction_loss_weight), 610 | **common_kwargs) 611 | -------------------------------------------------------------------------------- /retinanet.py: -------------------------------------------------------------------------------- 1 | """Contains the definition of the RetinaNet architecture. 2 | 3 | As described by Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He et al. 4 | Feature Pyramid Networks for Object Detection.
arXiv: 1612.03144 5 | 6 | Example usage (resnet101 backbone, input shape [batch, 224, 224, 3]): 7 | features = retinanet_fpn(inputs, 8 | block_layers=[3, 4, 23, 3], 9 | depth=256, 10 | is_training=False) 11 | """ 12 | import tensorflow as tf 13 | import math 14 | from object_detection.utils.shape_utils import combined_static_and_dynamic_shape 15 | 16 | # tf.enable_eager_execution() 17 | BN_PARAMS = {"bn_decay": 0.997, 18 | "bn_epsilon": 1e-4} 19 | 20 | 21 | # define number of layers of each block for different architecture 22 | RESNET_ARCH_BLOCK = {"resnet50": [3, 4, 6, 3], 23 | "resnet101": [3, 4, 23, 3]} 24 | 25 | 26 | def nearest_neighbor_upsampling(input_tensor, scale): 27 | """Nearest neighbor upsampling implementation. 28 | NOTE: See TensorFlow Object Detection API utils.ops 29 | Args: 30 | input_tensor: A float32 tensor of size [batch, height_in, width_in, channels]. 31 | scale: An integer multiple to scale resolution of input data. 32 | Returns: 33 | upsample_input: A float32 tensor of size [batch, height_in*scale, width_in*scale, channels]. 34 | """ 35 | with tf.name_scope('nearest_neighbor_upsampling'): 36 | (batch_size, h, w, c) = combined_static_and_dynamic_shape(input_tensor) 37 | output_tensor = tf.reshape(input_tensor, [batch_size, h, 1, w, 1, c]) * tf.ones( 38 | [1, 1, scale, 1, scale, 1], dtype=input_tensor.dtype) 39 | return tf.reshape(output_tensor, [batch_size, h*scale, w*scale, c]) 40 | 41 | 42 | def conv2d_same(inputs, depth, kernel_size, strides, scope=None): 43 | with tf.name_scope(scope, None): 44 | if strides == 1: 45 | return tf.layers.conv2d(inputs, depth, kernel_size, padding='SAME') 46 | else: 47 | pad_total = kernel_size - 1 48 | pad_beg = pad_total // 2 49 | pad_end = pad_total - pad_beg 50 | inputs = tf.pad(inputs, [[0, 0], [pad_beg, pad_end], [pad_beg, pad_end], [0, 0]]) 51 | return tf.layers.conv2d(inputs, 52 | depth, 53 | kernel_size, 54 | strides=strides, 55 | padding='VALID', 56 | use_bias=False, 57 | kernel_initializer=tf.variance_scaling_initializer()) 58 | 59 | 60 | def bn_with_relu(inputs, is_training, relu=True, init_zero=False, name=None): 61 | if not init_zero: 62 | gamma_init = tf.ones_initializer() 63 | else: 64 | gamma_init = tf.zeros_initializer() 65 | inputs = tf.layers.batch_normalization(inputs, 66 | training=is_training, 67 | momentum=BN_PARAMS["bn_decay"], 68 | epsilon=BN_PARAMS["bn_epsilon"], 69 | scale=True, 70 | fused=True, 71 | gamma_initializer=gamma_init, 72 | name=name) 73 | if relu: 74 | inputs = tf.nn.relu(inputs) 75 | return inputs 76 | 77 | 78 | def bottleneck(inputs, depth, strides, is_training, projection=False, scope=None): 79 | """Bottleneck residual unit variant with BN after convolutions. 80 | When stacking these units into ResNet blocks, downsampling is performed by 81 | the first unit of a block, which uses strides=2 and a projection shortcut 82 | 83 | Args: 84 | inputs: A tensor of size [batch_size, height, width, channels] (after BN) 85 | depth: The depth of the block unit output 86 | strides: the ResNet unit's stride. Determines the amount of downsampling of 87 | the unit's output compared to its input 88 | is_training: indicate training state for BN layer 89 | projection: if this block will use a projection.
True for the first unit in each block group 90 | scope: Optional variable scope 91 | 92 | Returns: 93 | The ResNet unit output 94 | """ 95 | with tf.variable_scope(scope, 'bottleneck', [inputs]) as sc: 96 | # shortcut connection 97 | shortcut = inputs 98 | depth_out = depth * 4 99 | if projection: 100 | shortcut = conv2d_same(shortcut, depth_out, kernel_size=1, strides=strides, scope='shortcut') 101 | shortcut = bn_with_relu(shortcut, is_training, relu=False) 102 | # layer1 103 | residual = conv2d_same(inputs, depth, kernel_size=1, strides=1, scope='conv1') 104 | residual = bn_with_relu(residual, is_training) 105 | # layer 2 106 | residual = conv2d_same(residual, depth, kernel_size=3, strides=strides, scope='conv2') 107 | residual = bn_with_relu(residual, is_training) 108 | # layer 3 109 | residual = conv2d_same(residual, depth_out, kernel_size=1, strides=1, scope='conv3') 110 | residual = bn_with_relu(residual, is_training, relu=False, init_zero=True) 111 | output = shortcut + residual 112 | return tf.nn.relu(output) 113 | 114 | 115 | def stack_bottleneck(inputs, layers, depth, strides, is_training, scope=None): 116 | """ Stack bottleneck units into a ResNet block 117 | 118 | This function creates scopes for the ResNet in the form of 'block_name/unit_1', 'block_name/unit_2', etc. 119 | Args: 120 | layers: number of layers in this block 121 | """ 122 | with tf.variable_scope(scope, 'block', [inputs]) as sc: 123 | inputs = bottleneck(inputs, depth, strides=strides, is_training=is_training, projection=True) 124 | for i in range(1, layers): 125 | layer_scope = "unit_{}".format(i) 126 | inputs = bottleneck(inputs, depth, strides=1, is_training=is_training, scope=layer_scope) 127 | return inputs 128 | 129 | 130 | def retinanet_fpn(inputs, 131 | block_layers, 132 | depth=256, 133 | is_training=True, 134 | scope=None): 135 | """ 136 | Generator for RetinaNet FPN models. A small modification of the original FPN model that returns layers 137 | {P3, P4, P5, P6, P7}. See paper Focal Loss for Dense Object Detection. arxiv: 1708.02002 138 | 139 | P2 is discarded and P6 is obtained via 3x3 stride-2 conv on c5; P7 is computed by applying ReLU followed by 140 | 3x3 stride-2 conv on P6.
P7 is added to improve detection of large objects. 141 | 142 | Returns: 143 | 5 feature map tensors: {P3, P4, P5, P6, P7} 144 | """ 145 | with tf.variable_scope(scope, 'retinanet_fpn', [inputs]) as sc: 146 | net = conv2d_same(inputs, 64, kernel_size=7, strides=2, scope='conv1') 147 | net = bn_with_relu(net, is_training) 148 | net = tf.layers.max_pooling2d(net, pool_size=3, strides=2, padding='SAME', name='pool1') 149 | # Bottom up 150 | # block 1, down-sampling is done in conv3_1, conv4_1, conv5_1 151 | p2 = stack_bottleneck(net, layers=block_layers[0], depth=64, strides=1, is_training=is_training) 152 | # block 2 153 | p3 = stack_bottleneck(p2, layers=block_layers[1], depth=128, strides=2, is_training=is_training) 154 | # block 3 155 | p4 = stack_bottleneck(p3, layers=block_layers[2], depth=256, strides=2, is_training=is_training) 156 | # block 4 157 | p5 = stack_bottleneck(p4, layers=block_layers[3], depth=512, strides=2, is_training=is_training) 158 | # lateral layer 159 | l3 = tf.layers.conv2d(p3, filters=depth, kernel_size=1, strides=1, name='l3', padding='SAME') 160 | l4 = tf.layers.conv2d(p4, filters=depth, kernel_size=1, strides=1, name='l4', padding='SAME') 161 | l5 = tf.layers.conv2d(p5, filters=depth, kernel_size=1, strides=1, name='l5', padding='SAME') 162 | # Top down 163 | p4 = nearest_neighbor_upsampling(l5, 2) + l4 164 | p3 = nearest_neighbor_upsampling(p4, 2) + l3 165 | # add post-hoc conv layers 166 | p3 = tf.layers.conv2d(p3, filters=depth, kernel_size=3, strides=1, padding='SAME', name='post-hoc-d3') 167 | p4 = tf.layers.conv2d(p4, filters=depth, kernel_size=3, strides=1, padding='SAME', name='post-hoc-d4') 168 | p5 = tf.layers.conv2d(l5, filters=depth, kernel_size=3, strides=1, padding='SAME', name='post-hoc-d5') 169 | # coarse layer: 6, 7 170 | # p6 171 | p6 = tf.layers.conv2d(p5, filters=depth, kernel_size=3, strides=2, name='conv6', padding='SAME') 172 | p6 = tf.nn.relu(p6) 173 | # P7 174 | p7 = tf.layers.conv2d(p6, filters=depth, kernel_size=3, strides=2, name='conv7', padding='SAME') 175 | # add normalization to each layer 176 | features = {3: p3, 177 | 4: p4, 178 | 5: p5, 179 | 6: p6, 180 | 7: p7} 181 | for layer in features: 182 | features[layer] = tf.layers.batch_normalization(features[layer], 183 | training=is_training, 184 | momentum=BN_PARAMS["bn_decay"], 185 | epsilon=BN_PARAMS["bn_epsilon"], 186 | center=True, 187 | scale=True, 188 | fused=True, 189 | name='p{}-bn'.format(layer)) 190 | return features 191 | 192 | 193 | def share_weight_class_net(inputs, level, num_classes, num_anchors_per_loc, num_layers_before_predictor=4, is_training=True): 194 | """ 195 | Net for predicting class labels 196 | NOTE: Shares the same weights when called more than once on different feature maps 197 | Args: 198 | inputs: feature map with shape (batch_size, h, w, channel) 199 | level: which feature map 200 | num_classes: number of predicted classes 201 | num_anchors_per_loc: number of anchors at each spatial location in feature map 202 | num_layers_before_predictor: number of the additional conv layers before the predictor.
203 | is_training: whether the net is in training mode 204 | Returns: 205 | feature map with shape (batch_size, h, w, num_classes*num_anchors_per_loc) 206 | """ 207 | for i in range(num_layers_before_predictor): 208 | inputs = tf.layers.conv2d(inputs, filters=256, kernel_size=3, strides=1, 209 | kernel_initializer=tf.random_normal_initializer(stddev=0.01), 210 | bias_initializer=tf.zeros_initializer(), 211 | padding="SAME", 212 | name='class_{}'.format(i)) 213 | inputs = bn_with_relu(inputs, is_training, relu=True, init_zero=False, name="class_{}_bn_level_{}".format(i, level)) 214 | outputs = tf.layers.conv2d(inputs, 215 | filters=num_classes*num_anchors_per_loc, 216 | kernel_size=3, 217 | bias_initializer=tf.constant_initializer(-math.log((1 - 0.01) / 0.01)), 218 | kernel_initializer=tf.random_normal_initializer(stddev=0.01), 219 | padding="SAME", 220 | name="class_pred") 221 | return outputs 222 | 223 | 224 | def share_weight_box_net(inputs, level, num_anchors_per_loc, num_layers_before_predictor=4, is_training=True): 225 | """ 226 | Similar to class_net with output feature shape (batch_size, h, w, num_anchors_per_loc*4) 227 | """ 228 | for i in range(num_layers_before_predictor): 229 | inputs = tf.layers.conv2d(inputs, filters=256, kernel_size=3, strides=1, 230 | bias_initializer=tf.zeros_initializer(), 231 | kernel_initializer=tf.random_normal_initializer(stddev=0.01), 232 | padding="SAME", 233 | name='box_{}'.format(i)) 234 | inputs = bn_with_relu(inputs, is_training, relu=True, init_zero=False, name="box_{}_bn_level_{}".format(i, level)) 235 | outputs = tf.layers.conv2d(inputs, 236 | filters=4*num_anchors_per_loc, 237 | kernel_size=3, 238 | kernel_initializer=tf.random_normal_initializer(stddev=0.01), 239 | padding="SAME", 240 | name="box_pred") 241 | return outputs 242 | 243 | 244 | def retinanet(images, num_classes, num_anchors_per_loc, resnet_arch='resnet50', is_training=True): 245 | """ 246 | Get box prediction features and class prediction features from given images 247 | Args: 248 | images: input batch of images with shape (batch_size, h, w, 3) 249 | num_classes: number of classes for prediction 250 | num_anchors_per_loc: number of anchors at each feature map spatial location 251 | resnet_arch: name of the resnet architecture to use 252 | is_training: indicate training or not 253 | Returns: 254 | prediction dict holding the following items: 255 | box_pred: tensor concatenated from each feature map with shape (batch_size, num_anchors, 4) 256 | cls_pred: tensor concatenated from each feature map with shape (batch_size, num_anchors, num_classes+1) 257 | feature_map_list: list of feature map tensors 258 | """ 259 | assert resnet_arch in list(RESNET_ARCH_BLOCK.keys()), "resnet architecture not defined" 260 | with tf.variable_scope('retinanet'): 261 | batch_size = combined_static_and_dynamic_shape(images)[0] 262 | features = retinanet_fpn(images, block_layers=RESNET_ARCH_BLOCK[resnet_arch], is_training=is_training) 263 | class_pred = [] 264 | box_pred = [] 265 | feature_map_list = [] 266 | num_slots = num_classes + 1 267 | with tf.variable_scope('class_net', reuse=tf.AUTO_REUSE): 268 | for level in features.keys(): 269 | class_outputs = share_weight_class_net(features[level], level, 270 | num_slots, 271 | num_anchors_per_loc, 272 | is_training=is_training) 273 | class_outputs = tf.reshape(class_outputs, shape=[batch_size, -1, num_slots]) 274 | class_pred.append(class_outputs) 275 | feature_map_list.append(features[level]) 276 | with tf.variable_scope('box_net', reuse=tf.AUTO_REUSE): 277 | for level in features.keys(): 278 |
box_outputs = share_weight_box_net(features[level], level, num_anchors_per_loc, is_training=is_training) 279 | box_outputs = tf.reshape(box_outputs, shape=[batch_size, -1, 4]) 280 | box_pred.append(box_outputs) 281 | return dict(box_pred=tf.concat(box_pred, axis=1), 282 | cls_pred=tf.concat(class_pred, axis=1), 283 | feature_map_list=feature_map_list) 284 | 285 | -------------------------------------------------------------------------------- /retinanet_50_train.config: -------------------------------------------------------------------------------- 1 | # SSD with Resnet 50 v1 FPN feature extractor, shared box predictor and focal 2 | # loss (a.k.a Retinanet). 3 | # See Lin et al, https://arxiv.org/abs/1708.02002 4 | # Trained on COCO, initialized from Imagenet classification checkpoint 5 | 6 | # Achieves 35.2 mAP on COCO14 minival dataset. Doubling the number of training 7 | # steps to 50k gets 36.9 mAP 8 | 9 | # This config is TPU compatible 10 | 11 | model { 12 | ssd { 13 | inplace_batchnorm_update: true 14 | freeze_batchnorm: false 15 | num_classes: 90 16 | box_coder { 17 | faster_rcnn_box_coder { 18 | y_scale: 10.0 19 | x_scale: 10.0 20 | height_scale: 5.0 21 | width_scale: 5.0 22 | } 23 | } 24 | matcher { 25 | argmax_matcher { 26 | matched_threshold: 0.5 27 | unmatched_threshold: 0.5 28 | ignore_thresholds: false 29 | negatives_lower_than_unmatched: true 30 | force_match_for_each_row: true 31 | use_matmul_gather: true 32 | } 33 | } 34 | similarity_calculator { 35 | iou_similarity { 36 | } 37 | } 38 | encode_background_as_zeros: true 39 | anchor_generator { 40 | multiscale_anchor_generator { 41 | min_level: 3 42 | max_level: 7 43 | anchor_scale: 4.0 44 | aspect_ratios: [1.0, 2.0, 0.5] 45 | scales_per_octave: 2 46 | } 47 | } 48 | image_resizer { 49 | fixed_shape_resizer { 50 | height: 320 51 | width: 320 52 | } 53 | } 54 | box_predictor { 55 | weight_shared_convolutional_box_predictor { 56 | depth: 256 57 | class_prediction_bias_init: -4.6 58 | conv_hyperparams { 59 | activation: RELU_6, 60 | regularizer { 61 | l2_regularizer { 62 | weight: 0.0004 63 | } 64 | } 65 | initializer { 66 | random_normal_initializer { 67 | stddev: 0.01 68 | mean: 0.0 69 | } 70 | } 71 | batch_norm { 72 | scale: true, 73 | decay: 0.997, 74 | epsilon: 0.001, 75 | } 76 | } 77 | num_layers_before_predictor: 4 78 | kernel_size: 3 79 | } 80 | } 81 | feature_extractor { 82 | type: 'retinanet_50' 83 | min_depth: 16 84 | depth_multiplier: 1.0 85 | override_base_feature_extractor_hyperparams: true 86 | } 87 | loss { 88 | classification_loss { 89 | weighted_sigmoid_focal { 90 | alpha: 0.25 91 | gamma: 2.0 92 | } 93 | } 94 | localization_loss { 95 | weighted_smooth_l1 { 96 | } 97 | } 98 | classification_weight: 1.0 99 | localization_weight: 1.0 100 | } 101 | normalize_loss_by_num_matches: true 102 | normalize_loc_loss_by_codesize: true 103 | post_processing { 104 | batch_non_max_suppression { 105 | score_threshold: 1e-8 106 | iou_threshold: 0.6 107 | max_detections_per_class: 100 108 | max_total_detections: 100 109 | } 110 | score_converter: SIGMOID 111 | } 112 | } 113 | } 114 | 115 | train_config: { 116 | #fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt" 117 | batch_size: 4 118 | sync_replicas: true 119 | startup_delay_steps: 0 120 | replicas_to_aggregate: 8 121 | num_steps: 25000 122 | data_augmentation_options { 123 | random_horizontal_flip { 124 | } 125 | } 126 | data_augmentation_options { 127 | random_crop_image { 128 | min_object_covered: 0.0 129 | min_aspect_ratio: 0.75 130 | max_aspect_ratio: 3.0 131 | 
min_area: 0.75 132 | max_area: 1.0 133 | overlap_thresh: 0.0 134 | } 135 | } 136 | optimizer { 137 | momentum_optimizer: { 138 | learning_rate: { 139 | cosine_decay_learning_rate { 140 | learning_rate_base: .04 141 | total_steps: 25000 142 | warmup_learning_rate: .013333 143 | warmup_steps: 2000 144 | } 145 | } 146 | momentum_optimizer_value: 0.9 147 | } 148 | use_moving_average: false 149 | } 150 | max_number_of_boxes: 100 151 | unpad_groundtruth_tensors: false 152 | } 153 | 154 | train_input_reader: { 155 | tf_record_input_reader { 156 | input_path: "/home/arkenstone/ssd_res50_fpn/test_data/0310_train_0.record" 157 | } 158 | label_map_path: "/home/arkenstone/ssd_res50_fpn/test_data/label_map.pbtxt" 159 | } 160 | 161 | eval_config: { 162 | metrics_set: "coco_detection_metrics" 163 | use_moving_averages: false 164 | num_examples: 8000 165 | } 166 | 167 | eval_input_reader: { 168 | tf_record_input_reader { 169 | input_path: "/home/arkenstone/ssd_res50_fpn/test_data/0310_train_1.record" 170 | } 171 | label_map_path: "/home/arkenstone/ssd_res50_fpn/test_data/label_map.pbtxt" 172 | shuffle: false 173 | num_readers: 1 174 | } -------------------------------------------------------------------------------- /retinanet_feature_extractor.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """RetinaNet feature extractors based on Resnet v1. 16 | 17 | See https://arxiv.org/abs/1708.02002 for details. 18 | """ 19 | 20 | import tensorflow as tf 21 | 22 | from object_detection.meta_architectures import ssd_meta_arch 23 | from object_detection.utils import context_manager 24 | from object_detection.utils import ops 25 | from object_detection.utils import shape_utils 26 | from object_detection.models.retinanet import retinanet_fpn 27 | 28 | RESNET_ARCH_BLOCK = {"resnet50": [3, 4, 6, 3], 29 | "resnet101": [3, 4, 23, 3]} 30 | 31 | class RetinaNetFeatureExtractor(ssd_meta_arch.SSDFeatureExtractor): 32 | """SSD FPN feature extractor based on Resnet v1 architecture.""" 33 | 34 | def __init__(self, 35 | is_training, 36 | depth_multiplier, 37 | min_depth, 38 | conv_hyperparams_fn, 39 | pad_to_multiple, 40 | backbone, 41 | fpn_scope_name, 42 | min_level=3, 43 | max_level=7, 44 | additional_layer_depth=256, 45 | reuse_weights=None, 46 | use_explicit_padding=False, 47 | use_depthwise=False, 48 | override_base_feature_extractor_hyperparams=False): 49 | """RetinaNet feature extractor. 50 | 51 | Args: 52 | is_training: whether the network is in training mode. 53 | depth_multiplier: float depth multiplier for feature extractor. 54 | min_depth: minimum feature extractor depth. 55 | pad_to_multiple: the nearest multiple to zero pad the input height and 56 | width dimensions to. 
57 | fpn_scope_name: scope name under which to construct the feature pyramid 58 | network. 59 | additional_layer_depth: additional feature map layer channel depth. 60 | reuse_weights: Whether to reuse variables. Default is None. 61 | use_explicit_padding: Whether to use explicit padding when extracting 62 | features. Default is False. UNUSED currently. 63 | use_depthwise: Whether to use depthwise convolutions. UNUSED currently. 64 | override_base_feature_extractor_hyperparams: Whether to override 65 | hyperparameters of the base feature extractor with the one from 66 | `conv_hyperparams_fn`. 67 | 68 | Raises: 69 | ValueError: On supplying invalid arguments for unused arguments. 70 | """ 71 | super(RetinaNetFeatureExtractor, self).__init__( 72 | is_training=is_training, 73 | depth_multiplier=depth_multiplier, 74 | min_depth=min_depth, 75 | conv_hyperparams_fn=conv_hyperparams_fn, 76 | pad_to_multiple=pad_to_multiple, 77 | reuse_weights=reuse_weights, 78 | use_explicit_padding=use_explicit_padding, 79 | use_depthwise=use_depthwise, 80 | override_base_feature_extractor_hyperparams= 81 | override_base_feature_extractor_hyperparams) 82 | if self._use_explicit_padding is True: 83 | raise ValueError('Explicit padding is not a valid option.') 84 | self._backbone = backbone 85 | self._fpn_scope_name = fpn_scope_name 86 | self._min_level = min_level 87 | self._max_level = max_level 88 | self._additional_layer_depth = additional_layer_depth 89 | 90 | def preprocess(self, resized_inputs): 91 | """SSD preprocessing. 92 | 93 | VGG style channel mean subtraction as described here: 94 | https://gist.github.com/ksimonyan/211839e770f7b538e2d8#file-readme-md. 95 | Note that if the number of channels is not equal to 3, the mean subtraction 96 | will be skipped and the original resized_inputs will be returned. 97 | 98 | Args: 99 | resized_inputs: a [batch, height, width, channels] float tensor 100 | representing a batch of images. 101 | 102 | Returns: 103 | preprocessed_inputs: a [batch, height, width, channels] float tensor 104 | representing a batch of images. 105 | """ 106 | if resized_inputs.shape.as_list()[3] == 3: 107 | channel_means = [123.68, 116.779, 103.939] 108 | return resized_inputs - [[channel_means]] 109 | else: 110 | return resized_inputs 111 | 112 | def extract_features(self, preprocessed_inputs): 113 | """Extract features from preprocessed inputs. 114 | 115 | Args: 116 | preprocessed_inputs: a [batch, height, width, channels] float tensor 117 | representing a batch of images. 118 | 119 | Returns: 120 | feature_maps: a list of tensors where the ith tensor has shape 121 | [batch, height_i, width_i, depth_i] 122 | """ 123 | preprocessed_inputs = shape_utils.check_min_image_dim( 124 | 129, preprocessed_inputs) 125 | with tf.variable_scope( 126 | self._fpn_scope_name, reuse=self._reuse_weights) as scope: 127 | if self._backbone in list(RESNET_ARCH_BLOCK.keys()): 128 | block_layers = RESNET_ARCH_BLOCK[self._backbone] 129 | else: 130 | raise ValueError("Unknown backbone found!
Only resnet50 or resnet101 is allowed!") 131 | image_features = retinanet_fpn(inputs=preprocessed_inputs, 132 | block_layers=block_layers, 133 | depth=self._additional_layer_depth, 134 | is_training=self._is_training) 135 | return [image_features[x] for x in range(self._min_level, self._max_level+1)] 136 | 137 | 138 | class RetinaNet50FeatureExtractor(RetinaNetFeatureExtractor): 139 | """Resnet 50 RetinaNet feature extractor.""" 140 | def __init__(self, 141 | is_training, 142 | depth_multiplier, 143 | min_depth, 144 | conv_hyperparams_fn, 145 | pad_to_multiple, 146 | backbone='resnet50', 147 | additional_layer_depth=256, 148 | reuse_weights=None, 149 | use_explicit_padding=False, 150 | use_depthwise=False, 151 | override_base_feature_extractor_hyperparams=False): 152 | """ 153 | Args: 154 | is_training: whether the network is in training mode. 155 | depth_multiplier: float depth multiplier for feature extractor. 156 | UNUSED currently. 157 | min_depth: minimum feature extractor depth. UNUSED Currently. 158 | pad_to_multiple: the nearest multiple to zero pad the input height and 159 | width dimensions to. 160 | additional_layer_depth: additional feature map layer channel depth. 161 | reuse_weights: Whether to reuse variables. Default is None. 162 | use_explicit_padding: Whether to use explicit padding when extracting 163 | features. Default is False. UNUSED currently. 164 | use_depthwise: Whether to use depthwise convolutions. UNUSED currently. 165 | override_base_feature_extractor_hyperparams: Whether to override 166 | hyperparameters of the base feature extractor with the one from 167 | `conv_hyperparams_fn`. 168 | """ 169 | super(RetinaNet50FeatureExtractor, self).__init__( 170 | is_training=is_training, 171 | depth_multiplier=depth_multiplier, 172 | min_depth=min_depth, 173 | conv_hyperparams_fn=conv_hyperparams_fn, 174 | pad_to_multiple=pad_to_multiple, 175 | backbone='resnet50', 176 | fpn_scope_name='retinanet50', 177 | additional_layer_depth=additional_layer_depth, 178 | reuse_weights=reuse_weights, 179 | use_explicit_padding=use_explicit_padding, 180 | use_depthwise=use_depthwise, 181 | override_base_feature_extractor_hyperparams= 182 | override_base_feature_extractor_hyperparams) 183 | 184 | class RetinaNet101FeatureExtractor(RetinaNetFeatureExtractor): 185 | """Resnet 101 RetinaNet feature extractor.""" 186 | def __init__(self, 187 | is_training, 188 | depth_multiplier, 189 | min_depth, 190 | conv_hyperparams_fn, 191 | pad_to_multiple, 192 | backbone='resnet101', 193 | additional_layer_depth=256, 194 | reuse_weights=None, 195 | use_explicit_padding=False, 196 | use_depthwise=False, 197 | override_base_feature_extractor_hyperparams=False): 198 | """ 199 | Args: 200 | is_training: whether the network is in training mode. 201 | depth_multiplier: float depth multiplier for feature extractor. 202 | UNUSED currently. 203 | min_depth: minimum feature extractor depth. UNUSED Currently. 204 | pad_to_multiple: the nearest multiple to zero pad the input height and 205 | width dimensions to. 206 | additional_layer_depth: additional feature map layer channel depth. 207 | reuse_weights: Whether to reuse variables. Default is None. 208 | use_explicit_padding: Whether to use explicit padding when extracting 209 | features. Default is False. UNUSED currently. 210 | use_depthwise: Whether to use depthwise convolutions. UNUSED currently. 
211 | override_base_feature_extractor_hyperparams: Whether to override 212 | hyperparameters of the base feature extractor with the one from 213 | `conv_hyperparams_fn`. 214 | """ 215 | super(RetinaNet101FeatureExtractor, self).__init__( 216 | is_training=is_training, 217 | depth_multiplier=depth_multiplier, 218 | min_depth=min_depth, 219 | conv_hyperparams_fn=conv_hyperparams_fn, 220 | pad_to_multiple=pad_to_multiple, 221 | backbone='resnet101', 222 | fpn_scope_name='retinanet101', 223 | additional_layer_depth=additional_layer_depth, 224 | reuse_weights=reuse_weights, 225 | use_explicit_padding=use_explicit_padding, 226 | use_depthwise=use_depthwise, 227 | override_base_feature_extractor_hyperparams= 228 | override_base_feature_extractor_hyperparams) -------------------------------------------------------------------------------- /train.sh: -------------------------------------------------------------------------------- 1 | export PYTHONPATH="$PYTHONPATH:/PATH/TO/models/research:/PATH/TO/models/research/slim" 2 | 3 | python3 model_main.py \ 4 | --model_dir="train" --pipeline_config_path="retinanet_50_train.config" --------------------------------------------------------------------------------
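As an optional pre-flight check before launching `train.sh`, the sketch below builds the raw RetinaNet graph from `retinanet.py` on a dummy batch and prints the prediction tensors. This is an illustrative sketch only (not a file in this repo); it assumes TF 1.x and that `retinanet.py` has been copied to `object_detection/models/` as described in the README:

```python3
# Hypothetical smoke test, not included in this repository.
import tensorflow as tf
from object_detection.models.retinanet import retinanet

# 320x320 matches the fixed_shape_resizer in retinanet_50_train.config.
images = tf.placeholder(tf.float32, shape=[1, 320, 320, 3])

# 3 aspect ratios x 2 scales per octave = 6 anchors per location, per the
# multiscale_anchor_generator block in the config.
predictions = retinanet(images,
                        num_classes=90,
                        num_anchors_per_loc=6,
                        resnet_arch='resnet50',
                        is_training=False)

print(predictions['box_pred'])   # shape (1, num_anchors, 4)
print(predictions['cls_pred'])   # shape (1, num_anchors, 91): classes + background
```

If the graph builds and the shapes look right, the feature extractor should also construct cleanly when `model_main.py` builds the full SSD meta-architecture.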