├── .gitignore ├── README.md ├── loss.py ├── model_builder.py ├── retinanet.py ├── retinanet_50_train.config ├── retinanet_feature_extractor.py └── train.sh /.gitignore: -------------------------------------------------------------------------------- 1 | # Created by .ignore support plugin (hsz.mobi) 2 | *.pyc 3 | .idea 4 | model1 5 | data -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## RetinaNet TensorFlow version 2 | Unofficial implementation of [RetinaNet](https://arxiv.org/abs/1708.02002) in TensorFlow. **NOTE** this project is written for practice, so please don't hesitate to report an issue if you find something wrong. 3 | 4 | The TF models [object detection api](https://github.com/tensorflow/models/tree/master/research/object_detection) has integrated FPN into its framework, and ssd_resnet50_v1_fpn is effectively RetinaNet. You can dig into ssd_resnet_v1_fpn_feature_extractor in `models` for coding details. 5 | 6 | Since this work was built on the object detection api from the beginning, only the RetinaNet backbone, the loss and a custom retinanet_feature_extractor are kept here in the api's standard format. To make it work, here are the steps: 7 | - Download the tensorflow [models](https://github.com/tensorflow/models) repository and install the object detection api following [these instructions](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/installation.md). 8 | - Add the retinanet feature extractors to `model_builder.py`: 9 | ```python3 10 | from object_detection.models.retinanet_feature_extractor import RetinaNet50FeatureExtractor, RetinaNet101FeatureExtractor 11 | 12 | SSD_FEATURE_EXTRACTOR_CLASS_MAP = { 13 | ... 14 | 'retinanet_50': RetinaNet50FeatureExtractor, 15 | 'retinanet_101': RetinaNet101FeatureExtractor, 16 | } 17 | ``` 18 | - Put `retinanet_feature_extractor.py` and `retinanet.py` under `models`. 19 | - Modify `retinanet_50_train.config` and `train.sh` with your custom settings and data inputs, then run `train.sh` to start training. A minimal registration sanity check is sketched below.
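After registering the extractors, a quick way to confirm that the `retinanet_50` mapping resolves is to parse the pipeline config and build the model directly. This is an illustrative sketch only (not a file in this repo); it assumes TF 1.x and that the object detection api is on your `PYTHONPATH`:

```python3
# Hypothetical sanity check, not part of this repository.
import tensorflow as tf
from google.protobuf import text_format
from object_detection.builders import model_builder
from object_detection.protos import pipeline_pb2

# Parse the training pipeline config shipped with this repo.
pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
with tf.gfile.GFile('retinanet_50_train.config', 'r') as f:
    text_format.Merge(f.read(), pipeline_config)

# 'retinanet_50' should now resolve through SSD_FEATURE_EXTRACTOR_CLASS_MAP.
detection_model = model_builder.build(pipeline_config.model, is_training=True)
print(type(detection_model))  # expect an SSDMetaArch instance
```

If the map entry is missing, `model_builder.build` raises `ValueError: Unknown ssd feature_extractor: retinanet_50`.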
-------------------------------------------------------------------------------- /loss.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | def focal_loss(logits, onehot_labels, weights, alpha=0.25, gamma=2.0): 5 | """ 6 | Compute sigmoid focal loss between logits and onehot labels: focal loss = -alpha_t*(1-pt)^gamma*log(pt) 7 | 8 | Args: 9 | onehot_labels: onehot labels with shape (batch_size, num_anchors, num_classes) 10 | logits: last layer feature output with shape (batch_size, num_anchors, num_classes) 11 | weights: weight tensor returned from target assigner with shape [batch_size, num_anchors] 12 | alpha: The hyperparameter for balancing positive and negative examples, default is 0.25 13 | gamma: The hyperparameter for down-weighting easy examples, default is 2.0 14 | Returns: 15 | a scalar tensor of the total classification focal loss 16 | """ 17 | with tf.name_scope("focal_loss"): 18 | logits = tf.cast(logits, tf.float32) 19 | onehot_labels = tf.cast(onehot_labels, tf.float32) 20 | ce = tf.nn.sigmoid_cross_entropy_with_logits(labels=onehot_labels, logits=logits) 21 | predictions = tf.sigmoid(logits) 22 | predictions_pt = tf.where(tf.equal(onehot_labels, 1), predictions, 1.-predictions) 23 | # alpha_t: alpha for positive anchors, (1 - alpha) for negative anchors 24 | alpha_t = tf.scalar_mul(alpha, tf.ones_like(onehot_labels, dtype=tf.float32)) 25 | alpha_t = tf.where(tf.equal(onehot_labels, 1.0), alpha_t, 1-alpha_t) 26 | weighted_loss = ce * tf.pow(1-predictions_pt, gamma) * alpha_t * tf.expand_dims(weights, axis=2) 27 | return tf.reduce_sum(weighted_loss) 28 | 29 | 30 | def regression_loss(pred_boxes, gt_boxes, weights, delta=1.0): 31 | """ 32 | Regression loss (smooth L1 loss, also known as Huber loss) 33 | 34 | Args: 35 | pred_boxes: [batch_size, num_anchors, 4] 36 | gt_boxes: [batch_size, num_anchors, 4] 37 | weights: Tensor of weights multiplied by loss with shape [batch_size, num_anchors] 38 | delta: delta for smooth L1 loss 39 | Returns: 40 | a box regression loss scalar 41 | """ 42 | loss = tf.reduce_sum(tf.losses.huber_loss(predictions=pred_boxes, 43 | labels=gt_boxes, 44 | delta=delta, 45 | weights=tf.expand_dims(weights, axis=2), 46 | scope='box_loss', 47 | reduction=tf.losses.Reduction.NONE)) 48 | return loss 49 | 50 | 51 | -------------------------------------------------------------------------------- /model_builder.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License.
14 | # ============================================================================== 15 | 16 | """A function to build a DetectionModel from configuration.""" 17 | 18 | import functools 19 | from object_detection.builders import anchor_generator_builder 20 | from object_detection.builders import box_coder_builder 21 | from object_detection.builders import box_predictor_builder 22 | from object_detection.builders import hyperparams_builder 23 | from object_detection.builders import image_resizer_builder 24 | from object_detection.builders import losses_builder 25 | from object_detection.builders import matcher_builder 26 | from object_detection.builders import post_processing_builder 27 | from object_detection.builders import region_similarity_calculator_builder as sim_calc 28 | from object_detection.core import balanced_positive_negative_sampler as sampler 29 | from object_detection.core import post_processing 30 | from object_detection.core import target_assigner 31 | from object_detection.meta_architectures import faster_rcnn_meta_arch 32 | from object_detection.meta_architectures import rfcn_meta_arch 33 | from object_detection.meta_architectures import ssd_meta_arch 34 | from object_detection.meta_architectures import ssd_meta_arch_keras 35 | from object_detection.models import faster_rcnn_inception_resnet_v2_feature_extractor as frcnn_inc_res 36 | from object_detection.models import faster_rcnn_inception_resnet_v2_keras_feature_extractor as frcnn_inc_res_keras 37 | from object_detection.models import faster_rcnn_inception_v2_feature_extractor as frcnn_inc_v2 38 | from object_detection.models import faster_rcnn_nas_feature_extractor as frcnn_nas 39 | from object_detection.models import faster_rcnn_pnas_feature_extractor as frcnn_pnas 40 | from object_detection.models import faster_rcnn_resnet_v1_feature_extractor as frcnn_resnet_v1 41 | from object_detection.models import ssd_resnet_v1_fpn_feature_extractor as ssd_resnet_v1_fpn 42 | from object_detection.models import ssd_resnet_v1_ppn_feature_extractor as ssd_resnet_v1_ppn 43 | from object_detection.models.embedded_ssd_mobilenet_v1_feature_extractor import EmbeddedSSDMobileNetV1FeatureExtractor 44 | from object_detection.models.ssd_inception_v2_feature_extractor import SSDInceptionV2FeatureExtractor 45 | from object_detection.models.ssd_inception_v3_feature_extractor import SSDInceptionV3FeatureExtractor 46 | from object_detection.models.ssd_mobilenet_v1_feature_extractor import SSDMobileNetV1FeatureExtractor 47 | from object_detection.models.ssd_mobilenet_v1_fpn_feature_extractor import SSDMobileNetV1FpnFeatureExtractor 48 | from object_detection.models.ssd_mobilenet_v1_fpn_keras_feature_extractor import SSDMobileNetV1FpnKerasFeatureExtractor 49 | from object_detection.models.ssd_mobilenet_v1_keras_feature_extractor import SSDMobileNetV1KerasFeatureExtractor 50 | from object_detection.models.ssd_mobilenet_v1_ppn_feature_extractor import SSDMobileNetV1PpnFeatureExtractor 51 | from object_detection.models.ssd_mobilenet_v2_feature_extractor import SSDMobileNetV2FeatureExtractor 52 | from object_detection.models.ssd_mobilenet_v2_fpn_feature_extractor import SSDMobileNetV2FpnFeatureExtractor 53 | from object_detection.models.ssd_mobilenet_v2_fpn_keras_feature_extractor import SSDMobileNetV2FpnKerasFeatureExtractor 54 | from object_detection.models.ssd_mobilenet_v2_keras_feature_extractor import SSDMobileNetV2KerasFeatureExtractor 55 | from object_detection.models.ssd_pnasnet_feature_extractor import SSDPNASNetFeatureExtractor 56 | from 
object_detection.models.ssd_antialiased_resnet_v1_fpn_feature_extractor import SSDAntResnet50V1FpnFeatureExtractor 57 | from object_detection.models.retinanet_feature_extractor import RetinaNet50FeatureExtractor, RetinaNet101FeatureExtractor 58 | from object_detection.predictors import rfcn_box_predictor 59 | from object_detection.predictors import rfcn_keras_box_predictor 60 | from object_detection.predictors.heads import mask_head 61 | from object_detection.protos import model_pb2 62 | from object_detection.utils import ops 63 | 64 | # A map of names to SSD feature extractors. 65 | SSD_FEATURE_EXTRACTOR_CLASS_MAP = { 66 | 'ssd_inception_v2': SSDInceptionV2FeatureExtractor, 67 | 'ssd_inception_v3': SSDInceptionV3FeatureExtractor, 68 | 'ssd_mobilenet_v1': SSDMobileNetV1FeatureExtractor, 69 | 'ssd_mobilenet_v1_fpn': SSDMobileNetV1FpnFeatureExtractor, 70 | 'ssd_mobilenet_v1_ppn': SSDMobileNetV1PpnFeatureExtractor, 71 | 'ssd_mobilenet_v2': SSDMobileNetV2FeatureExtractor, 72 | 'ssd_mobilenet_v2_fpn': SSDMobileNetV2FpnFeatureExtractor, 73 | 'ssd_resnet18_v1_fpn': ssd_resnet_v1_fpn.SSDResnet18V1FpnFeatureExtractor, 74 | 'ssd_resnet22_v1_fpn': ssd_resnet_v1_fpn.SSDResnet22V1FpnFeatureExtractor, 75 | 'ssd_resnet50_v1_fpn': ssd_resnet_v1_fpn.SSDResnet50V1FpnFeatureExtractor, 76 | 'ssd_resnet101_v1_fpn': ssd_resnet_v1_fpn.SSDResnet101V1FpnFeatureExtractor, 77 | 'ssd_resnet152_v1_fpn': ssd_resnet_v1_fpn.SSDResnet152V1FpnFeatureExtractor, 78 | 'ssd_resnet50_v1_ppn': ssd_resnet_v1_ppn.SSDResnet50V1PpnFeatureExtractor, 79 | 'ssd_resnet101_v1_ppn': 80 | ssd_resnet_v1_ppn.SSDResnet101V1PpnFeatureExtractor, 81 | 'ssd_resnet152_v1_ppn': 82 | ssd_resnet_v1_ppn.SSDResnet152V1PpnFeatureExtractor, 83 | 'embedded_ssd_mobilenet_v1': EmbeddedSSDMobileNetV1FeatureExtractor, 84 | 'ssd_pnasnet': SSDPNASNetFeatureExtractor, 85 | 'ssd_ant_resnet50_v1_fpn': SSDAntResnet50V1FpnFeatureExtractor, 86 | 'retinanet_50': RetinaNet50FeatureExtractor, 87 | 'retinanet_101': RetinaNet101FeatureExtractor 88 | } 89 | 90 | SSD_KERAS_FEATURE_EXTRACTOR_CLASS_MAP = { 91 | 'ssd_mobilenet_v1_keras': SSDMobileNetV1KerasFeatureExtractor, 92 | 'ssd_mobilenet_v1_fpn_keras': SSDMobileNetV1FpnKerasFeatureExtractor, 93 | 'ssd_mobilenet_v2_keras': SSDMobileNetV2KerasFeatureExtractor, 94 | 'ssd_mobilenet_v2_fpn_keras': SSDMobileNetV2FpnKerasFeatureExtractor, 95 | } 96 | 97 | # A map of names to Faster R-CNN feature extractors. 98 | FASTER_RCNN_FEATURE_EXTRACTOR_CLASS_MAP = { 99 | 'faster_rcnn_nas': 100 | frcnn_nas.FasterRCNNNASFeatureExtractor, 101 | 'faster_rcnn_pnas': 102 | frcnn_pnas.FasterRCNNPNASFeatureExtractor, 103 | 'faster_rcnn_inception_resnet_v2': 104 | frcnn_inc_res.FasterRCNNInceptionResnetV2FeatureExtractor, 105 | 'faster_rcnn_inception_v2': 106 | frcnn_inc_v2.FasterRCNNInceptionV2FeatureExtractor, 107 | 'faster_rcnn_resnet50': 108 | frcnn_resnet_v1.FasterRCNNResnet50FeatureExtractor, 109 | 'faster_rcnn_resnet101': 110 | frcnn_resnet_v1.FasterRCNNResnet101FeatureExtractor, 111 | 'faster_rcnn_resnet152': 112 | frcnn_resnet_v1.FasterRCNNResnet152FeatureExtractor, 113 | } 114 | 115 | FASTER_RCNN_KERAS_FEATURE_EXTRACTOR_CLASS_MAP = { 116 | 'faster_rcnn_inception_resnet_v2_keras': 117 | frcnn_inc_res_keras.FasterRCNNInceptionResnetV2KerasFeatureExtractor, 118 | } 119 | 120 | 121 | def build(model_config, is_training, add_summaries=True): 122 | """Builds a DetectionModel based on the model config. 123 | 124 | Args: 125 | model_config: A model.proto object containing the config for the desired 126 | DetectionModel. 
127 | is_training: True if this model is being built for training purposes. 128 | add_summaries: Whether to add tensorflow summaries in the model graph. 129 | Returns: 130 | DetectionModel based on the config. 131 | 132 | Raises: 133 | ValueError: On invalid meta architecture or model. 134 | """ 135 | if not isinstance(model_config, model_pb2.DetectionModel): 136 | raise ValueError('model_config not of type model_pb2.DetectionModel.') 137 | meta_architecture = model_config.WhichOneof('model') 138 | if meta_architecture == 'ssd': 139 | return _build_ssd_model(model_config.ssd, is_training, add_summaries) 140 | if meta_architecture == 'faster_rcnn': 141 | return _build_faster_rcnn_model(model_config.faster_rcnn, is_training, 142 | add_summaries) 143 | raise ValueError('Unknown meta architecture: {}'.format(meta_architecture)) 144 | 145 | 146 | def _build_ssd_feature_extractor(feature_extractor_config, 147 | is_training, 148 | freeze_batchnorm, 149 | reuse_weights=None): 150 | """Builds a ssd_meta_arch.SSDFeatureExtractor based on config. 151 | 152 | Args: 153 | feature_extractor_config: A SSDFeatureExtractor proto config from ssd.proto. 154 | is_training: True if this feature extractor is being built for training. 155 | freeze_batchnorm: Whether to freeze batch norm parameters during 156 | training or not. When training with a small batch size (e.g. 1), it is 157 | desirable to freeze batch norm update and use pretrained batch norm 158 | params. 159 | reuse_weights: if the feature extractor should reuse weights. 160 | 161 | Returns: 162 | ssd_meta_arch.SSDFeatureExtractor based on config. 163 | 164 | Raises: 165 | ValueError: On invalid feature extractor type. 166 | """ 167 | feature_type = feature_extractor_config.type 168 | is_keras_extractor = feature_type in SSD_KERAS_FEATURE_EXTRACTOR_CLASS_MAP 169 | depth_multiplier = feature_extractor_config.depth_multiplier 170 | min_depth = feature_extractor_config.min_depth 171 | pad_to_multiple = feature_extractor_config.pad_to_multiple 172 | use_explicit_padding = feature_extractor_config.use_explicit_padding 173 | use_depthwise = feature_extractor_config.use_depthwise 174 | use_antialias = feature_extractor_config.use_antialias 175 | # parameters only for student net 176 | if "ssd_mobilenet_v3" in feature_type: 177 | network_version = feature_extractor_config.network_version 178 | min_feature_level = feature_extractor_config.min_feature_level 179 | max_feature_level = feature_extractor_config.max_feature_level 180 | additional_layer_depth = feature_extractor_config.additional_layer_depth 181 | 182 | if is_keras_extractor: 183 | conv_hyperparams = hyperparams_builder.KerasLayerHyperparams( 184 | feature_extractor_config.conv_hyperparams) 185 | else: 186 | conv_hyperparams = hyperparams_builder.build( 187 | feature_extractor_config.conv_hyperparams, is_training) 188 | override_base_feature_extractor_hyperparams = ( 189 | feature_extractor_config.override_base_feature_extractor_hyperparams) 190 | 191 | if (feature_type not in SSD_FEATURE_EXTRACTOR_CLASS_MAP) and ( 192 | not is_keras_extractor): 193 | raise ValueError('Unknown ssd feature_extractor: {}'.format(feature_type)) 194 | 195 | if is_keras_extractor: 196 | feature_extractor_class = SSD_KERAS_FEATURE_EXTRACTOR_CLASS_MAP[feature_type] 197 | else: 198 | feature_extractor_class = SSD_FEATURE_EXTRACTOR_CLASS_MAP[feature_type] 199 | kwargs = { 200 | 'is_training': 201 | is_training, 202 | 'depth_multiplier': 203 | depth_multiplier, 204 | 'min_depth': 205 | min_depth, 206 | 'pad_to_multiple': 207 
| pad_to_multiple, 208 | 'use_explicit_padding': 209 | use_explicit_padding, 210 | 'use_depthwise': 211 | use_depthwise, 212 | 'override_base_feature_extractor_hyperparams': 213 | override_base_feature_extractor_hyperparams, 214 | } 215 | if "ssd_mobilenet_v3" in feature_type: 216 | kwargs.update({'network_version': network_version, 217 | 'min_feature_level': min_feature_level, 218 | 'max_feature_level': max_feature_level, 219 | 'additional_layer_depth': additional_layer_depth, 220 | "use_antialias": use_antialias}) 221 | if feature_extractor_config.HasField('replace_preprocessor_with_placeholder'): 222 | kwargs.update({ 223 | 'replace_preprocessor_with_placeholder': 224 | feature_extractor_config.replace_preprocessor_with_placeholder 225 | }) 226 | 227 | if is_keras_extractor: 228 | kwargs.update({ 229 | 'conv_hyperparams': conv_hyperparams, 230 | 'inplace_batchnorm_update': False, 231 | 'freeze_batchnorm': freeze_batchnorm 232 | }) 233 | else: 234 | kwargs.update({ 235 | 'conv_hyperparams_fn': conv_hyperparams, 236 | 'reuse_weights': reuse_weights, 237 | }) 238 | 239 | if feature_extractor_config.HasField('fpn'): 240 | kwargs.update({ 241 | 'fpn_min_level': 242 | feature_extractor_config.fpn.min_level, 243 | 'fpn_max_level': 244 | feature_extractor_config.fpn.max_level, 245 | 'additional_layer_depth': 246 | feature_extractor_config.fpn.additional_layer_depth, 247 | }) 248 | return feature_extractor_class(**kwargs) 249 | 250 | 251 | def _build_ssd_model(ssd_config, is_training, add_summaries): 252 | """Builds an SSD detection model based on the model config. 253 | 254 | Args: 255 | ssd_config: A ssd.proto object containing the config for the desired 256 | SSDMetaArch. 257 | is_training: True if this model is being built for training purposes. 258 | add_summaries: Whether to add tf summaries in the model. 259 | Returns: 260 | SSDMetaArch based on the config. 261 | 262 | Raises: 263 | ValueError: If ssd_config.type is not recognized (i.e. not registered in 264 | model_class_map). 
265 | """ 266 | num_classes = ssd_config.num_classes 267 | feature_extractor = _build_ssd_feature_extractor( 268 | feature_extractor_config=ssd_config.feature_extractor, 269 | freeze_batchnorm=ssd_config.freeze_batchnorm, 270 | is_training=is_training) 271 | box_coder = box_coder_builder.build(ssd_config.box_coder) 272 | matcher = matcher_builder.build(ssd_config.matcher) 273 | region_similarity_calculator = sim_calc.build( 274 | ssd_config.similarity_calculator) 275 | encode_background_as_zeros = ssd_config.encode_background_as_zeros 276 | negative_class_weight = ssd_config.negative_class_weight 277 | anchor_generator = anchor_generator_builder.build( 278 | ssd_config.anchor_generator) 279 | if feature_extractor.is_keras_model: 280 | ssd_box_predictor = box_predictor_builder.build_keras( 281 | hyperparams_fn=hyperparams_builder.KerasLayerHyperparams, 282 | freeze_batchnorm=ssd_config.freeze_batchnorm, 283 | inplace_batchnorm_update=False, 284 | num_predictions_per_location_list=anchor_generator 285 | .num_anchors_per_location(), 286 | box_predictor_config=ssd_config.box_predictor, 287 | is_training=is_training, 288 | num_classes=num_classes, 289 | add_background_class=ssd_config.add_background_class) 290 | else: 291 | ssd_box_predictor = box_predictor_builder.build( 292 | hyperparams_builder.build, ssd_config.box_predictor, is_training, 293 | num_classes, ssd_config.add_background_class) 294 | image_resizer_fn = image_resizer_builder.build(ssd_config.image_resizer) 295 | non_max_suppression_fn, score_conversion_fn = post_processing_builder.build( 296 | ssd_config.post_processing) 297 | (classification_loss, localization_loss, classification_weight, 298 | localization_weight, hard_example_miner, random_example_sampler, 299 | expected_loss_weights_fn) = losses_builder.build(ssd_config.loss) 300 | normalize_loss_by_num_matches = ssd_config.normalize_loss_by_num_matches 301 | normalize_loc_loss_by_codesize = ssd_config.normalize_loc_loss_by_codesize 302 | 303 | equalization_loss_config = ops.EqualizationLossConfig( 304 | weight=ssd_config.loss.equalization_loss.weight, 305 | exclude_prefixes=ssd_config.loss.equalization_loss.exclude_prefixes) 306 | 307 | target_assigner_instance = target_assigner.TargetAssigner( 308 | region_similarity_calculator, 309 | matcher, 310 | box_coder, 311 | negative_class_weight=negative_class_weight) 312 | kwargs = {} 313 | ssd_meta_arch_fn = ssd_meta_arch.SSDMetaArch 314 | kwargs.update({"feature_extractor": feature_extractor}) 315 | 316 | return ssd_meta_arch_fn( 317 | is_training=is_training, 318 | anchor_generator=anchor_generator, 319 | box_predictor=ssd_box_predictor, 320 | box_coder=box_coder, 321 | encode_background_as_zeros=encode_background_as_zeros, 322 | image_resizer_fn=image_resizer_fn, 323 | non_max_suppression_fn=non_max_suppression_fn, 324 | score_conversion_fn=score_conversion_fn, 325 | classification_loss=classification_loss, 326 | localization_loss=localization_loss, 327 | classification_loss_weight=classification_weight, 328 | localization_loss_weight=localization_weight, 329 | normalize_loss_by_num_matches=normalize_loss_by_num_matches, 330 | hard_example_miner=hard_example_miner, 331 | target_assigner_instance=target_assigner_instance, 332 | add_summaries=add_summaries, 333 | normalize_loc_loss_by_codesize=normalize_loc_loss_by_codesize, 334 | freeze_batchnorm=ssd_config.freeze_batchnorm, 335 | inplace_batchnorm_update=ssd_config.inplace_batchnorm_update, 336 | add_background_class=ssd_config.add_background_class, 337 | 
explicit_background_class=ssd_config.explicit_background_class, 338 | random_example_sampler=random_example_sampler, 339 | expected_loss_weights_fn=expected_loss_weights_fn, 340 | use_confidences_as_targets=ssd_config.use_confidences_as_targets, 341 | implicit_example_weight=ssd_config.implicit_example_weight, 342 | equalization_loss_config=equalization_loss_config, 343 | **kwargs) 344 | 345 | 346 | def _build_faster_rcnn_feature_extractor( 347 | feature_extractor_config, is_training, reuse_weights=None, 348 | inplace_batchnorm_update=False): 349 | """Builds a faster_rcnn_meta_arch.FasterRCNNFeatureExtractor based on config. 350 | 351 | Args: 352 | feature_extractor_config: A FasterRcnnFeatureExtractor proto config from 353 | faster_rcnn.proto. 354 | is_training: True if this feature extractor is being built for training. 355 | reuse_weights: if the feature extractor should reuse weights. 356 | inplace_batchnorm_update: Whether to update batch_norm inplace during 357 | training. This is required for batch norm to work correctly on TPUs. When 358 | this is false, user must add a control dependency on 359 | tf.GraphKeys.UPDATE_OPS for train/loss op in order to update the batch 360 | norm moving average parameters. 361 | 362 | Returns: 363 | faster_rcnn_meta_arch.FasterRCNNFeatureExtractor based on config. 364 | 365 | Raises: 366 | ValueError: On invalid feature extractor type. 367 | """ 368 | if inplace_batchnorm_update: 369 | raise ValueError('inplace batchnorm updates not supported.') 370 | feature_type = feature_extractor_config.type 371 | first_stage_features_stride = ( 372 | feature_extractor_config.first_stage_features_stride) 373 | batch_norm_trainable = feature_extractor_config.batch_norm_trainable 374 | 375 | if feature_type not in FASTER_RCNN_FEATURE_EXTRACTOR_CLASS_MAP: 376 | raise ValueError('Unknown Faster R-CNN feature_extractor: {}'.format( 377 | feature_type)) 378 | feature_extractor_class = FASTER_RCNN_FEATURE_EXTRACTOR_CLASS_MAP[ 379 | feature_type] 380 | return feature_extractor_class( 381 | is_training, first_stage_features_stride, 382 | batch_norm_trainable, reuse_weights=reuse_weights) 383 | 384 | 385 | def _build_faster_rcnn_keras_feature_extractor( 386 | feature_extractor_config, is_training, 387 | inplace_batchnorm_update=False): 388 | """Builds a faster_rcnn_meta_arch.FasterRCNNKerasFeatureExtractor from config. 389 | 390 | Args: 391 | feature_extractor_config: A FasterRcnnFeatureExtractor proto config from 392 | faster_rcnn.proto. 393 | is_training: True if this feature extractor is being built for training. 394 | inplace_batchnorm_update: Whether to update batch_norm inplace during 395 | training. This is required for batch norm to work correctly on TPUs. When 396 | this is false, user must add a control dependency on 397 | tf.GraphKeys.UPDATE_OPS for train/loss op in order to update the batch 398 | norm moving average parameters. 399 | 400 | Returns: 401 | faster_rcnn_meta_arch.FasterRCNNKerasFeatureExtractor based on config. 402 | 403 | Raises: 404 | ValueError: On invalid feature extractor type. 
405 | """ 406 | if inplace_batchnorm_update: 407 | raise ValueError('inplace batchnorm updates not supported.') 408 | feature_type = feature_extractor_config.type 409 | first_stage_features_stride = ( 410 | feature_extractor_config.first_stage_features_stride) 411 | batch_norm_trainable = feature_extractor_config.batch_norm_trainable 412 | 413 | if feature_type not in FASTER_RCNN_KERAS_FEATURE_EXTRACTOR_CLASS_MAP: 414 | raise ValueError('Unknown Faster R-CNN feature_extractor: {}'.format( 415 | feature_type)) 416 | feature_extractor_class = FASTER_RCNN_KERAS_FEATURE_EXTRACTOR_CLASS_MAP[ 417 | feature_type] 418 | return feature_extractor_class( 419 | is_training, first_stage_features_stride, 420 | batch_norm_trainable) 421 | 422 | 423 | def _build_faster_rcnn_model(frcnn_config, is_training, add_summaries): 424 | """Builds a Faster R-CNN or R-FCN detection model based on the model config. 425 | 426 | Builds R-FCN model if the second_stage_box_predictor in the config is of type 427 | `rfcn_box_predictor` else builds a Faster R-CNN model. 428 | 429 | Args: 430 | frcnn_config: A faster_rcnn.proto object containing the config for the 431 | desired FasterRCNNMetaArch or RFCNMetaArch. 432 | is_training: True if this model is being built for training purposes. 433 | add_summaries: Whether to add tf summaries in the model. 434 | 435 | Returns: 436 | FasterRCNNMetaArch based on the config. 437 | 438 | Raises: 439 | ValueError: If frcnn_config.type is not recognized (i.e. not registered in 440 | model_class_map). 441 | """ 442 | num_classes = frcnn_config.num_classes 443 | image_resizer_fn = image_resizer_builder.build(frcnn_config.image_resizer) 444 | 445 | is_keras = (frcnn_config.feature_extractor.type in 446 | FASTER_RCNN_KERAS_FEATURE_EXTRACTOR_CLASS_MAP) 447 | 448 | if is_keras: 449 | feature_extractor = _build_faster_rcnn_keras_feature_extractor( 450 | frcnn_config.feature_extractor, is_training, 451 | inplace_batchnorm_update=frcnn_config.inplace_batchnorm_update) 452 | else: 453 | feature_extractor = _build_faster_rcnn_feature_extractor( 454 | frcnn_config.feature_extractor, is_training, 455 | inplace_batchnorm_update=frcnn_config.inplace_batchnorm_update) 456 | 457 | number_of_stages = frcnn_config.number_of_stages 458 | first_stage_anchor_generator = anchor_generator_builder.build( 459 | frcnn_config.first_stage_anchor_generator) 460 | 461 | first_stage_target_assigner = target_assigner.create_target_assigner( 462 | 'FasterRCNN', 463 | 'proposal', 464 | use_matmul_gather=frcnn_config.use_matmul_gather_in_matcher) 465 | first_stage_atrous_rate = frcnn_config.first_stage_atrous_rate 466 | if is_keras: 467 | first_stage_box_predictor_arg_scope_fn = ( 468 | hyperparams_builder.KerasLayerHyperparams( 469 | frcnn_config.first_stage_box_predictor_conv_hyperparams)) 470 | else: 471 | first_stage_box_predictor_arg_scope_fn = hyperparams_builder.build( 472 | frcnn_config.first_stage_box_predictor_conv_hyperparams, is_training) 473 | first_stage_box_predictor_kernel_size = ( 474 | frcnn_config.first_stage_box_predictor_kernel_size) 475 | first_stage_box_predictor_depth = frcnn_config.first_stage_box_predictor_depth 476 | first_stage_minibatch_size = frcnn_config.first_stage_minibatch_size 477 | use_static_shapes = frcnn_config.use_static_shapes and ( 478 | frcnn_config.use_static_shapes_for_eval or is_training) 479 | first_stage_sampler = sampler.BalancedPositiveNegativeSampler( 480 | positive_fraction=frcnn_config.first_stage_positive_balance_fraction, 481 | 
is_static=(frcnn_config.use_static_balanced_label_sampler and 482 | use_static_shapes)) 483 | first_stage_max_proposals = frcnn_config.first_stage_max_proposals 484 | if (frcnn_config.first_stage_nms_iou_threshold < 0 or 485 | frcnn_config.first_stage_nms_iou_threshold > 1.0): 486 | raise ValueError('iou_threshold not in [0, 1.0].') 487 | if (is_training and frcnn_config.second_stage_batch_size > 488 | first_stage_max_proposals): 489 | raise ValueError('second_stage_batch_size should be no greater than ' 490 | 'first_stage_max_proposals.') 491 | first_stage_non_max_suppression_fn = functools.partial( 492 | post_processing.batch_multiclass_non_max_suppression, 493 | score_thresh=frcnn_config.first_stage_nms_score_threshold, 494 | iou_thresh=frcnn_config.first_stage_nms_iou_threshold, 495 | max_size_per_class=frcnn_config.first_stage_max_proposals, 496 | max_total_size=frcnn_config.first_stage_max_proposals, 497 | use_static_shapes=use_static_shapes) 498 | first_stage_loc_loss_weight = ( 499 | frcnn_config.first_stage_localization_loss_weight) 500 | first_stage_obj_loss_weight = frcnn_config.first_stage_objectness_loss_weight 501 | 502 | initial_crop_size = frcnn_config.initial_crop_size 503 | maxpool_kernel_size = frcnn_config.maxpool_kernel_size 504 | maxpool_stride = frcnn_config.maxpool_stride 505 | 506 | second_stage_target_assigner = target_assigner.create_target_assigner( 507 | 'FasterRCNN', 508 | 'detection', 509 | use_matmul_gather=frcnn_config.use_matmul_gather_in_matcher) 510 | if is_keras: 511 | second_stage_box_predictor = box_predictor_builder.build_keras( 512 | hyperparams_builder.KerasLayerHyperparams, 513 | freeze_batchnorm=False, 514 | inplace_batchnorm_update=False, 515 | num_predictions_per_location_list=[1], 516 | box_predictor_config=frcnn_config.second_stage_box_predictor, 517 | is_training=is_training, 518 | num_classes=num_classes) 519 | else: 520 | second_stage_box_predictor = box_predictor_builder.build( 521 | hyperparams_builder.build, 522 | frcnn_config.second_stage_box_predictor, 523 | is_training=is_training, 524 | num_classes=num_classes) 525 | second_stage_batch_size = frcnn_config.second_stage_batch_size 526 | second_stage_sampler = sampler.BalancedPositiveNegativeSampler( 527 | positive_fraction=frcnn_config.second_stage_balance_fraction, 528 | is_static=(frcnn_config.use_static_balanced_label_sampler and 529 | use_static_shapes)) 530 | (second_stage_non_max_suppression_fn, second_stage_score_conversion_fn 531 | ) = post_processing_builder.build(frcnn_config.second_stage_post_processing) 532 | second_stage_localization_loss_weight = ( 533 | frcnn_config.second_stage_localization_loss_weight) 534 | second_stage_classification_loss = ( 535 | losses_builder.build_faster_rcnn_classification_loss( 536 | frcnn_config.second_stage_classification_loss)) 537 | second_stage_classification_loss_weight = ( 538 | frcnn_config.second_stage_classification_loss_weight) 539 | second_stage_mask_prediction_loss_weight = ( 540 | frcnn_config.second_stage_mask_prediction_loss_weight) 541 | 542 | hard_example_miner = None 543 | if frcnn_config.HasField('hard_example_miner'): 544 | hard_example_miner = losses_builder.build_hard_example_miner( 545 | frcnn_config.hard_example_miner, 546 | second_stage_classification_loss_weight, 547 | second_stage_localization_loss_weight) 548 | 549 | crop_and_resize_fn = ( 550 | ops.matmul_crop_and_resize if frcnn_config.use_matmul_crop_and_resize 551 | else ops.native_crop_and_resize) 552 | clip_anchors_to_image = ( 553 | 
frcnn_config.clip_anchors_to_image) 554 | 555 | common_kwargs = { 556 | 'is_training': is_training, 557 | 'num_classes': num_classes, 558 | 'image_resizer_fn': image_resizer_fn, 559 | 'feature_extractor': feature_extractor, 560 | 'number_of_stages': number_of_stages, 561 | 'first_stage_anchor_generator': first_stage_anchor_generator, 562 | 'first_stage_target_assigner': first_stage_target_assigner, 563 | 'first_stage_atrous_rate': first_stage_atrous_rate, 564 | 'first_stage_box_predictor_arg_scope_fn': 565 | first_stage_box_predictor_arg_scope_fn, 566 | 'first_stage_box_predictor_kernel_size': 567 | first_stage_box_predictor_kernel_size, 568 | 'first_stage_box_predictor_depth': first_stage_box_predictor_depth, 569 | 'first_stage_minibatch_size': first_stage_minibatch_size, 570 | 'first_stage_sampler': first_stage_sampler, 571 | 'first_stage_non_max_suppression_fn': first_stage_non_max_suppression_fn, 572 | 'first_stage_max_proposals': first_stage_max_proposals, 573 | 'first_stage_localization_loss_weight': first_stage_loc_loss_weight, 574 | 'first_stage_objectness_loss_weight': first_stage_obj_loss_weight, 575 | 'second_stage_target_assigner': second_stage_target_assigner, 576 | 'second_stage_batch_size': second_stage_batch_size, 577 | 'second_stage_sampler': second_stage_sampler, 578 | 'second_stage_non_max_suppression_fn': 579 | second_stage_non_max_suppression_fn, 580 | 'second_stage_score_conversion_fn': second_stage_score_conversion_fn, 581 | 'second_stage_localization_loss_weight': 582 | second_stage_localization_loss_weight, 583 | 'second_stage_classification_loss': 584 | second_stage_classification_loss, 585 | 'second_stage_classification_loss_weight': 586 | second_stage_classification_loss_weight, 587 | 'hard_example_miner': hard_example_miner, 588 | 'add_summaries': add_summaries, 589 | 'crop_and_resize_fn': crop_and_resize_fn, 590 | 'clip_anchors_to_image': clip_anchors_to_image, 591 | 'use_static_shapes': use_static_shapes, 592 | 'resize_masks': frcnn_config.resize_masks 593 | } 594 | 595 | if (isinstance(second_stage_box_predictor, 596 | rfcn_box_predictor.RfcnBoxPredictor) or 597 | isinstance(second_stage_box_predictor, 598 | rfcn_keras_box_predictor.RfcnKerasBoxPredictor)): 599 | return rfcn_meta_arch.RFCNMetaArch( 600 | second_stage_rfcn_box_predictor=second_stage_box_predictor, 601 | **common_kwargs) 602 | else: 603 | return faster_rcnn_meta_arch.FasterRCNNMetaArch( 604 | initial_crop_size=initial_crop_size, 605 | maxpool_kernel_size=maxpool_kernel_size, 606 | maxpool_stride=maxpool_stride, 607 | second_stage_mask_rcnn_box_predictor=second_stage_box_predictor, 608 | second_stage_mask_prediction_loss_weight=( 609 | second_stage_mask_prediction_loss_weight), 610 | **common_kwargs) 611 | -------------------------------------------------------------------------------- /retinanet.py: -------------------------------------------------------------------------------- 1 | """Contains the definition of the RetinaNet architecture. 2 | 3 | As described by Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He et al. 4 | Feature Pyramid Networks for Object Detection.
arXiv: 1612.03144 5 | 6 | Example usage (resnet101 backbone, input shape [batch, 224, 224, 3]): 7 | features = retinanet_fpn(inputs, 8 | block_layers=[3, 4, 23, 3], 9 | depth=256, 10 | is_training=False) 11 | """ 12 | import tensorflow as tf 13 | import math 14 | from object_detection.utils.shape_utils import combined_static_and_dynamic_shape 15 | 16 | # tf.enable_eager_execution() 17 | BN_PARAMS = {"bn_decay": 0.997, 18 | "bn_epsilon": 1e-4} 19 | 20 | 21 | # define number of layers of each block for different architecture 22 | RESNET_ARCH_BLOCK = {"resnet50": [3, 4, 6, 3], 23 | "resnet101": [3, 4, 23, 3]} 24 | 25 | 26 | def nearest_neighbor_upsampling(input_tensor, scale): 27 | """Nearest neighbor upsampling implementation. 28 | NOTE: See TensorFlow Object Detection API utils.ops 29 | Args: 30 | input_tensor: A float32 tensor of size [batch, height_in, width_in, channels]. 31 | scale: An integer multiple to scale resolution of input data. 32 | Returns: 33 | upsample_input: A float32 tensor of size [batch, height_in*scale, width_in*scale, channels]. 34 | """ 35 | with tf.name_scope('nearest_neighbor_upsampling'): 36 | (batch_size, h, w, c) = combined_static_and_dynamic_shape(input_tensor) 37 | output_tensor = tf.reshape(input_tensor, [batch_size, h, 1, w, 1, c]) * tf.ones( 38 | [1, 1, scale, 1, scale, 1], dtype=input_tensor.dtype) 39 | return tf.reshape(output_tensor, [batch_size, h*scale, w*scale, c]) 40 | 41 | 42 | def conv2d_same(inputs, depth, kernel_size, strides, scope=None): 43 | with tf.name_scope(scope, None): 44 | if strides == 1: 45 | return tf.layers.conv2d(inputs, depth, kernel_size, padding='SAME') 46 | else: 47 | pad_total = kernel_size - 1 48 | pad_beg = pad_total // 2 49 | pad_end = pad_total - pad_beg 50 | inputs = tf.pad(inputs, [[0, 0], [pad_beg, pad_end], [pad_beg, pad_end], [0, 0]]) 51 | return tf.layers.conv2d(inputs, 52 | depth, 53 | kernel_size, 54 | strides=strides, 55 | padding='VALID', 56 | use_bias=False, 57 | kernel_initializer=tf.variance_scaling_initializer()) 58 | 59 | 60 | def bn_with_relu(inputs, is_training, relu=True, init_zero=False, name=None): 61 | if not init_zero: 62 | gamma_init = tf.ones_initializer() 63 | else: 64 | gamma_init = tf.zeros_initializer() 65 | inputs = tf.layers.batch_normalization(inputs, 66 | training=is_training, 67 | momentum=BN_PARAMS["bn_decay"], 68 | epsilon=BN_PARAMS["bn_epsilon"], 69 | scale=True, 70 | fused=True, 71 | gamma_initializer=gamma_init, 72 | name=name) 73 | if relu: 74 | inputs = tf.nn.relu(inputs) 75 | return inputs 76 | 77 | 78 | def bottleneck(inputs, depth, strides, is_training, projection=False, scope=None): 79 | """Bottleneck residual unit variant with BN after convolutions. 80 | When stacking these units into ResNet blocks, downsampling is performed by 81 | the first unit of a block, which uses strides=2 and a projection shortcut 82 | 83 | Args: 84 | inputs: A tensor of size [batch_size, height, width, channels] (after BN) 85 | depth: The depth of the block unit output 86 | strides: the ResNet unit's stride. Determines the amount of downsampling of 87 | the unit's output compared to its input 88 | is_training: indicate training state for BN layer 89 | projection: if this block will use a projection.
True for the first unit in each block group 90 | scope: Optional variable scope 91 | 92 | Returns: 93 | The ResNet unit output 94 | """ 95 | with tf.variable_scope(scope, 'bottleneck', [inputs]) as sc: 96 | # shortcut connection 97 | shortcut = inputs 98 | depth_out = depth * 4 99 | if projection: 100 | shortcut = conv2d_same(shortcut, depth_out, kernel_size=1, strides=strides, scope='shortcut') 101 | shortcut = bn_with_relu(shortcut, is_training, relu=False) 102 | # layer1 103 | residual = conv2d_same(inputs, depth, kernel_size=1, strides=1, scope='conv1') 104 | residual = bn_with_relu(residual, is_training) 105 | # layer 2 106 | residual = conv2d_same(residual, depth, kernel_size=3, strides=strides, scope='conv2') 107 | residual = bn_with_relu(residual, is_training) 108 | # layer 3 109 | residual = conv2d_same(residual, depth_out, kernel_size=1, strides=1, scope='conv3') 110 | residual = bn_with_relu(residual, is_training, relu=False, init_zero=True) 111 | output = shortcut + residual 112 | return tf.nn.relu(output) 113 | 114 | 115 | def stack_bottleneck(inputs, layers, depth, strides, is_training, scope=None): 116 | """ Stack bottleneck units into a ResNet block 117 | 118 | This function creates scopes for the ResNet in the form of 'block_name/unit_1', 'block_name/unit_2', etc. 119 | Args: 120 | layers: number of layers in this block 121 | """ 122 | with tf.variable_scope(scope, 'block', [inputs]) as sc: 123 | inputs = bottleneck(inputs, depth, strides=strides, is_training=is_training, projection=True) 124 | for i in range(1, layers): 125 | layer_scope = "unit_{}".format(i) 126 | inputs = bottleneck(inputs, depth, strides=1, is_training=is_training, scope=layer_scope) 127 | return inputs 128 | 129 | 130 | def retinanet_fpn(inputs, 131 | block_layers, 132 | depth=256, 133 | is_training=True, 134 | scope=None): 135 | """ 136 | Generator for RetinaNet FPN models. A small modification of the original FPN model that returns layers 137 | {P3, P4, P5, P6, P7}. See paper Focal Loss for Dense Object Detection. arxiv: 1708.02002 138 | 139 | P2 is discarded and P6 is obtained via 3x3 stride-2 conv on c5; P7 is computed by applying ReLU followed by 140 | 3x3 stride-2 conv on P6.
P7 is added to improve detection of large objects. 141 | 142 | Returns: 143 | 5 feature map tensors: {P3, P4, P5, P6, P7} 144 | """ 145 | with tf.variable_scope(scope, 'retinanet_fpn', [inputs]) as sc: 146 | net = conv2d_same(inputs, 64, kernel_size=7, strides=2, scope='conv1') 147 | net = bn_with_relu(net, is_training) 148 | net = tf.layers.max_pooling2d(net, pool_size=3, strides=2, padding='SAME', name='pool1') 149 | # Bottom up 150 | # block 1, down-sampling is done in conv3_1, conv4_1, conv5_1 151 | p2 = stack_bottleneck(net, layers=block_layers[0], depth=64, strides=1, is_training=is_training) 152 | # block 2 153 | p3 = stack_bottleneck(p2, layers=block_layers[1], depth=128, strides=2, is_training=is_training) 154 | # block 3 155 | p4 = stack_bottleneck(p3, layers=block_layers[2], depth=256, strides=2, is_training=is_training) 156 | # block 4 157 | p5 = stack_bottleneck(p4, layers=block_layers[3], depth=512, strides=2, is_training=is_training) 158 | # lateral layer 159 | l3 = tf.layers.conv2d(p3, filters=depth, kernel_size=1, strides=1, name='l3', padding='SAME') 160 | l4 = tf.layers.conv2d(p4, filters=depth, kernel_size=1, strides=1, name='l4', padding='SAME') 161 | l5 = tf.layers.conv2d(p5, filters=depth, kernel_size=1, strides=1, name='l5', padding='SAME') 162 | # Top down 163 | p4 = nearest_neighbor_upsampling(l5, 2) + l4 164 | p3 = nearest_neighbor_upsampling(p4, 2) + l3 165 | # add post-hoc conv layers 166 | p3 = tf.layers.conv2d(p3, filters=depth, kernel_size=3, strides=1, padding='SAME', name='post-hoc-d3') 167 | p4 = tf.layers.conv2d(p4, filters=depth, kernel_size=3, strides=1, padding='SAME', name='post-hoc-d4') 168 | p5 = tf.layers.conv2d(l5, filters=depth, kernel_size=3, strides=1, padding='SAME', name='post-hoc-d5') 169 | # coarse layer: 6, 7 170 | # p6 171 | p6 = tf.layers.conv2d(p5, filters=depth, kernel_size=3, strides=2, name='conv6', padding='SAME') 172 | p6 = tf.nn.relu(p6) 173 | # P7 174 | p7 = tf.layers.conv2d(p6, filters=depth, kernel_size=3, strides=2, name='conv7', padding='SAME') 175 | # add normalization to each layer 176 | features = {3: p3, 177 | 4: p4, 178 | 5: p5, 179 | 6: p6, 180 | 7: p7} 181 | for layer in features: 182 | features[layer] = tf.layers.batch_normalization(features[layer], 183 | training=is_training, 184 | momentum=BN_PARAMS["bn_decay"], 185 | epsilon=BN_PARAMS["bn_epsilon"], 186 | center=True, 187 | scale=True, 188 | fused=True, 189 | name='p{}-bn'.format(layer)) 190 | return features 191 | 192 | 193 | def share_weight_class_net(inputs, level, num_classes, num_anchors_per_loc, num_layers_before_predictor=4, is_training=True): 194 | """ 195 | Net for predicting class labels 196 | NOTE: Shares the same weights when called more than once on different feature maps 197 | Args: 198 | inputs: feature map with shape (batch_size, h, w, channel) 199 | level: which feature map 200 | num_classes: number of predicted classes 201 | num_anchors_per_loc: number of anchors at each spatial location in feature map 202 | num_layers_before_predictor: number of the additional conv layers before the predictor.
203 | is_training: whether the net is in training mode 204 | Returns: 205 | feature map with shape (batch_size, h, w, num_classes*num_anchors_per_loc) 206 | """ 207 | for i in range(num_layers_before_predictor): 208 | inputs = tf.layers.conv2d(inputs, filters=256, kernel_size=3, strides=1, 209 | kernel_initializer=tf.random_normal_initializer(stddev=0.01), 210 | bias_initializer=tf.zeros_initializer(), 211 | padding="SAME", 212 | name='class_{}'.format(i)) 213 | inputs = bn_with_relu(inputs, is_training, relu=True, init_zero=False, name="class_{}_bn_level_{}".format(i, level)) 214 | outputs = tf.layers.conv2d(inputs, 215 | filters=num_classes*num_anchors_per_loc, 216 | kernel_size=3, 217 | bias_initializer=tf.constant_initializer(-math.log((1 - 0.01) / 0.01)), 218 | kernel_initializer=tf.random_normal_initializer(stddev=0.01), 219 | padding="SAME", 220 | name="class_pred") 221 | return outputs 222 | 223 | 224 | def share_weight_box_net(inputs, level, num_anchors_per_loc, num_layers_before_predictor=4, is_training=True): 225 | """ 226 | Similar to class_net with output feature shape (batch_size, h, w, num_anchors_per_loc*4) 227 | """ 228 | for i in range(num_layers_before_predictor): 229 | inputs = tf.layers.conv2d(inputs, filters=256, kernel_size=3, strides=1, 230 | bias_initializer=tf.zeros_initializer(), 231 | kernel_initializer=tf.random_normal_initializer(stddev=0.01), 232 | padding="SAME", 233 | name='box_{}'.format(i)) 234 | inputs = bn_with_relu(inputs, is_training, relu=True, init_zero=False, name="box_{}_bn_level_{}".format(i, level)) 235 | outputs = tf.layers.conv2d(inputs, 236 | filters=4*num_anchors_per_loc, 237 | kernel_size=3, 238 | kernel_initializer=tf.random_normal_initializer(stddev=0.01), 239 | padding="SAME", 240 | name="box_pred") 241 | return outputs 242 | 243 | 244 | def retinanet(images, num_classes, num_anchors_per_loc, resnet_arch='resnet50', is_training=True): 245 | """ 246 | Get box prediction features and class prediction features from given images 247 | Args: 248 | images: input batch of images with shape (batch_size, h, w, 3) 249 | num_classes: number of classes for prediction 250 | num_anchors_per_loc: number of anchors at each feature map spatial location 251 | resnet_arch: name of the resnet architecture to use 252 | is_training: indicate training or not 253 | Returns: 254 | prediction dict holding the following items: 255 | box_pred: tensor concatenated from each feature map with shape (batch_size, num_anchors, 4) 256 | cls_pred: tensor concatenated from each feature map with shape (batch_size, num_anchors, num_classes+1) 257 | feature_map_list: list of feature map tensors 258 | """ 259 | assert resnet_arch in list(RESNET_ARCH_BLOCK.keys()), "resnet architecture not defined" 260 | with tf.variable_scope('retinanet'): 261 | batch_size = combined_static_and_dynamic_shape(images)[0] 262 | features = retinanet_fpn(images, block_layers=RESNET_ARCH_BLOCK[resnet_arch], is_training=is_training) 263 | class_pred = [] 264 | box_pred = [] 265 | feature_map_list = [] 266 | num_slots = num_classes + 1 267 | with tf.variable_scope('class_net', reuse=tf.AUTO_REUSE): 268 | for level in features.keys(): 269 | class_outputs = share_weight_class_net(features[level], level, 270 | num_slots, 271 | num_anchors_per_loc, 272 | is_training=is_training) 273 | class_outputs = tf.reshape(class_outputs, shape=[batch_size, -1, num_slots]) 274 | class_pred.append(class_outputs) 275 | feature_map_list.append(features[level]) 276 | with tf.variable_scope('box_net', reuse=tf.AUTO_REUSE): 277 | for level in features.keys(): 278 |
box_outputs = share_weight_box_net(features[level], level, num_anchors_per_loc, is_training=is_training) 279 | box_outputs = tf.reshape(box_outputs, shape=[batch_size, -1, 4]) 280 | box_pred.append(box_outputs) 281 | return dict(box_pred=tf.concat(box_pred, axis=1), 282 | cls_pred=tf.concat(class_pred, axis=1), 283 | feature_map_list=feature_map_list) 284 | 285 | -------------------------------------------------------------------------------- /retinanet_50_train.config: -------------------------------------------------------------------------------- 1 | # SSD with Resnet 50 v1 FPN feature extractor, shared box predictor and focal 2 | # loss (a.k.a Retinanet). 3 | # See Lin et al, https://arxiv.org/abs/1708.02002 4 | # Trained on COCO, initialized from Imagenet classification checkpoint 5 | 6 | # Achieves 35.2 mAP on COCO14 minival dataset. Doubling the number of training 7 | # steps to 50k gets 36.9 mAP 8 | 9 | # This config is TPU compatible 10 | 11 | model { 12 | ssd { 13 | inplace_batchnorm_update: true 14 | freeze_batchnorm: false 15 | num_classes: 90 16 | box_coder { 17 | faster_rcnn_box_coder { 18 | y_scale: 10.0 19 | x_scale: 10.0 20 | height_scale: 5.0 21 | width_scale: 5.0 22 | } 23 | } 24 | matcher { 25 | argmax_matcher { 26 | matched_threshold: 0.5 27 | unmatched_threshold: 0.5 28 | ignore_thresholds: false 29 | negatives_lower_than_unmatched: true 30 | force_match_for_each_row: true 31 | use_matmul_gather: true 32 | } 33 | } 34 | similarity_calculator { 35 | iou_similarity { 36 | } 37 | } 38 | encode_background_as_zeros: true 39 | anchor_generator { 40 | multiscale_anchor_generator { 41 | min_level: 3 42 | max_level: 7 43 | anchor_scale: 4.0 44 | aspect_ratios: [1.0, 2.0, 0.5] 45 | scales_per_octave: 2 46 | } 47 | } 48 | image_resizer { 49 | fixed_shape_resizer { 50 | height: 320 51 | width: 320 52 | } 53 | } 54 | box_predictor { 55 | weight_shared_convolutional_box_predictor { 56 | depth: 256 57 | class_prediction_bias_init: -4.6 58 | conv_hyperparams { 59 | activation: RELU_6, 60 | regularizer { 61 | l2_regularizer { 62 | weight: 0.0004 63 | } 64 | } 65 | initializer { 66 | random_normal_initializer { 67 | stddev: 0.01 68 | mean: 0.0 69 | } 70 | } 71 | batch_norm { 72 | scale: true, 73 | decay: 0.997, 74 | epsilon: 0.001, 75 | } 76 | } 77 | num_layers_before_predictor: 4 78 | kernel_size: 3 79 | } 80 | } 81 | feature_extractor { 82 | type: 'retinanet_50' 83 | min_depth: 16 84 | depth_multiplier: 1.0 85 | override_base_feature_extractor_hyperparams: true 86 | } 87 | loss { 88 | classification_loss { 89 | weighted_sigmoid_focal { 90 | alpha: 0.25 91 | gamma: 2.0 92 | } 93 | } 94 | localization_loss { 95 | weighted_smooth_l1 { 96 | } 97 | } 98 | classification_weight: 1.0 99 | localization_weight: 1.0 100 | } 101 | normalize_loss_by_num_matches: true 102 | normalize_loc_loss_by_codesize: true 103 | post_processing { 104 | batch_non_max_suppression { 105 | score_threshold: 1e-8 106 | iou_threshold: 0.6 107 | max_detections_per_class: 100 108 | max_total_detections: 100 109 | } 110 | score_converter: SIGMOID 111 | } 112 | } 113 | } 114 | 115 | train_config: { 116 | #fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt" 117 | batch_size: 4 118 | sync_replicas: true 119 | startup_delay_steps: 0 120 | replicas_to_aggregate: 8 121 | num_steps: 25000 122 | data_augmentation_options { 123 | random_horizontal_flip { 124 | } 125 | } 126 | data_augmentation_options { 127 | random_crop_image { 128 | min_object_covered: 0.0 129 | min_aspect_ratio: 0.75 130 | max_aspect_ratio: 3.0 131 | 
min_area: 0.75 132 | max_area: 1.0 133 | overlap_thresh: 0.0 134 | } 135 | } 136 | optimizer { 137 | momentum_optimizer: { 138 | learning_rate: { 139 | cosine_decay_learning_rate { 140 | learning_rate_base: .04 141 | total_steps: 25000 142 | warmup_learning_rate: .013333 143 | warmup_steps: 2000 144 | } 145 | } 146 | momentum_optimizer_value: 0.9 147 | } 148 | use_moving_average: false 149 | } 150 | max_number_of_boxes: 100 151 | unpad_groundtruth_tensors: false 152 | } 153 | 154 | train_input_reader: { 155 | tf_record_input_reader { 156 | input_path: "/home/arkenstone/ssd_res50_fpn/test_data/0310_train_0.record" 157 | } 158 | label_map_path: "/home/arkenstone/ssd_res50_fpn/test_data/label_map.pbtxt" 159 | } 160 | 161 | eval_config: { 162 | metrics_set: "coco_detection_metrics" 163 | use_moving_averages: false 164 | num_examples: 8000 165 | } 166 | 167 | eval_input_reader: { 168 | tf_record_input_reader { 169 | input_path: "/home/arkenstone/ssd_res50_fpn/test_data/0310_train_1.record" 170 | } 171 | label_map_path: "/home/arkenstone/ssd_res50_fpn/test_data/label_map.pbtxt" 172 | shuffle: false 173 | num_readers: 1 174 | } -------------------------------------------------------------------------------- /retinanet_feature_extractor.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """RetinaNet feature extractors based on Resnet v1. 16 | 17 | See https://arxiv.org/abs/1708.02002 for details. 18 | """ 19 | 20 | import tensorflow as tf 21 | 22 | from object_detection.meta_architectures import ssd_meta_arch 23 | from object_detection.utils import context_manager 24 | from object_detection.utils import ops 25 | from object_detection.utils import shape_utils 26 | from object_detection.models.retinanet import retinanet_fpn 27 | 28 | RESNET_ARCH_BLOCK = {"resnet50": [3, 4, 6, 3], 29 | "resnet101": [3, 4, 23, 3]} 30 | 31 | class RetinaNetFeatureExtractor(ssd_meta_arch.SSDFeatureExtractor): 32 | """SSD FPN feature extractor based on Resnet v1 architecture.""" 33 | 34 | def __init__(self, 35 | is_training, 36 | depth_multiplier, 37 | min_depth, 38 | conv_hyperparams_fn, 39 | pad_to_multiple, 40 | backbone, 41 | fpn_scope_name, 42 | min_level=3, 43 | max_level=7, 44 | additional_layer_depth=256, 45 | reuse_weights=None, 46 | use_explicit_padding=False, 47 | use_depthwise=False, 48 | override_base_feature_extractor_hyperparams=False): 49 | """RetinaNet feature extractor. 50 | 51 | Args: 52 | is_training: whether the network is in training mode. 53 | depth_multiplier: float depth multiplier for feature extractor. 54 | min_depth: minimum feature extractor depth. 55 | pad_to_multiple: the nearest multiple to zero pad the input height and 56 | width dimensions to. 
57 | fpn_scope_name: scope name under which to construct the feature pyramid 58 | network. 59 | additional_layer_depth: additional feature map layer channel depth. 60 | reuse_weights: Whether to reuse variables. Default is None. 61 | use_explicit_padding: Whether to use explicit padding when extracting 62 | features. Default is False. UNUSED currently. 63 | use_depthwise: Whether to use depthwise convolutions. UNUSED currently. 64 | override_base_feature_extractor_hyperparams: Whether to override 65 | hyperparameters of the base feature extractor with the one from 66 | `conv_hyperparams_fn`. 67 | 68 | Raises: 69 | ValueError: On supplying invalid arguments for unused arguments. 70 | """ 71 | super(RetinaNetFeatureExtractor, self).__init__( 72 | is_training=is_training, 73 | depth_multiplier=depth_multiplier, 74 | min_depth=min_depth, 75 | conv_hyperparams_fn=conv_hyperparams_fn, 76 | pad_to_multiple=pad_to_multiple, 77 | reuse_weights=reuse_weights, 78 | use_explicit_padding=use_explicit_padding, 79 | use_depthwise=use_depthwise, 80 | override_base_feature_extractor_hyperparams= 81 | override_base_feature_extractor_hyperparams) 82 | if self._use_explicit_padding is True: 83 | raise ValueError('Explicit padding is not a valid option.') 84 | self._backbone = backbone 85 | self._fpn_scope_name = fpn_scope_name 86 | self._min_level = min_level 87 | self._max_level = max_level 88 | self._additional_layer_depth = additional_layer_depth 89 | 90 | def preprocess(self, resized_inputs): 91 | """SSD preprocessing. 92 | 93 | VGG style channel mean subtraction as described here: 94 | https://gist.github.com/ksimonyan/211839e770f7b538e2d8#file-readme-md. 95 | Note that if the number of channels is not equal to 3, the mean subtraction 96 | will be skipped and the original resized_inputs will be returned. 97 | 98 | Args: 99 | resized_inputs: a [batch, height, width, channels] float tensor 100 | representing a batch of images. 101 | 102 | Returns: 103 | preprocessed_inputs: a [batch, height, width, channels] float tensor 104 | representing a batch of images. 105 | """ 106 | if resized_inputs.shape.as_list()[3] == 3: 107 | channel_means = [123.68, 116.779, 103.939] 108 | return resized_inputs - [[channel_means]] 109 | else: 110 | return resized_inputs 111 | 112 | def extract_features(self, preprocessed_inputs): 113 | """Extract features from preprocessed inputs. 114 | 115 | Args: 116 | preprocessed_inputs: a [batch, height, width, channels] float tensor 117 | representing a batch of images. 118 | 119 | Returns: 120 | feature_maps: a list of tensors where the ith tensor has shape 121 | [batch, height_i, width_i, depth_i] 122 | """ 123 | preprocessed_inputs = shape_utils.check_min_image_dim( 124 | 129, preprocessed_inputs) 125 | with tf.variable_scope( 126 | self._fpn_scope_name, reuse=self._reuse_weights) as scope: 127 | if self._backbone in list(RESNET_ARCH_BLOCK.keys()): 128 | block_layers = RESNET_ARCH_BLOCK[self._backbone] 129 | else: 130 | raise ValueError("Unknown backbone found!
Only resnet50 or resnet101 is allowed!") 131 | image_features = retinanet_fpn(inputs=preprocessed_inputs, 132 | block_layers=block_layers, 133 | depth=self._additional_layer_depth, 134 | is_training=self._is_training) 135 | return [image_features[x] for x in range(self._min_level, self._max_level+1)] 136 | 137 | 138 | class RetinaNet50FeatureExtractor(RetinaNetFeatureExtractor): 139 | """Resnet 50 RetinaNet feature extractor.""" 140 | def __init__(self, 141 | is_training, 142 | depth_multiplier, 143 | min_depth, 144 | conv_hyperparams_fn, 145 | pad_to_multiple, 146 | backbone='resnet50', 147 | additional_layer_depth=256, 148 | reuse_weights=None, 149 | use_explicit_padding=False, 150 | use_depthwise=False, 151 | override_base_feature_extractor_hyperparams=False): 152 | """ 153 | Args: 154 | is_training: whether the network is in training mode. 155 | depth_multiplier: float depth multiplier for feature extractor. 156 | UNUSED currently. 157 | min_depth: minimum feature extractor depth. UNUSED Currently. 158 | pad_to_multiple: the nearest multiple to zero pad the input height and 159 | width dimensions to. 160 | additional_layer_depth: additional feature map layer channel depth. 161 | reuse_weights: Whether to reuse variables. Default is None. 162 | use_explicit_padding: Whether to use explicit padding when extracting 163 | features. Default is False. UNUSED currently. 164 | use_depthwise: Whether to use depthwise convolutions. UNUSED currently. 165 | override_base_feature_extractor_hyperparams: Whether to override 166 | hyperparameters of the base feature extractor with the one from 167 | `conv_hyperparams_fn`. 168 | """ 169 | super(RetinaNet50FeatureExtractor, self).__init__( 170 | is_training=is_training, 171 | depth_multiplier=depth_multiplier, 172 | min_depth=min_depth, 173 | conv_hyperparams_fn=conv_hyperparams_fn, 174 | pad_to_multiple=pad_to_multiple, 175 | backbone='resnet50', 176 | fpn_scope_name='retinanet50', 177 | additional_layer_depth=additional_layer_depth, 178 | reuse_weights=reuse_weights, 179 | use_explicit_padding=use_explicit_padding, 180 | use_depthwise=use_depthwise, 181 | override_base_feature_extractor_hyperparams= 182 | override_base_feature_extractor_hyperparams) 183 | 184 | class RetinaNet101FeatureExtractor(RetinaNetFeatureExtractor): 185 | """Resnet 101 RetinaNet feature extractor.""" 186 | def __init__(self, 187 | is_training, 188 | depth_multiplier, 189 | min_depth, 190 | conv_hyperparams_fn, 191 | pad_to_multiple, 192 | backbone='resnet101', 193 | additional_layer_depth=256, 194 | reuse_weights=None, 195 | use_explicit_padding=False, 196 | use_depthwise=False, 197 | override_base_feature_extractor_hyperparams=False): 198 | """ 199 | Args: 200 | is_training: whether the network is in training mode. 201 | depth_multiplier: float depth multiplier for feature extractor. 202 | UNUSED currently. 203 | min_depth: minimum feature extractor depth. UNUSED Currently. 204 | pad_to_multiple: the nearest multiple to zero pad the input height and 205 | width dimensions to. 206 | additional_layer_depth: additional feature map layer channel depth. 207 | reuse_weights: Whether to reuse variables. Default is None. 208 | use_explicit_padding: Whether to use explicit padding when extracting 209 | features. Default is False. UNUSED currently. 210 | use_depthwise: Whether to use depthwise convolutions. UNUSED currently. 
211 | override_base_feature_extractor_hyperparams: Whether to override 212 | hyperparameters of the base feature extractor with the one from 213 | `conv_hyperparams_fn`. 214 | """ 215 | super(RetinaNet101FeatureExtractor, self).__init__( 216 | is_training=is_training, 217 | depth_multiplier=depth_multiplier, 218 | min_depth=min_depth, 219 | conv_hyperparams_fn=conv_hyperparams_fn, 220 | pad_to_multiple=pad_to_multiple, 221 | backbone='resnet101', 222 | fpn_scope_name='retinanet101', 223 | additional_layer_depth=additional_layer_depth, 224 | reuse_weights=reuse_weights, 225 | use_explicit_padding=use_explicit_padding, 226 | use_depthwise=use_depthwise, 227 | override_base_feature_extractor_hyperparams= 228 | override_base_feature_extractor_hyperparams) -------------------------------------------------------------------------------- /train.sh: -------------------------------------------------------------------------------- 1 | export PYTHONPATH="$PYTHONPATH:/PATH/TO/models/research:/PATH/TO/models/research/slim" 2 | 3 | python3 model_main.py \ 4 | --model_dir="train" --pipeline_config_path="retinanet_50_train.config" --------------------------------------------------------------------------------
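As an optional pre-flight check before launching `train.sh`, the sketch below builds the raw RetinaNet graph from `retinanet.py` on a dummy batch and prints the prediction tensors. This is an illustrative sketch only (not a file in this repo); it assumes TF 1.x and that `retinanet.py` has been copied to `object_detection/models/` as described in the README:

```python3
# Hypothetical smoke test, not included in this repository.
import tensorflow as tf
from object_detection.models.retinanet import retinanet

# 320x320 matches the fixed_shape_resizer in retinanet_50_train.config.
images = tf.placeholder(tf.float32, shape=[1, 320, 320, 3])

# 3 aspect ratios x 2 scales per octave = 6 anchors per location, per the
# multiscale_anchor_generator block in the config.
predictions = retinanet(images,
                        num_classes=90,
                        num_anchors_per_loc=6,
                        resnet_arch='resnet50',
                        is_training=False)

print(predictions['box_pred'])   # shape (1, num_anchors, 4)
print(predictions['cls_pred'])   # shape (1, num_anchors, 91): classes + background
```

If the graph builds and the shapes look right, the feature extractor should also construct cleanly when `model_main.py` builds the full SSD meta-architecture.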