├── .gitignore
├── .gitmodules
├── README.md
├── config.py
├── data_loader.py
├── examples
│   ├── 2347567.jpg
│   └── 2405273.jpg
├── main.py
├── networks
│   ├── __init__.py
│   ├── base.py
│   ├── base_net
│   │   ├── __init__.py
│   │   └── net.py
│   ├── image_feat_net
│   │   ├── __init__.py
│   │   ├── net.py
│   │   ├── resnet101
│   │   │   ├── __init__.py
│   │   │   └── net.py
│   │   ├── roi_pooling_layer
│   │   │   ├── __init__.py
│   │   │   ├── roi_pooling.so
│   │   │   ├── roi_pooling_op.cc
│   │   │   ├── roi_pooling_op.cu.o
│   │   │   ├── roi_pooling_op.py
│   │   │   ├── roi_pooling_op_gpu.cu.cc
│   │   │   ├── roi_pooling_op_gpu.h
│   │   │   ├── roi_pooling_op_grad.py
│   │   │   ├── roi_pooling_op_test.py
│   │   │   └── work_sharder.h
│   │   └── vgg16
│   │       ├── __init__.py
│   │       └── net.py
│   ├── net_wrapper.py
│   ├── pair_net
│   │   ├── __init__.py
│   │   └── net.py
│   └── text_feat_net
│       ├── __init__.py
│       └── net.py
├── requirements.txt
├── setup.sh
├── test.py
└── utils.py

/.gitignore:
--------------------------------------------------------------------------------
*.npy
**.png
**.mat
**.swn
**.swp
test.jpg
train/*
**.tar.gz
**.swo
**/__pycache__/**
checkpoints/*
visualization/*
matlab_model/*
nlvd_evaluation/*
**.pyc
**.pkl
**.json
scripts/*
edge_boxes_with_python/*

--------------------------------------------------------------------------------
/.gitmodules:
--------------------------------------------------------------------------------
[submodule "nlvd_evaluation"]
	path = nlvd_evaluation
	url = https://github.com/YutingZhang/nlvd_evaluation.git

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# TensorFlow Implementation of DBNet

This repository is the **TensorFlow** implementation of DBNet, a method for localizing and detecting visual entities with natural language queries. DBNet is proposed in the following paper:

**[Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries](https://arxiv.org/abs/1704.03944)**,<br>
[Yuting Zhang](http://www.ytzhang.net/), Luyao Yuan, Yijie Guo, Zhiyuan He, I-An Huang, [Honglak Lee](https://web.eecs.umich.edu/~honglak/index.html)<br>
In *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2017. **spotlight**

Remarks:

- The results in the above paper are obtained with the Caffe+MATLAB implementation, which is available at https://github.com/YutingZhang/dbnet-caffe-matlab
- This repository uses the evaluation protocol published together with the above paper, the implementation of which is at https://github.com/YutingZhang/nlvd_evaluation . It has been included here as a git submodule (see below for instructions on cloning submodules).

## How to clone this repository

**This git repository has submodules.** Please use the following command to clone it:

    git clone --recursive https://github.com/yuanluya/nldet_TensorFlow

If you have cloned the repository without the `--recursive` flag, you can run `git submodule update --init --recursive` in your local repository folder.

The evaluation submodule requires additional setup steps. Please refer to [`./nlvd_evaluation/README.md`](https://github.com/YutingZhang/nlvd_evaluation).

## Detection examples

Here are two detection examples:

![Detection example on image 2347567](examples/2347567.jpg)

![Detection example on image 2405273](examples/2405273.jpg)

## Introduction to DBNet

DBNet is a two-pathway deep neural network framework. It uses two separate pathways to extract visual and linguistic features, and a discriminative network to compute the matching score between an image region and a text phrase. DBNet is trained as a classifier with extensive use of negative samples. The training objective encourages better localization on single images, incorporates a broad range of text phrases, and properly pairs image regions with text phrases into positive and negative examples.

For more details about DBNet, please refer to [the paper](https://arxiv.org/abs/1704.03944).

## Prerequisites

* Python 3.3+
* [TensorFlow](https://www.TensorFlow.org/install/install_linux) 1.x: `pip3 install --user tensorflow-gpu`
* Python-OpenCV 3.2.0: `pip3 install --user opencv-python`
* [Pyx 0.14.1](http://pyx.sourceforge.net/): `pip3 install --user pyx`
* [PIL 4.0.0](http://pillow.readthedocs.io/en/3.4.x/index.html): `pip3 install --user Pillow`

If you have admin/root access to your workstation, you can remove `--user` and use `sudo` to install these packages into the system folder.

## What is included

- Demo using pretrained models (detection and visualization on individual images)
- Training code
- Evaluation code

## Data to download

- The [Visual Genome Images](http://visualgenome.org/api/v0/api_home.html) dataset.
- [Spell-corrected text annotations](http://www.ytzhang.net/files/dbnet/data/vg_v1_json.tar.gz) for Visual Genome.
    - *Remark:* if you have set up the evaluation toolbox in `./nlvd_evaluation`, the above data should be available already. You will only need to update the data paths in the configuration file.
- [Cached EdgeBoxes](http://www.ytzhang.net/files/dbnet/data/vg-v1-edgebox.tar.gz) for the Visual Genome images (a download sketch follows this list).
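For convenience, a minimal download sketch (illustrative only; the `data/` target directory is an assumption, so extract wherever you like and update the paths in `config.py`):

    mkdir -p data
    wget http://www.ytzhang.net/files/dbnet/data/vg_v1_json.tar.gz
    wget http://www.ytzhang.net/files/dbnet/data/vg-v1-edgebox.tar.gz
    tar -xzf vg_v1_json.tar.gz -C data/
    tar -xzf vg-v1-edgebox.tar.gz -C data/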
## Pretrained Models

- VGGNet-16 and ResNet-101 Faster R-CNN models pretrained on PASCAL VOC
- Our pretrained VGGNet-16 and ResNet-101 based DBNet models.

The pretrained models can be obtained via [this link](http://www.ytzhang.net/files/dbnet/tensorflow/dbnet-pretrained.tar.gz). This model was trained from scratch according to the procedure below, and it slightly outperforms the model used in the paper. Its evaluation results are summarized as follows (recall, gAP, and mAP are percentages; the top-overlap statistics are IoU values).

- Localization

| IoU Threshold | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Recall | 56.6 | 47.8 | 40.1 | 32.4 | 25.0 | 17.6 | 10.7 |

| Top Overlap Median | Top Overlap Mean |
| --- | --- |
| 0.174 | 0.270 |

- Detection for the level-0 query set:

| IoU Threshold | gAP | mAP |
| --- | --- | --- |
| 0.3 | 25.3 | 49.8 |
| 0.5 | 12.3 | 31.4 |
| 0.7 | 2.6 | 12.4 |

- Detection for the level-1 query set:

| IoU Threshold | gAP | mAP |
| --- | --- | --- |
| 0.3 | 22.8 | 46.7 |
| 0.5 | 11.2 | 29.7 |
| 0.7 | 2.4 | 12.0 |

- Detection for the level-2 query set:

| IoU Threshold | gAP | mAP |
| --- | --- | --- |
| 0.3 | 9.6 | 28.4 |
| 0.5 | 5.0 | 19.0 |
| 0.7 | 1.2 | 8.2 |

## Code Overview

- `main.py`: the main entry point to run experiments and evaluate the model.
- `test.py`: utilities for the test phase.
- `utils.py`: utility functions for feeding data to the neural networks.
- `config.py`: definitions of the model hyperparameters, and training and testing configurations.
- `networks`: the subfolder for the model files.

## Usage

You can use `python3 main.py` to run our code with the default configuration; see `config.py` for detailed configuration definitions. You can override the defaults in `config.py` by passing the corresponding arguments to `main.py` (see the examples later in this section).

### Detecting and Visualizing on Sample Images

#### Demo on Images from Visual Genome (Quick Demo)

You can run a quick demo on Visual Genome images with a user-specified query.

- Download the Visual Genome images, text annotations, EdgeBox cache, and our pretrained model.
- Set up `ENV_PATHS.IMAGE_PATH` and `ENV_PATHS.EDGEBOX_PATH` accordingly in `config.py`.
- Put the pretrained model in the directory defined in `config.py`.
- Create a JSON file with a list of the image ids (just the numbers) you want to run detection on, and set `--IMAGE_LIST_DIR` in `config.py` to this file's path (see the example after this list).
- After that, run

      python3 main.py --MODE vis --PHRASE_INPUT 'your text query .'

- You should be able to view the visualization of the detection results in the `visualization` folder created under the root of the project.
- Make sure you have [Pyx](http://pyx.sourceforge.net/) and [PIL](http://pillow.readthedocs.io/en/3.4.x/index.html) installed to draw the results. `pdflatex` is also needed.
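For reference, a minimal image-list file might look like this (the two ids below correspond to the images bundled in `examples/`; any Visual Genome image ids work):

    [2347567, 2405273]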
#### Demo on Other Images

To perform detection on non-Visual-Genome images, an external region proposal method is needed. Our code supports EdgeBox. You can download [the EdgeBox python interface](https://github.com/dculibrk/edge_boxes_with_python) to the repository root and run our code. Please make sure that `ENV_PATHS.EDGE_BOX_RPN` points to the location of `edge_boxes.py`. The test procedure is the same as testing on Visual Genome images, except that you will need to use **absolute paths** in the JSON file, rather than image ids, to list the test images.

### Training DBNet

1. Download images from [the Visual Genome website](http://visualgenome.org/api/v0/api_home.html) and our spell-checked text annotations.
2. Change `config.py` according to your data paths.
3. Either download our [trained model](#pretrained-models) for finetuning, or perform training from scratch.
4. To finetune a pretrained model, please download it and make sure `config.py` has the correct paths to the two `.npy` files (one for the image pathway, and the other for the text pathway).

#### Training from Scratch

To train from scratch, we recommend using a [Faster R-CNN](https://arxiv.org/abs/1506.01497) model to initialize the image pathway and randomly initializing the text pathway with our default parameters. After that, the DBNet model can be trained in 3 phases.

- Phase 1: Fix the image pathway and use the base learning rate for the text pathway until the loss converges.

  `python3 main.py --PHASE phase1 --text_lr 1e-4 --image_lr_conv 0 --image_lr_region 0 --IMAGE_FINE_TUNE_MODEL frcnn_Region_Feat_Net.npy --TEXT_FINE_TUNE_MODEL XXX.npy --MAX_ITERS 50000`

- Phase 2: Tune both pathways together without changing the base learning rate. To try out other configurations, please change `config.py`.

  `python3 main.py --PHASE phase2 --text_lr 1e-4 --image_lr_conv 1e-3 --image_lr_region 1e-3 --INIT_SNAPSHOT phase1 --INIT_ITER 50000 --MAX_ITERS 150000`

- Phase 3: Decrease the learning rates for all pathways by a factor of 10 and train the model further.

  `python3 main.py --PHASE phase3 --INIT_SNAPSHOT phase2 --INIT_ITER 200000 --MAX_ITERS 100000`

Model snapshots will be saved every `--SAVE_ITERS` iterations to `--SNAPSHOT_DIR`. Snapshots are named `nldet_[PHASE]_[ITER]`.

### Benchmarking on Visual Genome

To test with a pretrained model, you can place the `.npy` files in the default directory and run `python3 main.py --MODE test`. To test TensorFlow models trained from scratch, please change the `--INIT_SNAPSHOT` and `--INIT_ITER` flags accordingly.

The detection results will be saved in a subfolder `tmp_output` under the directory `nlvd_evaluation/results/vg_v1/dbnet_[IMAGE MODEL]/` in the `nlvd_evaluation` submodule. `IMAGE MODEL` refers to the model used in the image pathway and can be set by the `--IMAGE_MODEL` flag in `config.py`; by default it is `vgg16`, and our model also supports `resnet101`. These temporary results will be merged together and saved in a `.txt` file, which can be used by our evaluation code directly. As long as the results in `tmp_output` are saved, the testing process can be resumed at any time. Change the `--LEVEL` flag in `config.py` to perform the three-level tests in the paper.

`python3 main.py --MODE test --LEVEL level_0 --INIT_SNAPSHOT phase3 --INIT_ITER 300000`

### Evaluation

The evaluation and dataset development code is cloned from the [nlvd_evaluation](https://github.com/YutingZhang/nlvd_evaluation) repository as a submodule of this code. You can refer to [this page](https://github.com/YutingZhang/nlvd_evaluation/tree/master/evaluation) for more detailed instructions on how to compute the performance metrics.

## Contributors

This repository is mainly contributed by [Luyao Yuan](https://github.com/yuanluya) and [Binghao Deng](https://github.com/bhdeng). The evaluation code is provided by [Yijie Guo](https://github.com/guoyijie).
--------------------------------------------------------------------------------
/config.py:
--------------------------------------------------------------------------------
""" Configuration for the network
"""
import os
import os.path as osp
import sys
from easydict import EasyDict as edict
import tensorflow as tf

###############################################
#set global configuration for network training#
###############################################
flags = tf.app.flags
FLAGS = flags.FLAGS
#model hyperparameters
flags.DEFINE_string('IMAGE_MODEL', 'vgg16', 'which network to use for the image pathway: vgg16|resnet101')
flags.DEFINE_float('text_lr', 1e-5, 'learning rate for the text pathway')
flags.DEFINE_float('image_lr_conv', 1e-4, 'learning rate for the conv layers of the image pathway')
flags.DEFINE_float('image_lr_region', 1e-4, 'learning rate for the region (post-RoI) layers of the image pathway')
flags.DEFINE_integer('batch_size', 2, 'number of images in a batch sent to the network')
flags.DEFINE_integer('pair_net_batch_size', 128, 'number of image-text pairs sent to the pair net in a sub-batch')
#device layout
flags.DEFINE_integer('DEVICE_NUM', 0, 'GPU device ID')
flags.DEFINE_integer('NUM_PROCESSORS', 4, 'number of processes for data loading')
flags.DEFINE_integer('DATA_LOADER_CAPACITY', 10, 'maximum number of batches kept in the data loader queue')
#training mode
flags.DEFINE_string('MODE', "train", 'train|test|val|vis')
flags.DEFINE_boolean('DEBUG', False, 'whether to run in TensorFlow debug mode')
flags.DEFINE_string('PHASE', 'phase1', 'phase1|phase2|phase3')
flags.DEFINE_string('IMAGE_FINE_TUNE_MODEL', 'Region_Feat_Net.npy',
                    'path relative to networks/image_feat_net/[IMAGE_MODEL]/net.py; depends on the choice of --IMAGE_MODEL')
flags.DEFINE_string('TEXT_FINE_TUNE_MODEL', 'vgg16_Text_Feat_Net.npy', 'path relative to networks/text_feat_net/net.py')
flags.DEFINE_string('INIT_SNAPSHOT', 'phase1', 'init training from which phase')
flags.DEFINE_integer('INIT_ITER', 0, 'init training from which iteration, together with INIT_SNAPSHOT')
flags.DEFINE_string('SNAPSHOT_DIR', 'checkpoints', 'directory to save model snapshots')
flags.DEFINE_boolean('RESTORE_ALL', False, 'restore all model variables (including optimizer state such as momentum)')
#test configs if testing
flags.DEFINE_string('LEVEL', 'level_0', 'level_0|level_1|level_2')
flags.DEFINE_integer('TOP_NUM_RPN', 500, 'apply NMS to the top-k boxes ranked by prediction score')
flags.DEFINE_boolean('INCLUDE_GT_BOX', False, 'include ground-truth boxes in the final test boxes')
#visualization output
flags.DEFINE_string('VIS_DIR', 'visualization', 'directory to save detection examples')
flags.DEFINE_string('PHRASE_INPUT', 'A man in red.', 'query phrase for detection')
flags.DEFINE_string('IMAGE_LIST_DIR', 'image_examples.json', 'a file specifying which images to visualize: '
                    'image ids for images in Visual Genome, otherwise absolute paths of the images')
flags.DEFINE_integer('VIS_NUM', 3, 'how many of the top detected regions to draw')
#training infos
flags.DEFINE_integer('MAX_ITERS', float('inf'), 'maximum number of training iterations')
flags.DEFINE_integer('PRINT_ITERS', 1, 'print training info every PRINT_ITERS iterations')
flags.DEFINE_integer('SAVE_ITERS', 2000, 'frequency of saving checkpoints')

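# Usage note (sketch): any flag above can be overridden from the command line,
# e.g. `python3 main.py --IMAGE_MODEL resnet101 --batch_size 4`; the parsed
# values are read throughout the code via `from config import FLAGS`.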
###############################################
# set global configuration for data reading  #
###############################################
DATA_PATH = osp.abspath(osp.join(osp.dirname(__file__), 'data'))
ENV_PATHS = edict()

# need to be moved to data path
ENV_PATHS.IMAGE_PATH = '/mnt/brain3/datasets/VisualGenome/images'
ENV_PATHS.EDGEBOX_PATH = '/mnt/brain2/scratch/yutingzh/object-det-cache/nldet_cache/region_proposal_cache/vg/edgebox'
ENV_PATHS.EDGE_BOX_RPN = '/mnt/brain1/scratch/yuanluya/nldet_tensorflow/edge_boxes_with_python'
ENV_PATHS.RAW_DATA = osp.abspath(osp.join(DATA_PATH, 'region_description.json'))
ENV_PATHS.METEOR = osp.abspath(osp.join(DATA_PATH, 'meteor.json')) #upper triangular matrix
ENV_PATHS.FREQUENCY = osp.abspath(osp.join(DATA_PATH, 'freq.json'))
ENV_PATHS.SPLIT = osp.abspath(osp.join(DATA_PATH, 'densecap_splits.json'))
ENV_PATHS.LEVEL1_TEST = osp.abspath(osp.join(DATA_PATH, 'level1_im2p.json'))
ENV_PATHS.LEVEL2_TEST = osp.abspath(osp.join(DATA_PATH, 'level2_im2p.json'))

###############################################
# set global configuration for data sampling  #
###############################################
DS_CONFIG = edict()

DS_CONFIG.thre_neg = 0.1
DS_CONFIG.thre_pos = 0.9
DS_CONFIG.pos_loss_weight = 1
DS_CONFIG.neg_loss_weight = 1
DS_CONFIG.rest_loss_weight = 1
DS_CONFIG.meteor_thred = 0.3
DS_CONFIG.text_tensor_sequence_length = 256
DS_CONFIG.text_rand_sample_size = 100
DS_CONFIG.target_size = 600
DS_CONFIG.max_size = 1000
DS_CONFIG.edge_box_high_rank_num = 100
DS_CONFIG.edge_box_random_num = 50

--------------------------------------------------------------------------------
/data_loader.py:
--------------------------------------------------------------------------------
import tensorflow as tf
from multiprocessing import Condition, Lock, Process, Manager
import random
#from utils import train_ids, test_ids, get_data
from utils import train_ids, test_ids, get_data
import pdb


class DataLoader:
    """ Class for loading data
    Attributes:
        num_processor: an integer indicating the number of processes
            for loading the data, normally 4 is enough
        capacity: an integer indicating the capacity of the data load
            queue, default set to 10
        batch_size: an integer indicating the batch size for each
            extraction from the data load queue
        phase: a string indicating the phase of the data loading process,
            can only be 'train' or 'test'
    """
    def __init__(self, num_processor, batch_size, phase,
                 batch_idx_init = 0, data_ids_init = train_ids, capacity = 10):
        self.num_processor = num_processor
        self.batch_size = batch_size
        self.data_load_capacity = capacity
        self.manager = Manager()
        self.batch_lock = Lock()
        self.mutex = Lock()
        self.cv_full = Condition(self.mutex)
        self.cv_empty = Condition(self.mutex)
        self.data_load_queue = self.manager.list()
        self.cur_batch = self.manager.list([batch_idx_init])
        self.processors = []
        if phase == 'train':
            self.data_ids = self.manager.list(data_ids_init)
        elif phase == 'test':
            self.data_ids = self.manager.list(test_ids)
        else:
            raise ValueError('Could not set phase to %s' % phase)

    def __load__(self):
        while True:
            image_dicts = []
            self.batch_lock.acquire()
            image_ids = self.data_ids[self.cur_batch[0] * self.batch_size :
                                      (self.cur_batch[0] + 1) * self.batch_size]
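            # Note (added): the shared batch cursor advances under batch_lock;
            # when the next slice would run past the end of data_ids, the cursor
            # wraps to 0 and the id list is reshuffled to start a new epoch.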
            self.cur_batch[0] += 1
            if (self.cur_batch[0] + 1) * self.batch_size >= len(self.data_ids):
                self.cur_batch[0] = 0
                random.shuffle(self.data_ids)
            self.batch_lock.release()

            data = get_data(image_ids)

            self.cv_full.acquire()
            # use a loop rather than a single check: condition waits can wake
            # spuriously, and another worker may have refilled the queue
            while len(self.data_load_queue) > self.data_load_capacity:
                self.cv_full.wait()
            self.data_load_queue.append(data)
            self.cv_empty.notify()
            self.cv_full.release()

    def start(self):
        for _ in range(self.num_processor):
            p = Process(target = self.__load__)
            p.start()
            self.processors.append(p)

    def get_batch(self):
        self.cv_empty.acquire()
        while len(self.data_load_queue) == 0:
            self.cv_empty.wait()
        batch_data = self.data_load_queue.pop()
        self.cv_full.notify()
        self.cv_empty.release()
        return batch_data

    def get_status(self):
        self.batch_lock.acquire()
        current_cur_batch = self.cur_batch[0]
        current_data_ids = self.data_ids
        self.batch_lock.release()
        return {'batch_idx': int(current_cur_batch), 'data_ids': list(current_data_ids)}

    def stop(self):
        for p in self.processors:
            p.terminate()
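# Usage sketch (illustrative; mirrors how main.py drives the loader):
#     loader = DataLoader(FLAGS.NUM_PROCESSORS, FLAGS.batch_size, 'train',
#                         capacity = FLAGS.DATA_LOADER_CAPACITY)
#     loader.start()               # spawn worker processes
#     batch = loader.get_batch()   # blocks until a batch is available
#     loader.stop()                # terminate the workers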
--------------------------------------------------------------------------------
/examples/2347567.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuanluya/dbnet_tensorflow/991bba69fed28f96dd289bd393d566098f3185c2/examples/2347567.jpg

--------------------------------------------------------------------------------
/examples/2405273.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuanluya/dbnet_tensorflow/991bba69fed28f96dd289bd393d566098f3185c2/examples/2405273.jpg

--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
#!/usr/bin/python3
import tensorflow as tf
from tensorflow.python import debug as tf_debug
from tensorflow.core.protobuf import config_pb2
import json
import os
from config import FLAGS
from networks.net_wrapper import NetWrapper
from data_loader import DataLoader
from test import test
from utils import val_ids, test_ids, visualize

import pdb

def step(net, loader):
    batch_data = loader.get_batch()
    net.set_input(batch_data)
    net.forward_backward()

def main(_):
    sess = tf.Session(config = tf.ConfigProto(allow_soft_placement = True,
                                              log_device_placement = False))

    #declare networks
    with tf.device('/gpu:%d' % FLAGS.DEVICE_NUM):
        net = NetWrapper(sess, FLAGS.IMAGE_MODEL, FLAGS.image_lr_conv, FLAGS.image_lr_region, FLAGS.text_lr,
                         FLAGS.pair_net_batch_size, FLAGS.MODE,
                         FLAGS.IMAGE_FINE_TUNE_MODEL, FLAGS.TEXT_FINE_TUNE_MODEL)
        net.build()

    if FLAGS.DEBUG:
        sess = tf_debug.LocalCLIDebugWrapperSession(sess)
        sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)

    net.text_net.sess = sess
    init = tf.global_variables_initializer()
    sess.run(init)

    #train_writer = tf.summary.FileWriter('.' + '/train', sess.graph)
    #restore network
    if FLAGS.RESTORE_ALL:
        restore = []
    else:
        restore = net.varlist

    if net.load(sess, FLAGS.SNAPSHOT_DIR, 'nldet_%s_%d' % (FLAGS.INIT_SNAPSHOT, FLAGS.INIT_ITER), restore):
        print('[INIT] Successfully loaded model from %s_%d' % (FLAGS.INIT_SNAPSHOT, FLAGS.INIT_ITER))
    elif FLAGS.MODE == 'test':
        print('[INIT] No TensorFlow model found for %s, testing from initialization' % FLAGS.INIT_SNAPSHOT)
    else:
        print('[INIT] No TensorFlow model found for %s, training from scratch' % FLAGS.INIT_SNAPSHOT)

    if FLAGS.MODE == "train":
        resume_status = None
        status_dir = '%s/nldet_%s_%d/nldet_status_%s_%d.json' %\
                     (FLAGS.SNAPSHOT_DIR, FLAGS.INIT_SNAPSHOT, FLAGS.INIT_ITER, FLAGS.INIT_SNAPSHOT, FLAGS.INIT_ITER)
        if os.path.exists(status_dir):
            resume_status = json.load(open(status_dir, 'r'))
            print('resume from %s' % status_dir)
        else:
            print('no resume data loader status found')

        # initialize data loader
        if resume_status is None:
            loader = DataLoader(FLAGS.NUM_PROCESSORS, FLAGS.batch_size, FLAGS.MODE, capacity = FLAGS.DATA_LOADER_CAPACITY)
        else:
            loader = DataLoader(FLAGS.NUM_PROCESSORS, FLAGS.batch_size, FLAGS.MODE,
                                resume_status['batch_idx'], resume_status['data_ids'], FLAGS.DATA_LOADER_CAPACITY)
        loader.start()

        current_iter = FLAGS.INIT_ITER + 1
        while current_iter <= FLAGS.MAX_ITERS:
            step(net, loader)
            if current_iter % FLAGS.PRINT_ITERS == 0:
                net.get_output(current_iter)
            if current_iter % FLAGS.SAVE_ITERS == 0:
                net.save(sess, FLAGS.SNAPSHOT_DIR, 'nldet_%s_%d' % (FLAGS.PHASE, current_iter))
                saving_status = loader.get_status()
                json.dump(saving_status, open('%s/nldet_%s_%d/nldet_status_%s_%d.json' % \
                          (FLAGS.SNAPSHOT_DIR, FLAGS.PHASE, current_iter, FLAGS.PHASE, current_iter), 'w'))
                print('save data loader status to nldet_status_%s_%d.json' % (FLAGS.PHASE, current_iter))
            current_iter += 1
        loader.stop()

    elif FLAGS.MODE == "test" or FLAGS.MODE == 'val':
        if FLAGS.MODE == 'test':
            traverse_ids = test_ids
        else:
            traverse_ids = val_ids
            if FLAGS.LEVEL != 'level_0':
                print('Validation set only supports level-0')
                return
        for idx, tid in enumerate(traverse_ids):
            print('[%d/%d]' % (idx + 1, len(traverse_ids)))
            result_dir = "nlvd_evaluation/results/vg_v1/dbnet_%s" % FLAGS.IMAGE_MODEL
            if os.path.exists('%s/tmp_output/%s_%d.txt' % (result_dir, FLAGS.LEVEL, tid)):
                print('FOUND EXISTING RESULT')
                continue
            test(net, tid, FLAGS.LEVEL, result_dir, top_num = FLAGS.TOP_NUM_RPN, gt_box = FLAGS.INCLUDE_GT_BOX)
        os.system('cat %s/tmp_output/%s* > %s/%s.txt' % (result_dir, FLAGS.LEVEL, result_dir, FLAGS.LEVEL))

    elif FLAGS.MODE == "vis":
        im_ids = json.load(open(FLAGS.IMAGE_LIST_DIR, 'r'))
        os.makedirs(FLAGS.VIS_DIR, exist_ok = True)
        for idx, im_id in enumerate(im_ids):
            detection_result = test(net, im_id, 'vis', None,
                                    top_num = FLAGS.TOP_NUM_RPN, query_phrase = [FLAGS.PHRASE_INPUT])
            visualize(im_id, detection_result, FLAGS.VIS_NUM, FLAGS.PHRASE_INPUT,
                      os.path.join(FLAGS.VIS_DIR, 'vis_' + str(idx + 1)))
    return

if __name__ == '__main__':
    tf.app.run()

--------------------------------------------------------------------------------
/networks/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuanluya/dbnet_tensorflow/991bba69fed28f96dd289bd393d566098f3185c2/networks/__init__.py -------------------------------------------------------------------------------- /networks/base.py: -------------------------------------------------------------------------------- 1 | import os 2 | from glob import glob 3 | import tensorflow as tf 4 | 5 | class Model(object): 6 | """Abstract object representing an Reader model.""" 7 | def __init__(self): 8 | return 9 | 10 | def save(self, sess, checkpoint_dir, dataset_name): 11 | self.saver = tf.train.Saver() 12 | 13 | print(" [*] Saving checkpoints...") 14 | model_name = type(self).__name__ or "Reader" 15 | model_dir = dataset_name 16 | 17 | checkpoint_dir = os.path.join(checkpoint_dir, model_dir) 18 | if not os.path.exists(checkpoint_dir): 19 | os.makedirs(checkpoint_dir) 20 | self.saver.save(sess, os.path.join(checkpoint_dir, model_name)) 21 | 22 | def load(self, sess, checkpoint_dir, dataset_name, load_var): 23 | 24 | all_vars = tf.global_variables() 25 | if len(load_var) == 0: 26 | restore_vars = all_vars 27 | else: 28 | restore_vars = [var for var in all_vars if var in load_var] 29 | self.saver = tf.train.Saver(restore_vars) 30 | print(" [*] Loading checkpoints...") 31 | print(dataset_name) 32 | model_dir = dataset_name 33 | checkpoint_dir = os.path.join(checkpoint_dir, model_dir) 34 | 35 | ckpt = tf.train.get_checkpoint_state(checkpoint_dir) 36 | if ckpt and ckpt.model_checkpoint_path: 37 | ckpt_name = os.path.basename(ckpt.model_checkpoint_path) 38 | self.saver.restore(sess, os.path.join(checkpoint_dir, ckpt_name)) 39 | return True 40 | else: 41 | return False 42 | 43 | -------------------------------------------------------------------------------- /networks/base_net/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuanluya/dbnet_tensorflow/991bba69fed28f96dd289bd393d566098f3185c2/networks/base_net/__init__.py -------------------------------------------------------------------------------- /networks/base_net/net.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import os 3 | import tensorflow as tf 4 | import pdb 5 | 6 | class BaseNet: 7 | """ Class for basic net operations and structure 8 | """ 9 | def __init__(self, sess): 10 | self.sess = sess 11 | self.ys = [] 12 | self.xs = [] 13 | self.grad_ys = None 14 | self.gradients_pool = {} 15 | self.average_grads = {} 16 | self.p_grads = [] 17 | self.p_batch_sizes = 0.0 18 | self.current_batch_size = None 19 | 20 | #placeholders required outside for gradient accumulate 21 | self.batch_sizes = tf.placeholder(tf.float32, name = 'subbatch_sizes') 22 | self.batch_num = tf.placeholder(tf.int32, name = 'subbatch_nums') 23 | 24 | def accumulate(self): 25 | #accumulate gradients 26 | self.gradients = tf.gradients(self.ys, self.xs, grad_ys = self.grad_ys) 27 | 28 | for idx, var in enumerate(self.xs): 29 | self.gradients_pool[var] = tf.Variable(initial_value = np.zeros(1), 30 | validate_shape = False, 31 | trainable = False, 32 | dtype = tf.float32) 33 | 34 | def first_grad(): 35 | ops = [] 36 | for idx, var in enumerate(self.xs): 37 | ops.append(tf.assign(self.gradients_pool[var], self.gradients[idx], 38 | validate_shape = False)) 39 | with tf.control_dependencies(ops): 40 | return tf.no_op() 41 | 42 | def normal_grad(): 43 | ops = [] 44 | for idx, var in enumerate(self.xs): 45 | ops.append(tf.assign_add(self.gradients_pool[var], 
self.gradients[idx])) 46 | with tf.control_dependencies(ops): 47 | return tf.no_op() 48 | 49 | #flow_control_list = [tf.contrib.framework. 50 | # convert_to_tensor_or_sparse_tensor(grad) 51 | # for grad in self.gradients] 52 | #with tf.control_dependencies(flow_control_list): 53 | self.accumulate_grad = tf.cond(tf.equal(self.batch_num, 0), 54 | first_grad, normal_grad) 55 | 56 | #calculate final gradients for this batch 57 | for var in self.xs: 58 | self.average_grads[var] = self.gradients_pool[var] 59 | # tf.div(self.gradients_pool[var], 60 | # self.batch_sizes)) 61 | 62 | def backward(self, get_cpu_array = False): 63 | if get_cpu_array: 64 | self.p_grads = self.sess.run(list(self.average_grads.values()), 65 | feed_dict = {self.batch_sizes: 66 | self.p_batch_sizes}) 67 | else: 68 | self.sess.run(list(self.average_grads.values()), 69 | feed_dict = {self.batch_sizes: 70 | self.p_batch_sizes}) 71 | self.p_batch_sizes = 0.0 72 | return 73 | 74 | def get_input_gradients(self): 75 | return self.p_grads 76 | -------------------------------------------------------------------------------- /networks/image_feat_net/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuanluya/dbnet_tensorflow/991bba69fed28f96dd289bd393d566098f3185c2/networks/image_feat_net/__init__.py -------------------------------------------------------------------------------- /networks/image_feat_net/net.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import os 4 | import inspect 5 | import pdb 6 | 7 | class ImageFeatNet: 8 | """ Network model for Image Feat Net 9 | Attributes: 10 | sess: 11 | opt: 12 | max_batch_size: 13 | train: 14 | RegionNet_npy_path: 15 | """ 16 | def __init__(self, sess, lr_conv, lr_region, max_batch_size = 32, train = True): 17 | self.lr_conv = lr_conv 18 | self.lr_region = lr_region 19 | self.opt_conv = tf.train.MomentumOptimizer(self.lr_conv, 0.9) 20 | self.opt_region = tf.train.MomentumOptimizer(self.lr_region, 0.9) 21 | #model hyperparameters 22 | self.sess = sess 23 | self.max_batch_size = max_batch_size 24 | self.train = train 25 | 26 | #physical inputs should be numpy arrays 27 | self.images = tf.placeholder(tf.float32, shape = [None, None, None, 3], 28 | name = 'image_inputs') 29 | #[batch_idx, xmin, ymin, xmax, ymax] 30 | self.rois = tf.placeholder(tf.float32, shape = [None, 5], 31 | name = 'roi_inputs') 32 | self.dropout_flag = tf.placeholder(tf.int32) 33 | self.p_images = None 34 | self.p_rois = None 35 | 36 | #physcial outputs 37 | self.p_region_feats = None 38 | 39 | def build(self, sub_net, output_grad = tf.placeholder(tf.float32), 40 | feature_dim = 4096, roi_size = 7, roi_scale = 0.0625, 41 | dropout_ratio = 0.3, weight_decay= 1e-4, batch_size = 16): 42 | 43 | #conv net base 44 | self.sub_net = sub_net 45 | self.output_grad = output_grad 46 | 47 | #optimization utility 48 | self.batch_size = batch_size 49 | assert self.batch_size < self.max_batch_size 50 | 51 | #model parameters 52 | self.parameters = {} 53 | self.parameters['feature_dim'] = feature_dim 54 | self.parameters['weight_decay'] = weight_decay 55 | self.parameters['dropout_ratio'] = dropout_ratio 56 | self.parameters['roi_size'] = roi_size 57 | self.parameters['roi_scale'] = roi_scale 58 | self.parameters['dropout_flag'] = self.dropout_flag 59 | 60 | ####################################################################### 61 | ######################## NETWORK 
STARTS ######################### 62 | ####################################################################### 63 | self.roi_features = self.sub_net.build(self.images, self.rois, self.parameters) 64 | self.output = tf.Variable(initial_value = 1.0, trainable = False, 65 | validate_shape = False, dtype = tf.float32) 66 | self.get_output = tf.assign(self.output, self.roi_features, 67 | validate_shape = False) 68 | 69 | #gather weight decays 70 | self.wd = tf.add_n(tf.get_collection('img_net_weight_decay'), 71 | name = 'img_net_total_weight_decay') 72 | if self.sub_net.net_type == 'Vgg16': 73 | self.extra_update = [tf.no_op()] 74 | elif self.sub_net.net_type == 'Resnet101': 75 | self.extra_update = tf.get_collection('resnet_update_ops') 76 | 77 | 78 | def accumulate(self): 79 | self.ys = [self.wd, self.roi_features] 80 | self.grad_ys = [1.0, self.output_grad] 81 | 82 | self.gradients_conv = tf.gradients(self.ys, self.sub_net.varlist_conv, grad_ys = self.grad_ys) 83 | self.gradients_region = tf.gradients(self.ys, self.sub_net.varlist_region, grad_ys = self.grad_ys) 84 | 85 | self.grad_and_vars_conv = [] 86 | self.grad_and_vars_region = [] 87 | 88 | for idx, var in enumerate(self.sub_net.varlist_conv): 89 | self.grad_and_vars_conv.append((self.gradients_conv[idx], var)) 90 | for idx, var in enumerate(self.sub_net.varlist_region): 91 | self.grad_and_vars_region.append((self.gradients_region[idx], var)) 92 | 93 | #apply gradients 94 | with tf.control_dependencies(self.gradients_conv + self.gradients_region): 95 | self.train_op = tf.group(self.opt_conv.apply_gradients(self.grad_and_vars_conv), 96 | self.opt_region.apply_gradients(self.grad_and_vars_region), *self.extra_update) 97 | 98 | def set_input(self, images, rois): 99 | self.p_images = images 100 | self.p_rois = rois 101 | 102 | def get_output(self): 103 | return self.p_roi_features 104 | 105 | def forward(self, physical_output = False): 106 | if physical_output: 107 | [self.p_roi_features] = self.sess.run([self.get_output], 108 | feed_dict = { 109 | self.images: self.p_images, 110 | self.rois: self.p_rois, 111 | self.dropout_flag: 1}) 112 | else: 113 | self.sess.run([self.get_output], 114 | feed_dict = { 115 | self.images: self.p_images, 116 | self.rois: self.p_rois, 117 | self.dropout_flag: 1}) 118 | return 119 | 120 | def backward(self): 121 | self.sess.run([self.train_op], 122 | feed_dict = { 123 | self.images: self.p_images, 124 | self.rois: self.p_rois, 125 | self.dropout_flag: 0}) 126 | return 127 | 128 | -------------------------------------------------------------------------------- /networks/image_feat_net/resnet101/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuanluya/dbnet_tensorflow/991bba69fed28f96dd289bd393d566098f3185c2/networks/image_feat_net/resnet101/__init__.py -------------------------------------------------------------------------------- /networks/image_feat_net/resnet101/net.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import inspect 3 | import os 4 | import tensorflow as tf 5 | from tensorflow.python.ops import control_flow_ops 6 | from tensorflow.python.training import moving_averages 7 | from ..roi_pooling_layer.roi_pooling_op import roi_pool 8 | from ..roi_pooling_layer.roi_pooling_op_grad import * 9 | import pdb 10 | 11 | MOVING_AVERAGE_DECAY = 0.9997 12 | BN_DECAY = MOVING_AVERAGE_DECAY 13 | BN_EPSILON = 1e-6 14 | CONV_WEIGHT_STDDEV = 0.1 15 | UPDATE_OPS_COLLECTION = 
'resnet_update_ops' # must be grouped with training op 16 | 17 | class Resnet101: 18 | def __init__(self, RegionNet_npy_path='frcnn_Region_Feat_Net.npy', train=True): 19 | #load saved model 20 | try: 21 | path = inspect.getfile(Resnet101) 22 | path = os.path.abspath(os.path.join(path, os.pardir)) 23 | RegionNet_npy_path = os.path.join(path, RegionNet_npy_path) 24 | self.data_dict = np.load(RegionNet_npy_path, encoding='latin1').item() 25 | print("Image Feat Net npy file loaded") 26 | except: 27 | print('[WARNING!]Image Feat Net npy file not found,' 28 | 'we don\'t recommend training this network from scratch') 29 | self.data_dict = {} 30 | self.is_training = train 31 | self.varlist_conv = [] 32 | self.varlist_region = [] 33 | self.net_type = 'Resnet101' 34 | self.activation = tf.nn.relu 35 | 36 | self.cvar = {} 37 | 38 | def build(self, bgr, rois, parameters, 39 | num_blocks=[3, 4, 23, 3], 40 | use_bias=False): 41 | 42 | self.bgr = bgr 43 | self.rois = rois 44 | self.weight_decay = parameters['weight_decay'] 45 | self.roi_size = parameters['roi_size'] 46 | self.roi_scale = parameters['roi_scale'] 47 | 48 | c = {} 49 | c['bottleneck'] = True 50 | c['is_training'] = tf.convert_to_tensor(self.is_training, 51 | dtype='bool', 52 | name='is_training') 53 | c['ksize'] = 3 54 | c['stride'] = 1 55 | c['use_bias'] = use_bias 56 | c['num_blocks'] = num_blocks 57 | c['stack_stride'] = 2 58 | 59 | with tf.variable_scope('scale1'): 60 | c['conv_filters_out'] = 64 61 | c['ksize'] = 7 62 | c['stride'] = 2 63 | self.conv1 = self.conv(tf.pad(self.bgr, 64 | [[0, 0], [3, 3], [3, 3], [0, 0]]), c, 'conv1', padding='VALID') 65 | self.bn_conv1 = self.bn(self.conv1, c, 'bn_conv1') 66 | self.scale1_feat = self.activation(self.bn_conv1) 67 | 68 | with tf.variable_scope('scale2'): 69 | self.scale1_pool = self._max_pool(tf.pad(self.scale1_feat, 70 | [[0, 0], [0, 1], [0, 1], [0, 0]]), ksize=3, stride=2) 71 | c['num_blocks'] = num_blocks[0] 72 | c['stack_stride'] = 1 73 | c['block_filters_internal'] = 64 74 | self.scale2_feat = self.stack(self.scale1_pool, c, '2') 75 | 76 | with tf.variable_scope('scale3'): 77 | c['num_blocks'] = num_blocks[1] 78 | c['block_filters_internal'] = 128 79 | c['stack_stride'] = 2 80 | self.scale3_feat = self.stack(self.scale2_feat, c, '3') 81 | 82 | with tf.variable_scope('scale4'): 83 | c['num_blocks'] = num_blocks[2] 84 | c['block_filters_internal'] = 256 85 | assert c['stack_stride'] == 2 86 | self.scale4_feat = self.stack(self.scale3_feat, c, '4') 87 | 88 | 89 | [self.rois_feat, _] = roi_pool(self.scale4_feat, self.rois, 90 | self.roi_size, self.roi_size, 91 | self.roi_scale) 92 | 93 | with tf.variable_scope('scale5'): 94 | c['num_blocks'] = num_blocks[3] 95 | c['block_filters_internal'] = 512 96 | assert c['stack_stride'] == 2 97 | self.scale5_feat = self.stack(self.rois_feat, c, '5', belong='region') 98 | 99 | # post-net 100 | self.final_feature = tf.reduce_mean(self.scale5_feat, reduction_indices=[1, 2], name="avg_pool") 101 | 102 | return self.final_feature 103 | 104 | def stack(self, x, c, stack_caffe_scale, belong='conv'): 105 | if c['num_blocks'] == 3: 106 | block_names = ['a', 'b', 'c'] 107 | else: 108 | block_names = ['a'] + ['b' + str(i + 1) for i in range(c['num_blocks'] - 1)] 109 | for n in range(c['num_blocks']): 110 | s = c['stack_stride'] if n == 0 else 1 111 | c['block_stride'] = s 112 | with tf.variable_scope('block%d' % (n + 1)): 113 | x = self.block(x, c, stack_caffe_scale+block_names[n], belong) 114 | return x 115 | 116 | 117 | def block(self, x, c, block_caffe_name, 
belong='conv'): 118 | filters_in = x.get_shape()[-1] 119 | 120 | # Note: filters_out isn't how many filters are outputed. 121 | # That is the case when bottleneck=False but when bottleneck is 122 | # True, filters_internal*4 filters are outputted. filters_internal is how many filters 123 | # the 3x3 convs output internally. 124 | m = 4 if c['bottleneck'] else 1 125 | filters_out = m * c['block_filters_internal'] 126 | 127 | shortcut = x # branch 1 128 | 129 | c['conv_filters_out'] = c['block_filters_internal'] 130 | 131 | with tf.variable_scope('a'): 132 | c['ksize'] = 1 133 | c['stride'] = c['block_stride'] 134 | x = self.conv(x, c, 'res'+block_caffe_name+'_branch2a', belong) 135 | self.cvar['res'+block_caffe_name+'_branch2a'] = x 136 | x = self.bn(x, c, 'bn'+block_caffe_name+'_branch2a', belong) 137 | self.cvar['bn'+block_caffe_name+'_branch2a'] = x 138 | x = self.activation(x) 139 | 140 | with tf.variable_scope('b'): 141 | c['ksize'] = 3 142 | c['stride'] = 1 143 | x = self.conv(x, c, 'res'+block_caffe_name+'_branch2b', belong) 144 | self.cvar['res'+block_caffe_name+'_branch2b'] = x 145 | x = self.bn(x, c, 'bn'+block_caffe_name+'_branch2b', belong) 146 | self.cvar['bn'+block_caffe_name+'_branch2b'] = x 147 | x = self.activation(x) 148 | 149 | with tf.variable_scope('c'): 150 | c['conv_filters_out'] = filters_out 151 | c['ksize'] = 1 152 | assert c['stride'] == 1 153 | x = self.conv(x, c, 'res'+block_caffe_name+'_branch2c', belong) 154 | self.cvar['res'+block_caffe_name+'_branch2c'] = x 155 | x = self.bn(x, c, 'bn'+block_caffe_name+'_branch2c', belong) 156 | self.cvar['bn'+block_caffe_name+'_branch2c'] = x 157 | 158 | with tf.variable_scope('shortcut'): 159 | if filters_out != filters_in or c['block_stride'] != 1: 160 | c['ksize'] = 1 161 | c['stride'] = c['block_stride'] 162 | c['conv_filters_out'] = filters_out 163 | shortcut = self.conv(shortcut, c, 'res'+block_caffe_name+'_branch1', belong) 164 | self.cvar['res'+block_caffe_name+'_branch1'] = shortcut 165 | shortcut = self.bn(shortcut, c, 'bn'+block_caffe_name+'_branch1', belong) 166 | self.cvar['bn'+block_caffe_name+'_branch1'] = shortcut 167 | 168 | return self.activation(x + shortcut) 169 | 170 | 171 | def bn(self, x, c, caffe_name, belong='conv'): 172 | x_shape = x.get_shape() 173 | params_shape = x_shape[-1:] 174 | 175 | if c['use_bias']: 176 | bias = self._get_variable('bias', params_shape, 177 | initializer=tf.zeros_initializer()) 178 | return x + bias 179 | 180 | 181 | axis = list(range(len(x_shape) - 1)) 182 | 183 | beta = self._get_variable('beta', 184 | caffe_name, 185 | params_shape, 186 | key='offset', 187 | initializer=tf.zeros_initializer()) 188 | gamma = self._get_variable('gamma', 189 | caffe_name, 190 | params_shape, 191 | key='scale', 192 | initializer=tf.ones_initializer()) 193 | 194 | moving_mean = self._get_variable('moving_mean', 195 | caffe_name, 196 | params_shape, 197 | key='mean', 198 | initializer=tf.zeros_initializer(), 199 | trainable=False) 200 | moving_variance = self._get_variable('moving_variance', 201 | caffe_name, 202 | params_shape, 203 | key='variance', 204 | initializer=tf.ones_initializer(), 205 | trainable=False) 206 | 207 | if belong == 'conv': 208 | self.varlist_conv.extend([beta, gamma, moving_mean, moving_variance]) 209 | elif belong == 'region': 210 | self.varlist_region.extend([beta, gamma, moving_mean, moving_variance]) 211 | 212 | # These ops will only be preformed when training. 
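        # Note (added): the batch statistics computed below feed the moving
        # averages; the control_flow_ops.cond further down selects batch
        # statistics at training time and the stored moving averages at
        # inference time.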
213 | mean, variance = tf.nn.moments(x, axis) 214 | update_moving_mean = moving_averages.assign_moving_average(moving_mean, 215 | mean, BN_DECAY) 216 | update_moving_variance = moving_averages.assign_moving_average( 217 | moving_variance, variance, BN_DECAY) 218 | tf.add_to_collection(UPDATE_OPS_COLLECTION, update_moving_mean) 219 | tf.add_to_collection(UPDATE_OPS_COLLECTION, update_moving_variance) 220 | 221 | mean, variance = control_flow_ops.cond( 222 | c['is_training'], lambda: (mean, variance), 223 | lambda: (moving_mean, moving_variance)) 224 | 225 | x = tf.nn.batch_normalization(x, mean, variance, beta, gamma, BN_EPSILON) 226 | 227 | return x 228 | 229 | def _get_variable(self, 230 | name, 231 | caffe_name, 232 | shape, 233 | initializer, 234 | key='weights', 235 | dtype='float', 236 | trainable=True): 237 | 238 | "A little wrapper around tf.get_variable to do weight decay and add to" 239 | "resnet collection" 240 | if self.data_dict.get(caffe_name): 241 | initializer = tf.constant_initializer(value = self.data_dict[caffe_name][key], dtype = tf.float32) 242 | else: 243 | print('[WARNING] Resnet block with caffe name\ 244 | %s:%s was initialized by random' % (caffe_name, key)) 245 | var = tf.get_variable(name, shape=shape, initializer=initializer, 246 | dtype=dtype, trainable=trainable) 247 | if self.weight_decay > 0: 248 | weight_decay = tf.multiply(tf.nn.l2_loss(var), self.weight_decay, 249 | name = 'weight_decay') 250 | tf.add_to_collection('img_net_weight_decay', weight_decay) 251 | return var 252 | 253 | def conv(self, x, c, caffe_name, belong='conv', padding='SAME'): 254 | ksize = c['ksize'] 255 | stride = c['stride'] 256 | filters_out = c['conv_filters_out'] 257 | 258 | filters_in = x.get_shape()[-1] 259 | shape = [ksize, ksize, filters_in, filters_out] 260 | initializer = tf.truncated_normal_initializer(stddev=CONV_WEIGHT_STDDEV) 261 | weights = self._get_variable('weights', 262 | caffe_name, 263 | shape=shape, 264 | dtype='float32', 265 | initializer=initializer) 266 | if belong == 'conv': 267 | self.varlist_conv.append(weights) 268 | elif belong == 'region': 269 | self.varlist_region.append(weights) 270 | if ksize == 1 and stride == 2: 271 | padding = 'VALID' 272 | return tf.nn.conv2d(x, weights, [1, stride, stride, 1], padding=padding) 273 | 274 | 275 | def _max_pool(self, x, ksize=3, stride=2): 276 | return tf.nn.max_pool(x, 277 | ksize=[1, ksize, ksize, 1], 278 | strides=[1, stride, stride, 1], 279 | padding='VALID') 280 | -------------------------------------------------------------------------------- /networks/image_feat_net/roi_pooling_layer/__init__.py: -------------------------------------------------------------------------------- 1 | # -------------------------------------------------------- 2 | # Fast R-CNN 3 | # Copyright (c) 2015 Microsoft 4 | # Licensed under The MIT License [see LICENSE for details] 5 | # Written by Ross Girshick 6 | # -------------------------------------------------------- 7 | -------------------------------------------------------------------------------- /networks/image_feat_net/roi_pooling_layer/roi_pooling.so: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuanluya/dbnet_tensorflow/991bba69fed28f96dd289bd393d566098f3185c2/networks/image_feat_net/roi_pooling_layer/roi_pooling.so -------------------------------------------------------------------------------- /networks/image_feat_net/roi_pooling_layer/roi_pooling_op.cc: 
-------------------------------------------------------------------------------- 1 | /* Copyright 2015 Google Inc. All Rights Reserved. 2 | 3 | Licensed under the Apache License, Version 2.0 (the "License"); 4 | you may not use this file except in compliance with the License. 5 | You may obtain a copy of the License at 6 | 7 | http://www.apache.org/licenses/LICENSE-2.0 8 | 9 | Unless required by applicable law or agreed to in writing, software 10 | distributed under the License is distributed on an "AS IS" BASIS, 11 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | See the License for the specific language governing permissions and 13 | limitations under the License. 14 | ==============================================================================*/ 15 | 16 | // An example Op. 17 | 18 | #include 19 | #include 20 | 21 | #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor" 22 | #include "tensorflow/core/framework/op.h" 23 | #include "tensorflow/core/framework/op_kernel.h" 24 | #include "tensorflow/core/framework/tensor_shape.h" 25 | #include "work_sharder.h" 26 | 27 | using namespace tensorflow; 28 | typedef Eigen::ThreadPoolDevice CPUDevice; 29 | 30 | REGISTER_OP("RoiPool") 31 | .Attr("T: {float, double}") 32 | .Attr("pooled_height: int") 33 | .Attr("pooled_width: int") 34 | .Attr("spatial_scale: float") 35 | .Input("bottom_data: T") 36 | .Input("bottom_rois: T") 37 | .Output("top_data: T") 38 | .Output("argmax: int32"); 39 | 40 | REGISTER_OP("RoiPoolGrad") 41 | .Attr("T: {float, double}") 42 | .Attr("pooled_height: int") 43 | .Attr("pooled_width: int") 44 | .Attr("spatial_scale: float") 45 | .Input("bottom_data: T") 46 | .Input("bottom_rois: T") 47 | .Input("argmax: int32") 48 | .Input("grad: T") 49 | .Output("output: T"); 50 | 51 | template 52 | class RoiPoolOp : public OpKernel { 53 | public: 54 | explicit RoiPoolOp(OpKernelConstruction* context) : OpKernel(context) { 55 | // Get the pool height 56 | OP_REQUIRES_OK(context, 57 | context->GetAttr("pooled_height", &pooled_height_)); 58 | // Check that pooled_height is positive 59 | OP_REQUIRES(context, pooled_height_ >= 0, 60 | errors::InvalidArgument("Need pooled_height >= 0, got ", 61 | pooled_height_)); 62 | // Get the pool width 63 | OP_REQUIRES_OK(context, 64 | context->GetAttr("pooled_width", &pooled_width_)); 65 | // Check that pooled_width is positive 66 | OP_REQUIRES(context, pooled_width_ >= 0, 67 | errors::InvalidArgument("Need pooled_width >= 0, got ", 68 | pooled_width_)); 69 | // Get the spatial scale 70 | OP_REQUIRES_OK(context, 71 | context->GetAttr("spatial_scale", &spatial_scale_)); 72 | } 73 | 74 | void Compute(OpKernelContext* context) override 75 | { 76 | // Grab the input tensor 77 | const Tensor& bottom_data = context->input(0); 78 | const Tensor& bottom_rois = context->input(1); 79 | auto bottom_data_flat = bottom_data.flat(); 80 | auto bottom_rois_flat = bottom_rois.flat(); 81 | 82 | // data should have 4 dimensions. 83 | OP_REQUIRES(context, bottom_data.dims() == 4, 84 | errors::InvalidArgument("data must be 4-dimensional")); 85 | 86 | // rois should have 2 dimensions. 
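    // Note (added): each row of bottom_rois is [batch_idx, xmin, ymin, xmax,
    // ymax]; the coordinates are multiplied by spatial_scale in the kernel.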
87 | OP_REQUIRES(context, bottom_rois.dims() == 2, 88 | errors::InvalidArgument("rois must be 2-dimensional")); 89 | 90 | // Number of ROIs 91 | int num_rois = bottom_rois.dim_size(0); 92 | // batch size 93 | int batch_size = bottom_data.dim_size(0); 94 | // data height 95 | int data_height = bottom_data.dim_size(1); 96 | // data width 97 | int data_width = bottom_data.dim_size(2); 98 | // Number of channels 99 | int num_channels = bottom_data.dim_size(3); 100 | 101 | // construct the output shape 102 | int dims[4]; 103 | dims[0] = num_rois; 104 | dims[1] = pooled_height_; 105 | dims[2] = pooled_width_; 106 | dims[3] = num_channels; 107 | TensorShape output_shape; 108 | TensorShapeUtils::MakeShape(dims, 4, &output_shape); 109 | 110 | // Create output tensors 111 | Tensor* output_tensor = NULL; 112 | OP_REQUIRES_OK(context, context->allocate_output(0, output_shape, &output_tensor)); 113 | auto output = output_tensor->template flat(); 114 | 115 | Tensor* argmax_tensor = NULL; 116 | OP_REQUIRES_OK(context, context->allocate_output(1, output_shape, &argmax_tensor)); 117 | auto argmax = argmax_tensor->template flat(); 118 | 119 | int pooled_height = pooled_height_; 120 | int pooled_width = pooled_width_; 121 | float spatial_scale = spatial_scale_; 122 | 123 | auto shard = [pooled_height, pooled_width, spatial_scale, 124 | num_rois, batch_size, data_height, data_width, num_channels, 125 | &bottom_data_flat, &bottom_rois_flat, &output, &argmax] 126 | (int64 start, int64 limit) { 127 | for (int64 b = start; b < limit; ++b) 128 | { 129 | // (n, ph, pw, c) is an element in the pooled output 130 | int n = b; 131 | int c = n % num_channels; 132 | n /= num_channels; 133 | int pw = n % pooled_width; 134 | n /= pooled_width; 135 | int ph = n % pooled_height; 136 | n /= pooled_height; 137 | 138 | const float* bottom_rois = bottom_rois_flat.data() + n * 5; 139 | int roi_batch_ind = bottom_rois[0]; 140 | int roi_start_w = round(bottom_rois[1] * spatial_scale); 141 | int roi_start_h = round(bottom_rois[2] * spatial_scale); 142 | int roi_end_w = round(bottom_rois[3] * spatial_scale); 143 | int roi_end_h = round(bottom_rois[4] * spatial_scale); 144 | 145 | // Force malformed ROIs to be 1x1 146 | int roi_width = std::max(roi_end_w - roi_start_w + 1, 1); 147 | int roi_height = std::max(roi_end_h - roi_start_h + 1, 1); 148 | const T bin_size_h = static_cast(roi_height) 149 | / static_cast(pooled_height); 150 | const T bin_size_w = static_cast(roi_width) 151 | / static_cast(pooled_width); 152 | 153 | int hstart = static_cast(floor(ph * bin_size_h)); 154 | int wstart = static_cast(floor(pw * bin_size_w)); 155 | int hend = static_cast(ceil((ph + 1) * bin_size_h)); 156 | int wend = static_cast(ceil((pw + 1) * bin_size_w)); 157 | 158 | // Add roi offsets and clip to input boundaries 159 | hstart = std::min(std::max(hstart + roi_start_h, 0), data_height); 160 | hend = std::min(std::max(hend + roi_start_h, 0), data_height); 161 | wstart = std::min(std::max(wstart + roi_start_w, 0), data_width); 162 | wend = std::min(std::max(wend + roi_start_w, 0), data_width); 163 | bool is_empty = (hend <= hstart) || (wend <= wstart); 164 | 165 | // Define an empty pooling region to be zero 166 | float maxval = is_empty ? 
0 : -FLT_MAX; 167 | // If nothing is pooled, argmax = -1 causes nothing to be backprop'd 168 | int maxidx = -1; 169 | const float* bottom_data = bottom_data_flat.data() + roi_batch_ind * num_channels * data_height * data_width; 170 | for (int h = hstart; h < hend; ++h) { 171 | for (int w = wstart; w < wend; ++w) { 172 | int bottom_index = (h * data_width + w) * num_channels + c; 173 | if (bottom_data[bottom_index] > maxval) { 174 | maxval = bottom_data[bottom_index]; 175 | maxidx = bottom_index; 176 | } 177 | } 178 | } 179 | output(b) = maxval; 180 | argmax(b) = maxidx; 181 | } 182 | }; 183 | 184 | const DeviceBase::CpuWorkerThreads& worker_threads = 185 | *(context->device()->tensorflow_cpu_worker_threads()); 186 | const int64 shard_cost = 187 | num_rois * num_channels * pooled_height * pooled_width * spatial_scale; 188 | Shard(worker_threads.num_threads, worker_threads.workers, 189 | output.size(), shard_cost, shard); 190 | } 191 | private: 192 | int pooled_height_; 193 | int pooled_width_; 194 | float spatial_scale_; 195 | }; 196 | 197 | bool ROIPoolForwardLaucher( 198 | const float* bottom_data, const float spatial_scale, const int num_rois, const int height, 199 | const int width, const int channels, const int pooled_height, 200 | const int pooled_width, const float* bottom_rois, 201 | float* top_data, int* argmax_data, const Eigen::GpuDevice& d); 202 | 203 | static void RoiPoolingKernel( 204 | OpKernelContext* context, const Tensor* bottom_data, const Tensor* bottom_rois, 205 | const float spatial_scale, const int num_rois, const int height, 206 | const int width, const int channels, const int pooled_height, 207 | const int pooled_width, const TensorShape& tensor_output_shape) 208 | { 209 | Tensor* output = nullptr; 210 | Tensor* argmax = nullptr; 211 | OP_REQUIRES_OK(context, context->allocate_output(0, tensor_output_shape, &output)); 212 | OP_REQUIRES_OK(context, context->allocate_output(1, tensor_output_shape, &argmax)); 213 | 214 | if (!context->status().ok()) { 215 | return; 216 | } 217 | 218 | ROIPoolForwardLaucher( 219 | bottom_data->flat().data(), spatial_scale, num_rois, height, 220 | width, channels, pooled_height, pooled_width, bottom_rois->flat().data(), 221 | output->flat().data(), argmax->flat().data(), context->eigen_device()); 222 | } 223 | 224 | template 225 | class RoiPoolOp : public OpKernel { 226 | public: 227 | typedef Eigen::GpuDevice Device; 228 | 229 | explicit RoiPoolOp(OpKernelConstruction* context) : OpKernel(context) { 230 | 231 | // Get the pool height 232 | OP_REQUIRES_OK(context, 233 | context->GetAttr("pooled_height", &pooled_height_)); 234 | // Check that pooled_height is positive 235 | OP_REQUIRES(context, pooled_height_ >= 0, 236 | errors::InvalidArgument("Need pooled_height >= 0, got ", 237 | pooled_height_)); 238 | // Get the pool width 239 | OP_REQUIRES_OK(context, 240 | context->GetAttr("pooled_width", &pooled_width_)); 241 | // Check that pooled_width is positive 242 | OP_REQUIRES(context, pooled_width_ >= 0, 243 | errors::InvalidArgument("Need pooled_width >= 0, got ", 244 | pooled_width_)); 245 | // Get the spatial scale 246 | OP_REQUIRES_OK(context, 247 | context->GetAttr("spatial_scale", &spatial_scale_)); 248 | } 249 | 250 | void Compute(OpKernelContext* context) override 251 | { 252 | // Grab the input tensor 253 | const Tensor& bottom_data = context->input(0); 254 | const Tensor& bottom_rois = context->input(1); 255 | 256 | // data should have 4 dimensions. 
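    // Note (added): this is the GPU version of the op; the shape checks mirror
    // the CPU kernel above, and the pooling itself is delegated to the CUDA
    // launcher ROIPoolForwardLaucher via RoiPoolingKernel.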
257 |     OP_REQUIRES(context, bottom_data.dims() == 4,
258 |                 errors::InvalidArgument("data must be 4-dimensional"));
259 | 
260 |     // rois should have 2 dimensions.
261 |     OP_REQUIRES(context, bottom_rois.dims() == 2,
262 |                 errors::InvalidArgument("rois must be 2-dimensional"));
263 | 
264 |     // Number of ROIs
265 |     int num_rois = bottom_rois.dim_size(0);
266 |     // batch size
267 |     int batch_size = bottom_data.dim_size(0);
268 |     // data height
269 |     int data_height = bottom_data.dim_size(1);
270 |     // data width
271 |     int data_width = bottom_data.dim_size(2);
272 |     // Number of channels
273 |     int num_channels = bottom_data.dim_size(3);
274 | 
275 |     // construct the output shape
276 |     int dims[4];
277 |     dims[0] = num_rois;
278 |     dims[1] = pooled_height_;
279 |     dims[2] = pooled_width_;
280 |     dims[3] = num_channels;
281 |     TensorShape output_shape;
282 |     TensorShapeUtils::MakeShape(dims, 4, &output_shape);
283 | 
284 |     RoiPoolingKernel(context, &bottom_data, &bottom_rois, spatial_scale_, num_rois, data_height,
285 |                      data_width, num_channels, pooled_height_, pooled_width_, output_shape);
286 | 
287 |   }
288 |  private:
289 |   int pooled_height_;
290 |   int pooled_width_;
291 |   float spatial_scale_;
292 | };
293 | 
294 | // compute gradient
295 | template <typename Device, typename T>
296 | class RoiPoolGradOp : public OpKernel {
297 |  public:
298 |   explicit RoiPoolGradOp(OpKernelConstruction* context) : OpKernel(context) {
299 | 
300 |     // Get the pool height
301 |     OP_REQUIRES_OK(context,
302 |                    context->GetAttr("pooled_height", &pooled_height_));
303 |     // Check that pooled_height is positive
304 |     OP_REQUIRES(context, pooled_height_ >= 0,
305 |                 errors::InvalidArgument("Need pooled_height >= 0, got ",
306 |                                         pooled_height_));
307 |     // Get the pool width
308 |     OP_REQUIRES_OK(context,
309 |                    context->GetAttr("pooled_width", &pooled_width_));
310 |     // Check that pooled_width is positive
311 |     OP_REQUIRES(context, pooled_width_ >= 0,
312 |                 errors::InvalidArgument("Need pooled_width >= 0, got ",
313 |                                         pooled_width_));
314 |     // Get the spatial scale
315 |     OP_REQUIRES_OK(context,
316 |                    context->GetAttr("spatial_scale", &spatial_scale_));
317 |   }
318 | 
319 |   void Compute(OpKernelContext* context) override
320 |   {
321 |     // Grab the input tensor
322 |     const Tensor& bottom_data = context->input(0);
323 |     const Tensor& bottom_rois = context->input(1);
324 |     const Tensor& argmax_data = context->input(2);
325 |     const Tensor& out_backprop = context->input(3);
326 | 
327 |     auto bottom_data_flat = bottom_data.flat<float>();
328 |     auto bottom_rois_flat = bottom_rois.flat<float>();
329 |     auto argmax_data_flat = argmax_data.flat<int>();
330 |     auto out_backprop_flat = out_backprop.flat<float>();
331 | 
332 |     // data should have 4 dimensions.
333 |     OP_REQUIRES(context, bottom_data.dims() == 4,
334 |                 errors::InvalidArgument("data must be 4-dimensional"));
335 | 
336 |     // rois should have 2 dimensions.
337 |     OP_REQUIRES(context, bottom_rois.dims() == 2,
338 |                 errors::InvalidArgument("rois must be 2-dimensional"));
339 | 
340 |     OP_REQUIRES(context, argmax_data.dims() == 4,
341 |                 errors::InvalidArgument("argmax_data must be 4-dimensional"));
342 | 
343 |     OP_REQUIRES(context, out_backprop.dims() == 4,
344 |                 errors::InvalidArgument("out_backprop must be 4-dimensional"));
345 | 
346 |     // Number of ROIs
347 |     int num_rois = bottom_rois.dim_size(0);
348 |     // batch size
349 |     int batch_size = bottom_data.dim_size(0);
350 |     // data height
351 |     int data_height = bottom_data.dim_size(1);
352 |     // data width
353 |     int data_width = bottom_data.dim_size(2);
354 |     // Number of channels
355 |     int num_channels = bottom_data.dim_size(3);
356 | 
357 |     // construct the output shape
358 |     TensorShape output_shape = bottom_data.shape();
359 | 
360 |     // Create output tensors
361 |     Tensor* output_tensor = NULL;
362 |     OP_REQUIRES_OK(context, context->allocate_output(0, output_shape, &output_tensor));
363 |     auto output = output_tensor->template flat<float>();
364 | 
365 |     int pooled_height = pooled_height_;
366 |     int pooled_width = pooled_width_;
367 |     float spatial_scale = spatial_scale_;
368 | 
369 |     auto shard = [pooled_height, pooled_width, spatial_scale,
370 |                   num_rois, batch_size, data_height, data_width, num_channels,
371 |                   &bottom_data_flat, &bottom_rois_flat, &argmax_data_flat,
372 |                   &out_backprop_flat, &output](int64 start, int64 limit) {
373 |       for (int64 b = start; b < limit; ++b)
374 |       {
375 |         // (n, h, w, c) coords in bottom data
376 |         int n = b;
377 |         int c = n % num_channels;
378 |         n /= num_channels;
379 |         int w = n % data_width;
380 |         n /= data_width;
381 |         int h = n % data_height;
382 |         n /= data_height;
383 | 
384 |         float gradient = 0.0;
385 |         // Accumulate gradient over all ROIs that pooled this element
386 |         for (int roi_n = 0; roi_n < num_rois; ++roi_n)
387 |         {
388 |           const float* offset_bottom_rois = bottom_rois_flat.data() + roi_n * 5;
389 |           int roi_batch_ind = offset_bottom_rois[0];
390 |           // Skip if ROI's batch index doesn't match n
391 |           if (n != roi_batch_ind) {
392 |             continue;
393 |           }
394 | 
395 |           int roi_start_w = round(offset_bottom_rois[1] * spatial_scale);
396 |           int roi_start_h = round(offset_bottom_rois[2] * spatial_scale);
397 |           int roi_end_w = round(offset_bottom_rois[3] * spatial_scale);
398 |           int roi_end_h = round(offset_bottom_rois[4] * spatial_scale);
399 | 
400 |           // Skip if ROI doesn't include (h, w)
401 |           const bool in_roi = (w >= roi_start_w && w <= roi_end_w &&
402 |                                h >= roi_start_h && h <= roi_end_h);
403 |           if (!in_roi) {
404 |             continue;
405 |           }
406 | 
407 |           int offset = roi_n * pooled_height * pooled_width * num_channels;
408 |           const float* offset_top_diff = out_backprop_flat.data() + offset;
409 |           const int* offset_argmax_data = argmax_data_flat.data() + offset;
410 | 
411 |           // Compute feasible set of pooled units that could have pooled
412 |           // this bottom unit
413 | 
414 |           // Force malformed ROIs to be 1x1
415 |           int roi_width = std::max(roi_end_w - roi_start_w + 1, 1);
416 |           int roi_height = std::max(roi_end_h - roi_start_h + 1, 1);
417 | 
418 |           const T bin_size_h = static_cast<T>(roi_height)
419 |                                / static_cast<T>(pooled_height);
420 |           const T bin_size_w = static_cast<T>(roi_width)
421 |                                / static_cast<T>(pooled_width);
422 | 
423 |           int phstart = floor(static_cast<T>(h - roi_start_h) / bin_size_h);
424 |           int phend = ceil(static_cast<T>(h - roi_start_h + 1) / bin_size_h);
425 |           int pwstart = floor(static_cast<T>(w - roi_start_w) / bin_size_w);
426 |           int pwend = ceil(static_cast<T>(w - roi_start_w + 1) / bin_size_w);
427 | 
428 |           phstart = std::min(std::max(phstart, 0), pooled_height);
429 |           phend = std::min(std::max(phend, 0), pooled_height);
430 |           pwstart = std::min(std::max(pwstart, 0), pooled_width);
431 |           pwend = std::min(std::max(pwend, 0), pooled_width);
432 | 
433 |           for (int ph = phstart; ph < phend; ++ph) {
434 |             for (int pw = pwstart; pw < pwend; ++pw) {
435 |               if (offset_argmax_data[(ph * pooled_width + pw) * num_channels + c] == (h * data_width + w) * num_channels + c)
436 |               {
437 |                 gradient += offset_top_diff[(ph * pooled_width + pw) * num_channels + c];
438 |               }
439 |             }
440 |           }
441 |         }
442 |         output(b) = gradient;
443 |       }
444 |     };
445 | 
446 |     const DeviceBase::CpuWorkerThreads& worker_threads =
447 |         *(context->device()->tensorflow_cpu_worker_threads());
448 |     const int64 shard_cost =
449 |         num_rois * num_channels * pooled_height * pooled_width * spatial_scale;
450 |     Shard(worker_threads.num_threads, worker_threads.workers,
451 |           output.size(), shard_cost, shard);
452 |   }
453 |  private:
454 |   int pooled_height_;
455 |   int pooled_width_;
456 |   float spatial_scale_;
457 | };
458 | 
459 | bool ROIPoolBackwardLaucher(const float* top_diff, const float spatial_scale, const int batch_size, const int num_rois,
460 |                             const int height, const int width, const int channels, const int pooled_height,
461 |                             const int pooled_width, const float* bottom_rois,
462 |                             float* bottom_diff, const int* argmax_data, const Eigen::GpuDevice& d);
463 | 
464 | static void RoiPoolingGradKernel(
465 |     OpKernelContext* context, const Tensor* bottom_data, const Tensor* bottom_rois, const Tensor* argmax_data, const Tensor* out_backprop,
466 |     const float spatial_scale, const int batch_size, const int num_rois, const int height,
467 |     const int width, const int channels, const int pooled_height,
468 |     const int pooled_width, const TensorShape& tensor_output_shape)
469 | {
470 |   Tensor* output = nullptr;
471 |   OP_REQUIRES_OK(context, context->allocate_output(0, tensor_output_shape, &output));
472 | 
473 |   if (!context->status().ok()) {
474 |     return;
475 |   }
476 | 
477 |   ROIPoolBackwardLaucher(
478 |       out_backprop->flat<float>().data(), spatial_scale, batch_size, num_rois, height,
479 |       width, channels, pooled_height, pooled_width, bottom_rois->flat<float>().data(),
480 |       output->flat<float>().data(), argmax_data->flat<int>().data(), context->eigen_device<Eigen::GpuDevice>());
481 | }
482 | 
483 | 
484 | template <class T>
485 | class RoiPoolGradOp<Eigen::GpuDevice, T> : public OpKernel {
486 |  public:
487 |   explicit RoiPoolGradOp(OpKernelConstruction* context) : OpKernel(context) {
488 | 
489 |     // Get the pool height
490 |     OP_REQUIRES_OK(context,
491 |                    context->GetAttr("pooled_height", &pooled_height_));
492 |     // Check that pooled_height is positive
493 |     OP_REQUIRES(context, pooled_height_ >= 0,
494 |                 errors::InvalidArgument("Need pooled_height >= 0, got ",
495 |                                         pooled_height_));
496 |     // Get the pool width
497 |     OP_REQUIRES_OK(context,
498 |                    context->GetAttr("pooled_width", &pooled_width_));
499 |     // Check that pooled_width is positive
500 |     OP_REQUIRES(context, pooled_width_ >= 0,
501 |                 errors::InvalidArgument("Need pooled_width >= 0, got ",
502 |                                         pooled_width_));
503 |     // Get the spatial scale
504 |     OP_REQUIRES_OK(context,
505 |                    context->GetAttr("spatial_scale", &spatial_scale_));
506 |   }
507 | 
508 |   void Compute(OpKernelContext* context) override
509 |   {
510 |     // Grab the input tensor
511 |     const Tensor& bottom_data = context->input(0);
512 |     const Tensor& bottom_rois = context->input(1);
513 |     const Tensor& argmax_data = context->input(2);
514 |     const Tensor& out_backprop = context->input(3);
515 | 
516 |     // data should have 4 dimensions.
517 |     OP_REQUIRES(context, bottom_data.dims() == 4,
518 |                 errors::InvalidArgument("data must be 4-dimensional"));
519 | 
520 |     // rois should have 2 dimensions.
521 |     OP_REQUIRES(context, bottom_rois.dims() == 2,
522 |                 errors::InvalidArgument("rois must be 2-dimensional"));
523 | 
524 |     OP_REQUIRES(context, argmax_data.dims() == 4,
525 |                 errors::InvalidArgument("argmax_data must be 4-dimensional"));
526 | 
527 |     OP_REQUIRES(context, out_backprop.dims() == 4,
528 |                 errors::InvalidArgument("out_backprop must be 4-dimensional"));
529 | 
530 |     // Number of ROIs
531 |     int num_rois = bottom_rois.dim_size(0);
532 |     // batch size
533 |     int batch_size = bottom_data.dim_size(0);
534 |     // data height
535 |     int height = bottom_data.dim_size(1);
536 |     // data width
537 |     int width = bottom_data.dim_size(2);
538 |     // Number of channels
539 |     int channels = bottom_data.dim_size(3);
540 | 
541 |     // construct the output shape
542 |     TensorShape output_shape = bottom_data.shape();
543 | 
544 |     RoiPoolingGradKernel(
545 |         context, &bottom_data, &bottom_rois, &argmax_data, &out_backprop,
546 |         spatial_scale_, batch_size, num_rois, height, width, channels, pooled_height_,
547 |         pooled_width_, output_shape);
548 | 
549 |   }
550 |  private:
551 |   int pooled_height_;
552 |   int pooled_width_;
553 |   float spatial_scale_;
554 | };
555 | 
556 | REGISTER_KERNEL_BUILDER(Name("RoiPool").Device(DEVICE_CPU).TypeConstraint<float>("T"), RoiPoolOp<Eigen::ThreadPoolDevice, float>);
557 | REGISTER_KERNEL_BUILDER(Name("RoiPoolGrad").Device(DEVICE_CPU).TypeConstraint<float>("T"), RoiPoolGradOp<Eigen::ThreadPoolDevice, float>);
558 | #if GOOGLE_CUDA
559 | REGISTER_KERNEL_BUILDER(Name("RoiPool").Device(DEVICE_GPU).TypeConstraint<float>("T"), RoiPoolOp<Eigen::GpuDevice, float>);
560 | REGISTER_KERNEL_BUILDER(Name("RoiPoolGrad").Device(DEVICE_GPU).TypeConstraint<float>("T"), RoiPoolGradOp<Eigen::GpuDevice, float>);
561 | #endif
562 | 
--------------------------------------------------------------------------------
/networks/image_feat_net/roi_pooling_layer/roi_pooling_op.cu.o:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuanluya/dbnet_tensorflow/991bba69fed28f96dd289bd393d566098f3185c2/networks/image_feat_net/roi_pooling_layer/roi_pooling_op.cu.o
--------------------------------------------------------------------------------
/networks/image_feat_net/roi_pooling_layer/roi_pooling_op.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import os.path as osp
3 | 
4 | filename = osp.join(osp.dirname(__file__), 'roi_pooling.so')
5 | _roi_pooling_module = tf.load_op_library(filename)
6 | roi_pool = _roi_pooling_module.roi_pool
7 | roi_pool_grad = _roi_pooling_module.roi_pool_grad
8 | 
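For orientation, here is a minimal usage sketch of the `roi_pool` wrapper loaded above; it is not part of the repository, and the shapes and the 1/16 scale are illustrative only. Each ROI row is `[batch_index, x1, y1, x2, y2]` in image coordinates (the kernels above read column 0 as the batch index and scale columns 1-4 by `spatial_scale`), matching the call made in `vgg16/net.py`.

import tensorflow as tf
from networks.image_feat_net.roi_pooling_layer.roi_pooling_op import roi_pool

# NHWC convolutional feature map; 512 channels is just an example
feat = tf.placeholder(tf.float32, [None, None, None, 512])
# one ROI per row: [batch_index, x1, y1, x2, y2] in image coordinates
rois = tf.placeholder(tf.float32, [None, 5])
# spatial_scale maps image coordinates onto the feature map
# (e.g. 1/16 for a VGG-16 conv5 map; value assumed for illustration)
pooled, argmax = roi_pool(feat, rois, 7, 7, 1.0 / 16)
# pooled: [num_rois, 7, 7, 512]; argmax holds the winning input indices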
--------------------------------------------------------------------------------
/networks/image_feat_net/roi_pooling_layer/roi_pooling_op_gpu.cu.cc:
--------------------------------------------------------------------------------
1 | #if GOOGLE_CUDA
2 | 
3 | #define EIGEN_USE_GPU
4 | 
5 | #include <stdio.h>
6 | #include <cfloat>
7 | #include "roi_pooling_op_gpu.h"
8 | 
9 | #define CUDA_1D_KERNEL_LOOP(i, n)                            \
10 |   for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; \
11 |        i += blockDim.x * gridDim.x)
12 | 
13 | using std::max;
14 | using std::min;
15 | 
16 | // namespace tensorflow {
17 | using namespace tensorflow;
18 | 
19 | template <typename Dtype>
20 | __global__ void ROIPoolForward(const int nthreads, const Dtype* bottom_data,
21 |     const Dtype spatial_scale, const int height, const int width,
22 |     const int channels, const int pooled_height, const int pooled_width,
23 |     const Dtype* bottom_rois, Dtype* top_data, int* argmax_data)
24 | {
25 |   CUDA_1D_KERNEL_LOOP(index, nthreads)
26 |   {
27 |     // (n, ph, pw, c) is an element in the pooled output
28 |     int n = index;
29 |     int c = n % channels;
30 |     n /= channels;
31 |     int pw = n % pooled_width;
32 |     n /= pooled_width;
33 |     int ph = n % pooled_height;
34 |     n /= pooled_height;
35 | 
36 |     bottom_rois += n * 5;
37 |     int roi_batch_ind = bottom_rois[0];
38 |     int roi_start_w = round(bottom_rois[1] * spatial_scale);
39 |     int roi_start_h = round(bottom_rois[2] * spatial_scale);
40 |     int roi_end_w = round(bottom_rois[3] * spatial_scale);
41 |     int roi_end_h = round(bottom_rois[4] * spatial_scale);
42 | 
43 |     // Force malformed ROIs to be 1x1
44 |     int roi_width = max(roi_end_w - roi_start_w + 1, 1);
45 |     int roi_height = max(roi_end_h - roi_start_h + 1, 1);
46 |     Dtype bin_size_h = static_cast<Dtype>(roi_height)
47 |                        / static_cast<Dtype>(pooled_height);
48 |     Dtype bin_size_w = static_cast<Dtype>(roi_width)
49 |                        / static_cast<Dtype>(pooled_width);
50 | 
51 |     int hstart = static_cast<int>(floor(static_cast<Dtype>(ph)
52 |                                         * bin_size_h));
53 |     int wstart = static_cast<int>(floor(static_cast<Dtype>(pw)
54 |                                         * bin_size_w));
55 |     int hend = static_cast<int>(ceil(static_cast<Dtype>(ph + 1)
56 |                                      * bin_size_h));
57 |     int wend = static_cast<int>(ceil(static_cast<Dtype>(pw + 1)
58 |                                      * bin_size_w));
59 | 
60 |     // Add roi offsets and clip to input boundaries
61 |     hstart = min(max(hstart + roi_start_h, 0), height);
62 |     hend = min(max(hend + roi_start_h, 0), height);
63 |     wstart = min(max(wstart + roi_start_w, 0), width);
64 |     wend = min(max(wend + roi_start_w, 0), width);
65 |     bool is_empty = (hend <= hstart) || (wend <= wstart);
66 | 
67 |     // Define an empty pooling region to be zero
68 |     Dtype maxval = is_empty ? 0 : -FLT_MAX;
69 |     // If nothing is pooled, argmax = -1 causes nothing to be backprop'd
70 |     int maxidx = -1;
71 |     bottom_data += roi_batch_ind * channels * height * width;
72 |     for (int h = hstart; h < hend; ++h) {
73 |       for (int w = wstart; w < wend; ++w) {
74 |         int bottom_index = (h * width + w) * channels + c;
75 |         if (bottom_data[bottom_index] > maxval) {
76 |           maxval = bottom_data[bottom_index];
77 |           maxidx = bottom_index;
78 |         }
79 |       }
80 |     }
81 |     top_data[index] = maxval;
82 |     if (argmax_data != nullptr)
83 |       argmax_data[index] = maxidx;
84 |   }
85 | }
86 | 
87 | bool ROIPoolForwardLaucher(
88 |     const float* bottom_data, const float spatial_scale, const int num_rois, const int height,
89 |     const int width, const int channels, const int pooled_height,
90 |     const int pooled_width, const float* bottom_rois,
91 |     float* top_data, int* argmax_data, const Eigen::GpuDevice& d)
92 | {
93 |   const int kThreadsPerBlock = 1024;
94 |   const int output_size = num_rois * pooled_height * pooled_width * channels;
95 |   cudaError_t err;
96 | 
97 |   ROIPoolForward<float><<<(output_size + kThreadsPerBlock - 1) / kThreadsPerBlock,
98 |                           kThreadsPerBlock, 0, d.stream()>>>(
99 |       output_size, bottom_data, spatial_scale, height, width, channels, pooled_height,
100 |       pooled_width, bottom_rois, top_data, argmax_data);
101 | 
102 |   err = cudaGetLastError();
103 |   if(cudaSuccess != err)
104 |   {
105 |     fprintf( stderr, "cudaCheckError() failed : %s\n", cudaGetErrorString( err ) );
106 |     exit( -1 );
107 |   }
108 | 
109 |   return d.ok();
110 | }
111 | 
112 | 
113 | template <typename Dtype>
114 | __global__ void ROIPoolBackward(const int nthreads, const Dtype* top_diff,
115 |     const int* argmax_data, const int num_rois, const Dtype spatial_scale,
116 |     const int height, const int width, const int channels,
117 |     const int pooled_height, const int pooled_width, Dtype* bottom_diff,
118 |     const Dtype* bottom_rois) {
119 |   CUDA_1D_KERNEL_LOOP(index, nthreads)
120 |   {
121 |     // (n, h, w, c) coords in bottom data
122 |     int n = index;
123 |     int c = n % channels;
124 |     n /= channels;
125 |     int w = n % width;
126 |     n /= width;
127 |     int h = n % height;
128 |     n /= height;
129 | 
130 |     Dtype gradient = 0;
131 |     // Accumulate gradient over all ROIs that pooled this element
132 |     for (int roi_n = 0; roi_n < num_rois; ++roi_n)
133 |     {
134 |       const Dtype* offset_bottom_rois = bottom_rois + roi_n * 5;
135 |       int roi_batch_ind = offset_bottom_rois[0];
136 |       // Skip if ROI's batch index doesn't match n
137 |       if (n != roi_batch_ind) {
138 |         continue;
139 |       }
140 | 
141 |       int roi_start_w = round(offset_bottom_rois[1] * spatial_scale);
142 |       int roi_start_h = round(offset_bottom_rois[2] * spatial_scale);
143 |       int roi_end_w = round(offset_bottom_rois[3] * spatial_scale);
144 |       int roi_end_h = round(offset_bottom_rois[4] * spatial_scale);
145 | 
146 |       // Skip if ROI doesn't include (h, w)
147 |       const bool in_roi = (w >= roi_start_w && w <= roi_end_w &&
148 |                            h >= roi_start_h && h <= roi_end_h);
149 |       if (!in_roi) {
150 |         continue;
151 |       }
152 | 
153 |       int offset = roi_n * pooled_height * pooled_width * channels;
154 |       const Dtype* offset_top_diff = top_diff + offset;
155 |       const int* offset_argmax_data = argmax_data + offset;
156 | 
157 |       // Compute feasible set of pooled units that could have pooled
158 |       // this bottom unit
159 | 
160 |       // Force malformed ROIs to be 1x1
161 |       int roi_width = max(roi_end_w - roi_start_w + 1, 1);
162 |       int roi_height = max(roi_end_h - roi_start_h + 1, 1);
163 | 
164 |       Dtype bin_size_h = static_cast<Dtype>(roi_height)
165 |                          / static_cast<Dtype>(pooled_height);
166 |       Dtype bin_size_w = static_cast<Dtype>(roi_width)
167 |                          / static_cast<Dtype>(pooled_width);
168 | 
169 |       int phstart = floor(static_cast<Dtype>(h - roi_start_h) / bin_size_h);
170 |       int phend = ceil(static_cast<Dtype>(h - roi_start_h + 1) / bin_size_h);
171 |       int pwstart = floor(static_cast<Dtype>(w - roi_start_w) / bin_size_w);
172 |       int pwend = ceil(static_cast<Dtype>(w - roi_start_w + 1) / bin_size_w);
173 | 
174 |       phstart = min(max(phstart, 0), pooled_height);
175 |       phend = min(max(phend, 0), pooled_height);
176 |       pwstart = min(max(pwstart, 0), pooled_width);
177 |       pwend = min(max(pwend, 0), pooled_width);
178 | 
179 |       for (int ph = phstart; ph < phend; ++ph) {
180 |         for (int pw = pwstart; pw < pwend; ++pw) {
181 |           if (offset_argmax_data[(ph * pooled_width + pw) * channels + c] == (h * width + w) * channels + c)
182 |           {
183 |             gradient += offset_top_diff[(ph * pooled_width + pw) * channels + c];
184 |           }
185 |         }
186 |       }
187 |     }
188 |     bottom_diff[index] = gradient;
189 |   }
190 | }
191 | 
192 | 
193 | bool ROIPoolBackwardLaucher(const float* top_diff, const float spatial_scale, const int batch_size, const int num_rois,
194 |                             const int height, const int width, const int channels, const int pooled_height,
195 |                             const int pooled_width, const float* bottom_rois,
196 |                             float* bottom_diff, const int* argmax_data, const Eigen::GpuDevice& d)
197 | {
198 |   const int kThreadsPerBlock = 1024;
199 |   const int output_size = batch_size * height * width * channels;
200 |   cudaError_t err;
201 | 
202 |   ROIPoolBackward<float><<<(output_size + kThreadsPerBlock - 1) / kThreadsPerBlock,
203 |                            kThreadsPerBlock, 0, d.stream()>>>(
204 |       output_size, top_diff, argmax_data, num_rois, spatial_scale, height, width, channels, pooled_height,
205 |       pooled_width, bottom_diff, bottom_rois);
206 | 
207 |   err = cudaGetLastError();
208 |   if(cudaSuccess != err)
209 |   {
210 |     fprintf( stderr, "cudaCheckError() failed : %s\n", cudaGetErrorString( err ) );
211 |     exit( -1 );
212 |   }
213 | 
214 |   return d.ok();
215 | }
216 | 
217 | // } // namespace tensorflow
218 | 
219 | #endif  // GOOGLE_CUDA
220 | 
--------------------------------------------------------------------------------
/networks/image_feat_net/roi_pooling_layer/roi_pooling_op_gpu.h:
--------------------------------------------------------------------------------
1 | #if !GOOGLE_CUDA
2 | #error This file must only be included when building with Cuda support
3 | #endif
4 | 
5 | #ifndef TENSORFLOW_USER_OPS_ROIPOOLING_OP_GPU_H_
6 | #define TENSORFLOW_USER_OPS_ROIPOOLING_OP_GPU_H_
7 | 
8 | #define EIGEN_USE_GPU
9 | 
10 | #include "tensorflow/core/framework/tensor_types.h"
11 | #include "tensorflow/core/platform/types.h"
12 | 
13 | namespace tensorflow {
14 | 
15 | // Run the forward pass of max pooling, optionally writing the argmax indices to
16 | // the mask array, if it is not nullptr. If mask is passed in as nullptr, the
17 | // argmax indices are not written.
18 | bool ROIPoolForwardLaucher(
19 |     const float* bottom_data, const float spatial_scale, const int num_rois, const int height,
20 |     const int width, const int channels, const int pooled_height,
21 |     const int pooled_width, const float* bottom_rois,
22 |     float* top_data, int* argmax_data, const Eigen::GpuDevice& d);
23 | 
24 | bool ROIPoolBackwardLaucher(const float* top_diff, const float spatial_scale, const int batch_size, const int num_rois,
25 |                             const int height, const int width, const int channels, const int pooled_height,
26 |                             const int pooled_width, const float* bottom_rois,
27 |                             float* bottom_diff, const int* argmax_data, const Eigen::GpuDevice& d);
28 | 
29 | } // namespace tensorflow
30 | 
31 | #endif  // TENSORFLOW_USER_OPS_ROIPOOLING_OP_GPU_H_
32 | 
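The forward kernels above are easier to follow against a plain NumPy reference. The sketch below is a hypothetical helper, not repository code; it mirrors the same arithmetic: scale the ROI by `spatial_scale`, clamp malformed boxes to 1x1, split the box into a `pooled_h x pooled_w` grid with floor/ceil bin edges, and take a per-channel max, with empty bins defined as zero.

import numpy as np

def roi_pool_reference(feat, rois, pooled_h, pooled_w, spatial_scale):
    """NumPy mirror of the kernels above. feat is NHWC; each roi row is
    [batch_index, x1, y1, x2, y2] in image coordinates (illustrative)."""
    num_rois = rois.shape[0]
    _, height, width, channels = feat.shape
    out = np.zeros((num_rois, pooled_h, pooled_w, channels), feat.dtype)
    for n, (ind, x1, y1, x2, y2) in enumerate(rois):
        ws, hs = int(round(x1 * spatial_scale)), int(round(y1 * spatial_scale))
        we, he = int(round(x2 * spatial_scale)), int(round(y2 * spatial_scale))
        roi_h = max(he - hs + 1, 1)   # malformed ROIs become 1x1
        roi_w = max(we - ws + 1, 1)
        for ph in range(pooled_h):
            for pw in range(pooled_w):
                # floor/ceil bin edges, offset by the ROI start, clipped to the map
                h0 = min(max(int(np.floor(ph * roi_h / pooled_h)) + hs, 0), height)
                h1 = min(max(int(np.ceil((ph + 1) * roi_h / pooled_h)) + hs, 0), height)
                w0 = min(max(int(np.floor(pw * roi_w / pooled_w)) + ws, 0), width)
                w1 = min(max(int(np.ceil((pw + 1) * roi_w / pooled_w)) + ws, 0), width)
                if h1 > h0 and w1 > w0:  # empty bins stay zero, as in the kernel
                    out[n, ph, pw, :] = feat[int(ind), h0:h1, w0:w1, :].max(axis=(0, 1))
    return out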
32 | """ 33 | data = op.inputs[0] 34 | rois = op.inputs[1] 35 | argmax = op.outputs[1] 36 | pooled_height = op.get_attr('pooled_height') 37 | pooled_width = op.get_attr('pooled_width') 38 | spatial_scale = op.get_attr('spatial_scale') 39 | 40 | # compute gradient 41 | data_grad = roi_pooling_op.roi_pool_grad(data, rois, argmax, grad, pooled_height, pooled_width, spatial_scale) 42 | 43 | return [data_grad, None] # List of one Tensor, since we have one input 44 | -------------------------------------------------------------------------------- /networks/image_feat_net/roi_pooling_layer/roi_pooling_op_test.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import numpy as np 3 | import roi_pooling_op 4 | import roi_pooling_op_grad 5 | import tensorflow as tf 6 | import pdb 7 | 8 | 9 | def weight_variable(shape): 10 | initial = tf.truncated_normal(shape, stddev=0.1) 11 | return tf.Variable(initial) 12 | 13 | def conv2d(x, W): 14 | return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME') 15 | 16 | array = np.random.rand(32, 100, 100, 3) 17 | data = tf.convert_to_tensor(array, dtype=tf.float32) 18 | rois = tf.convert_to_tensor([[0, 10, 10, 20, 20], [31, 30, 30, 40, 40]], dtype=tf.float32) 19 | 20 | W = weight_variable([3, 3, 3, 1]) 21 | h = conv2d(data, W) 22 | 23 | [y, argmax] = roi_pooling_op.roi_pool(h, rois, 6, 6, 1.0/3) 24 | pdb.set_trace() 25 | y_data = tf.convert_to_tensor(np.ones((2, 6, 6, 1)), dtype=tf.float32) 26 | print y_data, y, argmax 27 | 28 | # Minimize the mean squared errors. 29 | loss = tf.reduce_mean(tf.square(y - y_data)) 30 | optimizer = tf.train.GradientDescentOptimizer(0.5) 31 | train = optimizer.minimize(loss) 32 | 33 | init = tf.initialize_all_variables() 34 | 35 | # Launch the graph. 36 | sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)) 37 | sess.run(init) 38 | pdb.set_trace() 39 | for step in xrange(10): 40 | sess.run(train) 41 | print(step, sess.run(W)) 42 | print(sess.run(y)) 43 | 44 | #with tf.device('/gpu:0'): 45 | # result = module.roi_pool(data, rois, 1, 1, 1.0/1) 46 | # print result.eval() 47 | #with tf.device('/cpu:0'): 48 | # run(init) 49 | -------------------------------------------------------------------------------- /networks/image_feat_net/roi_pooling_layer/work_sharder.h: -------------------------------------------------------------------------------- 1 | /* Copyright 2015 The TensorFlow Authors. All Rights Reserved. 2 | 3 | Licensed under the Apache License, Version 2.0 (the "License"); 4 | you may not use this file except in compliance with the License. 5 | You may obtain a copy of the License at 6 | 7 | http://www.apache.org/licenses/LICENSE-2.0 8 | 9 | Unless required by applicable law or agreed to in writing, software 10 | distributed under the License is distributed on an "AS IS" BASIS, 11 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | See the License for the specific language governing permissions and 13 | limitations under the License. 14 | ==============================================================================*/ 15 | 16 | #ifndef TENSORFLOW_UTIL_WORK_SHARDER_H_ 17 | #define TENSORFLOW_UTIL_WORK_SHARDER_H_ 18 | 19 | #include 20 | 21 | #include "tensorflow/core/lib/core/threadpool.h" 22 | #include "tensorflow/core/platform/types.h" 23 | 24 | namespace tensorflow { 25 | 26 | // Shards the "total" unit of work assuming each unit of work having 27 | // roughly "cost_per_unit". 
--------------------------------------------------------------------------------
/networks/image_feat_net/roi_pooling_layer/work_sharder.h:
--------------------------------------------------------------------------------
1 | /* Copyright 2015 The TensorFlow Authors. All Rights Reserved.
2 | 
3 | Licensed under the Apache License, Version 2.0 (the "License");
4 | you may not use this file except in compliance with the License.
5 | You may obtain a copy of the License at
6 | 
7 |     http://www.apache.org/licenses/LICENSE-2.0
8 | 
9 | Unless required by applicable law or agreed to in writing, software
10 | distributed under the License is distributed on an "AS IS" BASIS,
11 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | See the License for the specific language governing permissions and
13 | limitations under the License.
14 | ==============================================================================*/
15 | 
16 | #ifndef TENSORFLOW_UTIL_WORK_SHARDER_H_
17 | #define TENSORFLOW_UTIL_WORK_SHARDER_H_
18 | 
19 | #include <functional>
20 | 
21 | #include "tensorflow/core/lib/core/threadpool.h"
22 | #include "tensorflow/core/platform/types.h"
23 | 
24 | namespace tensorflow {
25 | 
26 | // Shards the "total" unit of work assuming each unit of work having
27 | // roughly "cost_per_unit". Each unit of work is indexed 0, 1, ...,
28 | // total - 1. Each shard contains 1 or more units of work and the
29 | // total cost of each shard is roughly the same. The calling thread and the
30 | // "workers" are used to compute each shard (calling work(start,
31 | // limit). A common configuration is that "workers" is a thread pool
32 | // with at least "max_parallelism" threads.
33 | //
34 | // "cost_per_unit" is an estimate of the number of CPU cycles (or nanoseconds
35 | // if not CPU-bound) to complete a unit of work. Overestimating creates too
36 | // many shards and CPU time will be dominated by per-shard overhead, such as
37 | // Context creation. Underestimating may not fully make use of the specified
38 | // parallelism.
39 | //
40 | // "work" should be a callable taking (int64, int64) arguments.
41 | // work(start, limit) computes the work units from [start,
42 | // limit), i.e., [start, limit) is a shard.
43 | //
44 | // REQUIRES: max_parallelism >= 0
45 | // REQUIRES: workers != nullptr
46 | // REQUIRES: total >= 0
47 | // REQUIRES: cost_per_unit >= 0
48 | void Shard(int max_parallelism, thread::ThreadPool* workers, int64 total,
49 |            int64 cost_per_unit, std::function<void(int64, int64)> work);
50 | 
51 | } // end namespace tensorflow
52 | 
53 | #endif  // TENSORFLOW_UTIL_WORK_SHARDER_H_
54 | 
--------------------------------------------------------------------------------
/networks/image_feat_net/vgg16/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuanluya/dbnet_tensorflow/991bba69fed28f96dd289bd393d566098f3185c2/networks/image_feat_net/vgg16/__init__.py
--------------------------------------------------------------------------------
/networks/image_feat_net/vgg16/net.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import inspect
3 | import os
4 | import tensorflow as tf
5 | from ..roi_pooling_layer.roi_pooling_op import roi_pool
6 | from ..roi_pooling_layer.roi_pooling_op_grad import *
7 | 
8 | class Vgg16:
9 |     def __init__(self, lr, RegionNet_npy_path = 'frcnn_Region_Feat_Net.npy', train = True):
10 |         # load saved model
11 |         try:
12 |             path = inspect.getfile(Vgg16)
13 |             path = os.path.abspath(os.path.join(path, os.pardir))
14 |             RegionNet_npy_path = os.path.join(path, RegionNet_npy_path)
15 |             self.data_dict = np.load(RegionNet_npy_path, encoding='latin1').item()
16 |             print("Image Feat Net npy file loaded")
17 |         except:
18 |             print('[WARNING!] Image Feat Net npy file not found; '
19 |                   'we don\'t recommend training this network from scratch')
20 |             self.data_dict = {}
21 |         self.lr = lr
22 |         self.train = train
23 |         self.varlist_conv = []
24 |         self.varlist_region = []
25 |         self.net_type = 'Vgg16'
26 | 
27 |     def build(self, bgr, rois, parameters):
28 |         # set placeholders
29 |         self.bgr = bgr
30 |         self.rois = rois
31 | 
32 |         # set parameters
33 |         self.feature_dim = parameters['feature_dim']
34 |         self.weight_decay = parameters['weight_decay']
35 |         self.dropout_ratio = parameters['dropout_ratio']
36 |         self.dropout_flag = parameters['dropout_flag']
37 |         self.roi_size = parameters['roi_size']
38 |         self.roi_scale = parameters['roi_scale']
39 |         self.build_conv()
40 |         self.build_region()
41 |         return self.relu7
42 | 
43 |     def build_conv(self):
44 |         """
45 |         Load variables from the npy file to build the VGG-16 convolution stack.
46 | 
47 |         :param bgr: bgr image [batch, height, width, 3] values scaled [0, 1]
48 |         """
49 |         # the input is expected to be in BGR channel order already
50 |         self.conv1_1 = self.conv_layer(self.bgr, "conv1_1")
51 |         self.conv1_2
= self.conv_layer(self.conv1_1, "conv1_2") 52 | self.pool1 = self.max_pool(self.conv1_2, 'pool1') 53 | 54 | self.conv2_1 = self.conv_layer(self.pool1, "conv2_1") 55 | self.conv2_2 = self.conv_layer(self.conv2_1, "conv2_2") 56 | self.pool2 = self.max_pool(self.conv2_2, 'pool2') 57 | 58 | self.conv3_1 = self.conv_layer(self.pool2, "conv3_1") 59 | self.conv3_2 = self.conv_layer(self.conv3_1, "conv3_2") 60 | self.conv3_3 = self.conv_layer(self.conv3_2, "conv3_3") 61 | self.pool3 = self.max_pool(self.conv3_3, 'pool3') 62 | 63 | self.conv4_1 = self.conv_layer(self.pool3, "conv4_1") 64 | self.conv4_2 = self.conv_layer(self.conv4_1, "conv4_2") 65 | self.conv4_3 = self.conv_layer(self.conv4_2, "conv4_3") 66 | self.pool4 = self.max_pool(self.conv4_3, 'pool4') 67 | 68 | self.conv5_1 = self.conv_layer(self.pool4, "conv5_1") 69 | self.conv5_2 = self.conv_layer(self.conv5_1, "conv5_2") 70 | self.conv5_3 = self.conv_layer(self.conv5_2, "conv5_3") 71 | 72 | def build_region(self): 73 | [self.rois_feat, _] = roi_pool(self.conv5_3, self.rois, 74 | self.roi_size, self.roi_size, 75 | self.roi_scale) 76 | 77 | # reshape tensor so that every channel's map are expanded 78 | # with rows unchanged 79 | conv_channels = self.rois_feat.get_shape().as_list()[-1] 80 | self.rois_feat_reshape1 = tf.reshape(self.rois_feat, 81 | [-1, self.roi_size ** 2, 82 | conv_channels]) 83 | self.rois_feat_transpose = tf.transpose(self.rois_feat_reshape1, 84 | perm = [0, 2, 1]) 85 | self.rois_feat_reshape2 = tf.reshape(self.rois_feat_transpose, 86 | [-1, self.roi_size ** 2 * 87 | conv_channels]) 88 | 89 | self.fc6 = self.fc_layer(self.rois_feat_reshape2, 'fc6', 90 | [self.roi_size ** 2 * 512, 4096]) 91 | self.relu6 = tf.nn.relu(self.fc6) 92 | 93 | #hand write dropout 94 | if self.train: 95 | self.relu6 = dropout(self.relu6, self.dropout_flag, 96 | self.dropout_ratio, 'fc6_dropout') 97 | 98 | self.fc7 = self.fc_layer(self.relu6, "fc7", [4096, self.feature_dim]) 99 | self.relu7 = tf.nn.relu(self.fc7) 100 | 101 | def conv_layer(self, bottom, name): 102 | with tf.variable_scope(name): 103 | filt = self.get_conv_filter(name) 104 | 105 | conv = tf.nn.conv2d(bottom, filt, [1, 1, 1, 1], padding='SAME') 106 | 107 | conv_biases = self.get_bias_conv(name) 108 | bias = tf.nn.bias_add(conv, conv_biases) 109 | 110 | relu = tf.nn.relu(bias) 111 | return relu 112 | 113 | def get_conv_filter(self, name): 114 | var = tf.Variable(self.data_dict[name]['weights'], name="filter", 115 | trainable = (self.lr > 0), dtype = tf.float32) 116 | wd = tf.multiply(tf.nn.l2_loss(var), self.weight_decay, name = 'weight_decay') 117 | tf.add_to_collection('img_net_weight_decay', wd) 118 | self.varlist_conv.append(var) 119 | return var 120 | 121 | def get_bias_conv(self, name): 122 | var = tf.Variable(self.data_dict[name]['biases'], name="biases", 123 | trainable = (self.lr > 0), dtype = tf.float32) 124 | self.varlist_conv.append(var) 125 | return var 126 | 127 | def max_pool(self, bottom, name): 128 | return tf.nn.max_pool(bottom, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME', name=name) 129 | 130 | def fc_layer(self, bottom, name, shape): 131 | with tf.variable_scope(name) as scope: 132 | weights = self.get_fc_weight(name, shape) 133 | biases = self.get_bias(name, [shape[1]]) 134 | 135 | # Fully connected layer. Note that the '+' operation automatically 136 | # broadcasts the biases. 
137 |             fc = tf.nn.bias_add(tf.matmul(bottom, weights), biases)
138 | 
139 |             return fc
140 | 
141 |     def get_bias(self, name, shape):
142 |         if self.data_dict.get(name):
143 |             init = tf.constant_initializer(
144 |                 value = self.data_dict[name]['biases'], dtype = tf.float32)
145 |         else:
146 |             init = tf.constant_initializer(0.015)
147 |             print('[WARNING] Region Feat Net %s layer\'s biases are randomly '
148 |                   'initialized with shape [%d]' % (name, shape[0]))
149 | 
150 |         var = tf.get_variable(name = 'bias', initializer = init,
151 |                               shape = shape, dtype = tf.float32)
152 |         self.varlist_region.append(var)
153 |         return var
154 | 
155 |     def get_fc_weight(self, name, shape):
156 |         if self.data_dict.get(name):
157 |             init = tf.constant_initializer(
158 |                 value = self.data_dict[name]['weights'], dtype = tf.float32)
159 |         else:
160 |             init = tf.random_normal_initializer(mean = 0.0, stddev = 0.0005)
161 |             print('[WARNING] Region Feat Net %s layer\'s weights are '
162 |                   'randomly initialized!' % name)
163 | 
164 |         var = tf.get_variable(name = 'weights', initializer = init,
165 |                               shape = shape, dtype = tf.float32)
166 |         weight_decay = tf.multiply(tf.nn.l2_loss(var), self.weight_decay,
167 |                                    name = 'weight_decay')
168 |         tf.add_to_collection('img_net_weight_decay', weight_decay)
169 |         self.varlist_region.append(var)
170 |         return var
171 | 
172 | def dropout(bottom, random_flag, dropout_ratio = 0.5, name = 'dropout'):
173 |     with tf.variable_scope(name):
174 |         drop_mask_r = tf.random_uniform(shape = tf.shape(bottom))
175 |         drop_mask_r = tf.cast(tf.greater(drop_mask_r, dropout_ratio),
176 |                               tf.float32)
177 |         drop_mask_v = tf.Variable(initial_value = np.zeros(1),
178 |             validate_shape = False, trainable = False, dtype = tf.float32)
179 |         # resample the mask only when random_flag == 1; otherwise reuse the
180 |         # stored mask so repeated passes see the same dropout pattern
181 |         assign_dropout = tf.cond(tf.equal(random_flag, 1),
182 |             lambda: tf.assign(drop_mask_v, drop_mask_r,
183 |                               validate_shape = False),
184 |             lambda: tf.identity(drop_mask_v))
185 |         return tf.div(tf.multiply(bottom, assign_dropout), (1 - dropout_ratio))
186 | 
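The hand-written `dropout` above implements inverted dropout: the binary mask is divided by `1 - dropout_ratio` at training time so the layer's expected output matches what the network sees at test time. A small NumPy check of that identity (illustrative only, not repository code):

import numpy as np

np.random.seed(0)
p = 0.5                                    # dropout_ratio
x = np.ones(1000000)
mask = (np.random.rand(x.size) > p).astype(np.float32)
y = x * mask / (1 - p)                     # same scaling as dropout() above
print(y.mean())                            # ~1.0: the expectation is preserved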
--------------------------------------------------------------------------------
/networks/net_wrapper.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import numpy as np
3 | import pdb
4 | from .image_feat_net.vgg16.net import Vgg16
5 | from .image_feat_net.resnet101.net import Resnet101
6 | from .image_feat_net.net import ImageFeatNet
7 | from .text_feat_net.net import TextFeatNet
8 | from .pair_net.net import PairNet
9 | from .base import Model
10 | 
11 | class NetWrapper(Model):
12 |     """
13 |     Wraps TextFeatNet, ImageFeatNet and PairNet together and provides a
14 |     uniform interface for the network training steps.
15 | 
16 |     Attributes:
17 |         sess: Tensorflow session
18 |         opt_image: optimizer for image end
19 |         opt_text: optimizer for text end
20 |         train: a boolean indicating if the network is currently in the
21 |             training phase, default is True
22 |         pair_net_max_batch_size: an integer indicating the maximum batch
23 |             size of the pair net, default is 500
24 |     """
25 |     def __init__(self, sess, image_net_type, image_lr_conv, image_lr_region, text_lr,
26 |                  pair_net_max_batch_size, train, image_init_npy, text_init_npy):
27 | 
28 |         self.image_net_type = image_net_type
29 |         self.image_lr_conv = image_lr_conv
30 |         self.image_lr_region = image_lr_region
31 |         self.text_lr = text_lr
32 | 
33 |         self.sess = sess
34 |         self.train = (train == 'train')
35 |         self.pair_net = PairNet(sess, pair_net_max_batch_size,
36 |                                 train = self.train)
37 |         self.image_net = ImageFeatNet(sess, self.image_lr_conv,
38 |                                       self.image_lr_region, train = self.train)
39 |         self.text_net_opt_type = 'Adam'
40 |         if self.image_lr_conv != 0 or self.image_lr_region != 0:
41 |             self.text_net_opt_type = 'SGD'
42 |         self.text_net = TextFeatNet(sess, self.text_lr, train = self.train,
43 |             opt_type = self.text_net_opt_type, TextNet_npy_path = text_init_npy)
44 |         if self.image_net_type == 'resnet101':
45 |             self.text_feature_dim = 2049
46 |             self.im_sub_net = Resnet101(RegionNet_npy_path = image_init_npy, train = self.train)
47 |         elif self.image_net_type == 'vgg16':
48 |             self.text_feature_dim = 4097
49 |             self.im_sub_net = Vgg16(self.image_lr_conv,
50 |                 RegionNet_npy_path = image_init_npy, train = self.train)
51 |         self.data_dict = None
52 |         self.varlist = None
53 | 
54 |     def set_input(self, data):
55 |         self.data_dict = data
56 | 
57 |     def build(self):
58 |         with tf.variable_scope('Text_Network'):
59 |             self.text_net.build(output_feature_dim = self.text_feature_dim)
60 | 
61 |         with tf.variable_scope('Image_Network'):
62 |             if self.image_net_type == 'resnet101':
63 |                 self.image_net.build(self.im_sub_net, feature_dim = 2048, roi_size = 14)
64 |             else:
65 |                 self.image_net.build(self.im_sub_net)
66 | 
67 |         with tf.variable_scope('Pair_Network'):
68 |             self.pair_net.build(im_feat = self.image_net.output,
69 |                                 dy_param = self.text_net.output,
70 |                                 feature_dim = self.text_feature_dim - 1)
71 | 
72 |         self.image_net.output_grad = (
73 |             self.pair_net.gradients_pool[self.image_net.output])
74 |         self.text_net.output_grad = (
75 |             self.pair_net.gradients_pool[self.text_net.output])
76 | 
77 |         self.image_net.accumulate()
78 |         self.text_net.accumulate()
79 |         self.varlist = self.image_net.sub_net.varlist_conv\
80 |                        + self.image_net.sub_net.varlist_region\
81 |                        + self.text_net.varlist\
82 |                        + self.text_net.varlist_relu
83 | 
84 |     def forward(self, compute_grads = True, compute_loss = True):
85 |         self.image_net.set_input(self.data_dict['images'],
86 |                                  self.data_dict['rois'])
87 |         self.image_net.forward()
88 |         self.text_net.set_input(self.data_dict['phrases'])
89 |         self.text_net.forward()
90 |         self.pair_net.set_input(self.data_dict['roi_ids'],
91 |                                 self.data_dict['phrase_ids'],
92 |                                 self.data_dict['labels'],
93 |                                 self.data_dict['loss_weights'],
94 |                                 self.data_dict['sources'])
95 |         self.pair_net.forward(compute_grads, compute_loss)
96 | 
97 |     def backward(self):
98 |         self.pair_net.backward()
99 |         self.text_net.backward()
100 |         self.image_net.backward()
101 | 
102 |     def forward_backward(self):
103 |         self.forward()
104 |         self.backward()
105 | 
106 |     def get_output(self, current_iter = 0):
107 |         self.output = self.pair_net.get_output()
108 |         if current_iter != 0:
109 |             self.show_result(current_iter)
110 |         return self.output
111 | 
112 |     def show_result(self, current_iter):
113 |         self.prediction = self.output[1] > 0.5
114 |         total_pos = np.sum(self.data_dict['labels'] == 1)
115 |         total_predict = np.sum(self.prediction == 1)
116 |         self.recall = (np.sum((self.data_dict['labels'] == 1) *
117 |                               (self.data_dict['labels'] ==
118 |                                self.prediction[:, 0]))
119 |                        / total_pos)
120 |         self.precision = (np.sum((self.data_dict['labels'] == 1) *
121 |                                  (self.data_dict['labels'] ==
122 |                                   self.prediction[:, 0]))
123 |                           / total_predict)
124 |         # print results
125 |         print('Iter: %d' % current_iter)
126 |         print('Looked images:', self.data_dict['image_ids'])
127 |         print('\t[$$]Precision: %f, Recall: %f' % (self.precision,
128 |                                                    self.recall))
129 |         print('\t[TL]Total loss is %f' % self.output[0])
130 |         print('\t[PL]Raw positive loss is %f' % self.output[2])
131 |         print('\t[NL]Raw negative loss is %f' % self.output[3])
132 |         print('\t[RL]Raw rest loss is %f\n' % self.output[4])
--------------------------------------------------------------------------------
/networks/pair_net/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuanluya/dbnet_tensorflow/991bba69fed28f96dd289bd393d566098f3185c2/networks/pair_net/__init__.py
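Taken together, the classes above give NetWrapper a simple train-step contract: feed a batch dict, run forward/backward, read the losses. Below is a hedged sketch of that loop, not repository code; the hyperparameter values and the `load_batch` helper are illustrative, but the dict keys are the ones `forward()` actually reads.

import tensorflow as tf
from networks.net_wrapper import NetWrapper

sess = tf.Session()
# constructor arguments follow NetWrapper.__init__; values are examples
net = NetWrapper(sess, 'vgg16', image_lr_conv=0.0, image_lr_region=1e-3,
                 text_lr=1e-3, pair_net_max_batch_size=500, train='train',
                 image_init_npy='frcnn_Region_Feat_Net.npy',
                 text_init_npy='Text_Feat_Net.npy')
net.build()
sess.run(tf.global_variables_initializer())

for it in range(1, 1001):
    # hypothetical data-loading call; see data_loader.py for the real one.
    # The dict needs: 'images', 'rois', 'phrases', 'roi_ids', 'phrase_ids',
    # 'labels', 'loss_weights', 'sources', 'image_ids'
    batch = load_batch()
    net.set_input(batch)
    net.forward_backward()
    losses = net.get_output(current_iter=it)  # also prints precision/recall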
--------------------------------------------------------------------------------
/networks/pair_net/net.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import numpy as np
3 | import os
4 | import pdb
5 | from ..base_net.net import BaseNet
6 | 
7 | class PairNet(BaseNet):
8 |     """ Network model for Pair Net
9 |     Attributes:
10 |         sess: Tensorflow session
11 |         max_batch_size: maximum number of region-phrase pairs scored per run
12 |         train: whether the network is in the training phase
13 |     """
14 |     def __init__(self, sess, max_batch_size, train = True):
15 |         super(PairNet, self).__init__(sess)
16 |         # model hyperparameters
17 |         self.max_batch_size = max_batch_size
18 |         self.train = train
19 |         self.epsilon = 1e-6
20 | 
21 |         # physical inputs should be numpy arrays
22 |         self.image_ids = None       #[Total ~ M X N]
23 |         self.text_ids = None        #[Total]
24 |         self.p_labels = None        #[Total]
25 |         self.p_loss_weights = None  #[Total]
26 |         self.p_sources = None       #[Total]
27 |         self.batch_total = tf.placeholder(tf.float32, name = 'batch_size')
28 |         self.pos_batch_total = tf.placeholder(tf.float32,
29 |                                               name = 'pos_batch_size')
30 |         self.neg_batch_total = tf.placeholder(tf.float32,
31 |                                               name = 'neg_batch_size')
32 |         self.res_batch_total = tf.placeholder(tf.float32,
33 |                                               name = 'res_batch_size')
34 | 
35 |         # physical outputs
36 |         self.p_loss = None
37 |         self.p_sim = None
38 | 
39 |     def build(self, im_feat = tf.placeholder(tf.float32, name = 'im_feat'),
40 |               dy_param = tf.placeholder(tf.float32, name = 'dy_param'),
41 |               feature_dim = 4096, image_dropout = 0.3,
42 |               text_dropout = 0.3, weight_decay = 1e-7):
43 | 
44 |         # model parameters
45 |         self.feature_dim = feature_dim
46 |         self.weight_decay = weight_decay
47 |         self.image_dropout = image_dropout
48 |         self.text_dropout = text_dropout
49 |         if not self.train:
50 |             self.image_dropout = 0
51 |             self.text_dropout = 0
52 | 
53 |         # used for data loader
54 |         # [batch_size, feature_dim (+ 1)] for im_feat and dy_param
55 |         self.im_feat = im_feat
56 |         self.dy_param = dy_param
57 |         self.im_idx = tf.placeholder(tf.int32, name = 'im_idx')
58 |         self.txt_idx = tf.placeholder(tf.int32, name = 'txt_idx')
59 |         self.labels = tf.placeholder(tf.int32, name = 'labels')
60 |         self.loss_weights = tf.placeholder(tf.float32, name = 'loss_weights')
61 |         self.sources = tf.placeholder(tf.int32, name = 'pair_sources')
62 | 
63 |         #######################################################################
64 |         ########################  NETWORK STARTS  #########################
65 |         #######################################################################
66 |         self.im_feat_chosen = tf.gather(self.im_feat, self.im_idx)
67 |         self.dy_param_chosen = tf.gather(self.dy_param, self.txt_idx)
68 | 
69 |         self.im_feat_dropout = tf.nn.dropout(self.im_feat_chosen,
70 |                                              1 - self.image_dropout)
71 |         self.dy_param_decay = self.weight_decay * tf.to_double(tf.nn.l2_loss(self.dy_param_chosen))
72 | 
73 |         # prepare kernel
74 |         self.dy_kernel = tf.slice(self.dy_param_chosen,
75 |                                   [0, 0], [-1, self.feature_dim])
76 |         self.dy_bias = tf.slice(self.dy_param_chosen,
77 |                                 [0, self.feature_dim], [-1, 1])
78 |         self.dy_kernel_dropout = tf.nn.dropout(self.dy_kernel, 1 - self.text_dropout)
79 | 
80 |         # get binary classification score
81 |         self.cls_pre = tf.reduce_sum(tf.multiply(self.im_feat_dropout,
82 |                                                  self.dy_kernel_dropout), 1)
83 |         self.cls_single = tf.add(tf.expand_dims(self.cls_pre, -1), self.dy_bias)
84 |         self.cls = tf.concat([-1 * self.cls_single, self.cls_single], axis = 1)
85 |         #logit = lambda x: np.log(x) - np.log(1 - x)
86 |         #self.cls = tf.clip_by_value(tf.add(tf.expand_dims(self.cls_pre, -1), self.dy_bias), logit(self.epsilon), logit(1 - self.epsilon))
87 | 
88 |         self.sim = tf.slice(tf.nn.softmax(self.cls), [0, 1], [-1, 1])
89 |         self.full_labels = tf.one_hot(self.labels, 2)
90 |         losses = tf.nn.softmax_cross_entropy_with_logits(
91 |             labels = self.full_labels, logits = self.cls)
92 |         self.losses = tf.expand_dims(tf.multiply(self.loss_weights, losses), -1)
93 |         self.pos_mask = tf.expand_dims(tf.equal(self.sources, 1), -1)
94 |         self.neg_mask = tf.expand_dims(tf.equal(self.sources, 0), -1)
95 |         self.rest_mask = tf.expand_dims(tf.equal(self.sources, 2), -1)
96 | 
97 |         pos_mask = tf.cast(self.pos_mask, tf.float32)
98 |         neg_mask = tf.cast(self.neg_mask, tf.float32)
99 |         rest_mask = tf.cast(self.rest_mask, tf.float32)
100 | 
101 |         self.pos_loss = tf.reduce_sum(tf.multiply(self.losses, pos_mask)) / self.pos_batch_total
102 |         self.neg_loss = tf.reduce_sum(tf.multiply(self.losses, neg_mask)) / self.neg_batch_total
103 |         self.rest_loss = tf.reduce_sum(tf.multiply(self.losses, rest_mask)) / self.res_batch_total
104 |         self.cls_loss = self.pos_loss + self.neg_loss + self.rest_loss
105 | 
106 |         # accumulate gradients
107 |         self.current_batch_size = tf.to_float(tf.shape(self.im_feat_chosen)[0])
108 |         self.loss = tf.to_float(self.dy_param_decay) + self.cls_loss
109 |         self.xs = [self.im_feat, self.dy_param]
110 |         self.ys = [self.loss]
111 |         self.accumulate()
112 | 
113 |     def set_input(self, image_ids, text_ids, labels = None,
114 |                   loss_weights = None, sources = None):
115 |         self.image_ids = image_ids
116 |         self.text_ids = text_ids
117 |         self.p_labels = labels
118 |         self.p_loss_weights = loss_weights
119 |         self.p_sources = sources
120 | 
121 |     def get_output(self):
122 |         return (self.p_loss, self.p_sim, self.p_pos_loss, self.p_neg_loss,
123 |                 self.p_rest_loss)
124 | 
125 |     def forward(self, compute_grads = True, compute_loss = True):
126 |         # initialize accumulating variables for this batch
127 |         current_im_idx = 0
128 |         self.p_loss = 0
129 |         self.p_pos_loss = 0
130 |         self.p_neg_loss = 0
131 |         self.p_rest_loss = 0
132 |         self.p_decay = 0
133 |         self.p_sim = np.zeros((0, 1))
134 | 
135 |         # set parameters for this batch
136 |         total_pos = np.sum(self.p_sources == 1)
137 |         assert total_pos == np.sum(self.p_labels == 1)
138 |         total_neg = np.sum(self.p_sources == 0)
139 |         total_res = np.sum(self.p_sources == 2)
140 |         self.p_batch_sizes = self.text_ids.shape[0] * 1.0
141 | 
142 |         self.batch_size = self.max_batch_size
143 |         # start accumulating gradients over sub-batches
144 |         while current_im_idx < self.text_ids.shape[0]:
145 |             if compute_grads:
146 |                 [p_loss, p_pos_loss, p_neg_loss, p_rest_loss, dy_param_decay, p_sim, _] = (
147 |                     self.sess.run(
148 |                         [self.loss, self.pos_loss, self.neg_loss, self.rest_loss, self.dy_param_decay, self.sim, self.accumulate_grad],
149 |                         feed_dict = {
150 |                             self.im_idx:
151 |                                 self.image_ids[current_im_idx: current_im_idx + self.batch_size],
152 |                             self.txt_idx:
153 |                                 self.text_ids[current_im_idx: current_im_idx + self.batch_size],
154 |                             self.labels:
155 |                                 self.p_labels[current_im_idx: current_im_idx + self.batch_size],
156 |                             self.loss_weights:
157 |                                 self.p_loss_weights[current_im_idx: current_im_idx + self.batch_size],
158 |                             self.sources:
159 |                                 self.p_sources[current_im_idx: current_im_idx + self.batch_size],
160 |                             self.batch_total: self.p_batch_sizes,
161 |                             self.pos_batch_total: total_pos,
162 |                             self.neg_batch_total: total_neg,
163 |                             self.res_batch_total: total_res,
164 |                             self.batch_num: current_im_idx}))
165 |             elif compute_loss:
166 |                 [p_loss, p_pos_loss, p_neg_loss, p_rest_loss, dy_param_decay, p_sim] = (
167 |                     self.sess.run(
168 |                         [self.loss, self.pos_loss, self.neg_loss,
169 |                          self.rest_loss, self.dy_param_decay, self.sim],
170 |                         feed_dict = {
171 |                             self.im_idx:
172 |                                 self.image_ids[current_im_idx: current_im_idx + self.batch_size],
173 |                             self.txt_idx:
174 |                                 self.text_ids[current_im_idx: current_im_idx + self.batch_size],
175 |                             self.labels:
176 |                                 self.p_labels[current_im_idx: current_im_idx + self.batch_size],
177 |                             self.loss_weights:
178 |                                 self.p_loss_weights[current_im_idx: current_im_idx + self.batch_size],
179 |                             self.sources:
180 |                                 self.p_sources[current_im_idx: current_im_idx + self.batch_size],
181 |                             self.batch_total: self.p_batch_sizes,
182 |                             self.pos_batch_total: total_pos,
183 |                             self.res_batch_total: total_res,
184 |                             self.neg_batch_total: total_neg}))
185 |             else:
186 |                 [p_sim] = self.sess.run([self.sim],
187 |                     feed_dict = {
188 |                         self.im_idx:
189 |                             self.image_ids[current_im_idx: current_im_idx + self.batch_size],
190 |                         self.txt_idx:
191 |                             self.text_ids[current_im_idx: current_im_idx + self.batch_size]})
192 |                 p_loss = None
193 |             if compute_loss:
194 |                 self.p_loss += p_loss
195 |                 self.p_pos_loss += p_pos_loss
196 |                 self.p_neg_loss += p_neg_loss
197 |                 self.p_rest_loss += p_rest_loss
198 |                 self.p_decay += dy_param_decay
199 |             self.p_sim = np.concatenate((self.p_sim, p_sim))
200 |             current_im_idx += self.batch_size
201 | 
202 |             # avoid a tiny trailing sub-batch
203 |             if self.p_batch_sizes > current_im_idx + self.batch_size and\
204 |                self.p_batch_sizes - current_im_idx - self.batch_size < 0.25 * self.batch_size:
205 |                 self.batch_size = int(self.p_batch_sizes - current_im_idx)
206 |         return
--------------------------------------------------------------------------------
/networks/text_feat_net/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yuanluya/dbnet_tensorflow/991bba69fed28f96dd289bd393d566098f3185c2/networks/text_feat_net/__init__.py
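The scoring math in PairNet's `build()` above is compact: per pair, the region feature is dotted with the phrase's dynamic kernel, the dynamic bias is added, and the two-way softmax over `[-s, s]` is equivalent to a sigmoid of `2s`. A NumPy sketch with toy sizes (illustrative, not repository code):

import numpy as np

feature_dim = 4
im_feat = np.random.randn(feature_dim)
dy_param = np.random.randn(feature_dim + 1)            # last entry is the bias
kernel, bias = dy_param[:feature_dim], dy_param[feature_dim]

s = im_feat.dot(kernel) + bias                         # cls_single in the graph
logits = np.array([-s, s])                             # self.cls
sim = np.exp(logits)[1] / np.exp(logits).sum()         # softmax "match" probability
assert np.isclose(sim, 1.0 / (1.0 + np.exp(-2 * s)))   # i.e. sigmoid(2s)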
--------------------------------------------------------------------------------
/networks/text_feat_net/net.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import numpy as np
3 | import os
4 | import pdb
5 | import inspect
6 | 
7 | class TextFeatNet:
8 |     """ Network for Text Feat Net
9 |     Attributes:
10 |         sess: Tensorflow session
11 |         opt: optimizer for the text pathway
12 |         max_batch_size: maximum sub-batch size for the text pathway
13 |         train: whether the network is in the training phase
14 |         sequence_length: length of the input character sequence
15 |         TextNet_npy_path: path to the pretrained npy snapshot
16 |     """
17 |     def __init__(self, sess, lr, max_batch_size = 32,
18 |                  train = True, sequence_length = 256, opt_type = 'Adam',
19 |                  TextNet_npy_path = 'Text_Feat_Net.npy'):
20 |         # load saved model
21 |         try:
22 |             path = inspect.getfile(TextFeatNet)
23 |             path = os.path.abspath(os.path.join(path, os.pardir))
24 |             TextNet_npy_path = os.path.join(path, TextNet_npy_path)
25 |             self.data_dict = np.load(TextNet_npy_path, encoding='latin1').item()
26 |             print("Text Feat Net npy file loaded")
27 |         except:
28 |             self.data_dict = {}
29 |             print('[WARNING] Text Feat Net npy file not found, training from scratch!')
30 |         self.lr = lr
31 |         self.backward_counter = 0
32 |         self.opt_type = opt_type
33 |         if self.opt_type == 'SGD':
34 |             self.opt = tf.train.GradientDescentOptimizer(self.lr)
35 |             self.opt_relu = tf.train.GradientDescentOptimizer(0.1 * self.lr)
36 |         elif self.opt_type == 'Adam':
37 |             self.opt = tf.train.AdamOptimizer(self.lr)
38 |             self.opt_relu = tf.train.AdamOptimizer(0.1 * self.lr)
39 |         self.varlist = []
40 |         self.varlist_relu = []
41 |         # model hyperparameters
42 |         self.sess = sess
43 |         self.sequence_length = sequence_length
44 |         self.max_batch_size = max_batch_size
45 |         self.train = train
46 | 
47 |         # physical outputs
48 |         self.p_dy_param = None
49 | 
50 |     def build(self, output_grad = tf.placeholder(tf.float32),
51 |               input_feature_dim = 74, weight_decay = 5e-4,
52 |               relu_coef = 0.1, batch_size = 16,
53 |               output_feature_dim = 4097):
54 | 
55 |         # optimization utility
56 |         self.output_grad = output_grad
57 | 
58 |         self.batch_size = batch_size
59 |         assert self.batch_size < self.max_batch_size
60 | 
61 |         # model parameters
62 |         self.input_feature_dim = input_feature_dim
63 |         self.output_feature_dim = output_feature_dim
64 |         self.weight_decay = weight_decay
65 |         self.relu_coef = relu_coef
66 | 
67 |         # physical inputs should be numpy arrays
68 |         self.texts = tf.placeholder(tf.float32,
69 |                                     shape = [None, 1, self.sequence_length,
70 |                                              self.input_feature_dim])
71 |         self.p_texts = None
72 | 
73 |         #######################################################################
74 |         ########################  NETWORK STARTS  #########################
75 |         #######################################################################
76 | 
77 |         # conv1
78 |         self.conv1 = self.conv_layer(self.texts, 'conv1',
79 |                                      [1, 7, self.input_feature_dim, 256])
80 |         self.conv1_relu = self.lrelu(self.conv1, self.relu_coef, 'conv1_relu')
81 |         self.conv1_pool = tf.nn.max_pool(self.conv1_relu,
82 |                                          [1, 1, 2, 1], [1, 1, 2, 1],
83 |                                          padding = 'VALID')
84 | 
85 |         # conv2
86 |         self.conv2_1 = self.conv_layer(self.conv1_pool, 'conv2_1',
87 |                                        [1, 7, 256, 256])
88 |         self.conv2_1_relu = self.lrelu(self.conv2_1, self.relu_coef, 'conv2_1_relu')
89 |         self.conv2_2 = self.conv_layer(self.conv2_1_relu, 'conv2_2',
90 |                                        [1, 3, 256, 256])
91 |         self.conv2_2_relu = self.lrelu(self.conv2_2, self.relu_coef, 'conv2_2_relu')
92 |         self.conv2_3 = self.conv_layer(self.conv2_2_relu, 'conv2_3',
93 |                                        [1, 3, 256, 256])
94 |         self.conv2_3_relu = self.lrelu(self.conv2_3, self.relu_coef, 'conv2_3_relu')
95 |         self.conv2_3_pool = tf.nn.max_pool(self.conv2_3_relu,
96 |                                            [1, 1, 2, 1], [1, 1, 2, 1],
97 |                                            padding = 'VALID',
98 |                                            name = 'conv2_3_pooling')
99 | 
100 |         # conv3
101 |         self.conv3_1 = self.conv_layer(self.conv2_3_pool, 'conv3_1',
102 |                                        [1, 3, 256, 512])
103 |         self.conv3_1_relu = self.lrelu(self.conv3_1, self.relu_coef, 'conv3_1_relu')
104 |         self.conv3_2 = self.conv_layer(self.conv3_1_relu, 'conv3_2',
105 |                                        [1, 3, 512, 512])
106 |         self.conv3_2_relu = self.lrelu(self.conv3_2, self.relu_coef, 'conv3_2_relu')
107 |         self.conv3_2_pool = tf.nn.max_pool(self.conv3_2_relu,
108 |                                            [1, 1, 2, 1], [1, 1, 2, 1],
109 |                                            padding = 'VALID',
110 |                                            name = 'conv3_2_pooling')
111 | 
112 |         # fully connected
113 |         expand_size = self.conv3_2_pool.get_shape().as_list()
114 |         self.conv3_2_reshape_1 = tf.reshape(self.conv3_2_pool,
115 |             [-1, expand_size[1] * expand_size[2], expand_size[3]])
116 |         self.conv3_2_transpose = tf.transpose(self.conv3_2_reshape_1, [0, 2, 1])
117 |         self.conv3_2_reshape_2 = tf.reshape(self.conv3_2_transpose,
118 |             [-1, expand_size[1] * expand_size[2] * expand_size[3]])
119 | 
120 |         self.fc4 = self.fc_layer(self.conv3_2_reshape_2, 'fc4', 2048, 0.1, 0.01)
121 |         self.fc4_relu = self.lrelu(self.fc4, self.relu_coef, 'fc4_relu')
122 | 
123 |         # dynamic filters
124 |         self.pre_dy_fc1 = self.fc_layer(self.fc4_relu, 'pre_dy_fc1', 2048, bias_decay = True)
125 |         self.pre_dy_fc1_relu = self.lrelu(self.pre_dy_fc1, self.relu_coef,
126 |                                           'pre_dy_fc1_relu', constant = False)
127 |         self.pre_dy_fc2 = self.fc_layer(self.pre_dy_fc1_relu,
128 |                                         'pre_dy_fc2', 2048, bias_decay = True)
129 |         self.pre_dy_fc2_relu = self.lrelu(self.pre_dy_fc2, 1.5 * self.relu_coef,
130 |                                           'pre_dy_fc2_relu', constant = False)
131 |         self.dy_param = self.fc_layer(self.pre_dy_fc2_relu,
132 |                                       'dy_param', self.output_feature_dim, bias_decay = True)
133 | 
134 |         self.output = tf.Variable(initial_value = 1.0, trainable = False,
135 |                                   validate_shape = False, dtype = tf.float32)
136 |         self.fetch_output = tf.assign(self.output, self.dy_param,
137 |             validate_shape = False)  # run by forward(); kept distinct from the get_output() method
138 | 
139 |         # gather weight decays
140 |         self.wd = tf.add_n(tf.get_collection('txt_net_weight_decay'),
141 |                            name = 'txt_net_total_weight_decay')
142 | 
143 |     def accumulate(self):
144 |         # gradients calculation
145 |         self.ys = [self.wd, self.dy_param]
146 |         self.grad_ys = [1.0, self.output_grad]
147 | 
148 |         self.gradients = tf.gradients(self.ys, self.varlist, grad_ys = self.grad_ys)
149 |         self.gradients_relu = tf.gradients(self.ys, self.varlist_relu, grad_ys = self.grad_ys)
150 | 
151 |         self.grad_and_vars = []
152 |         self.grad_and_vars_relu = []
153 | 
154 |         for idx, var in enumerate(self.varlist):
155 |             self.grad_and_vars.append((tf.clip_by_value(self.gradients[idx], -10, 10), var))
156 |         for idx, var in enumerate(self.varlist_relu):
157 |             self.grad_and_vars_relu.append((self.gradients_relu[idx], var))
158 | 
159 |         with tf.control_dependencies(self.gradients + self.gradients_relu):
160 |             self.train_op = tf.group(self.opt.apply_gradients(self.grad_and_vars),
161 |                                      self.opt_relu.apply_gradients(self.grad_and_vars_relu))
162 |         self.safe_ops = {}
163 |         for v in self.varlist:
164 |             self.safe_ops[v] = tf.assign(v, tf.where(tf.is_finite(v), v, 1e-25 * tf.ones_like(v)))
165 | 
166 |     def set_input(self, texts):
167 |         self.p_texts = texts
168 | 
169 |     def get_output(self):
170 |         return self.p_dy_param
171 | 
172 |     def forward(self, physical_output = False):
173 |         if physical_output:
174 |             [self.p_dy_param] = self.sess.run([self.fetch_output],
175 |                                               feed_dict = {self.texts:
176 |                                                            self.p_texts})
177 |         else:
178 |             self.sess.run([self.fetch_output],
179 |                           feed_dict = {self.texts: self.p_texts})
180 | 
181 |         return
182 | 

    def backward(self):
        self.sess.run(self.train_op, feed_dict = {self.texts: self.p_texts})
        return

    #shape: [h, w, in_channel, out_channel]
    def conv_layer(self, bottom, name, shape,
                   strides = [1, 1, 1, 1], weight_init_std = 0.1, bias_init_value = 0.01):
        conv_filter = self.get_weight(name, shape, weight_init_std)
        biases = self.get_bias(name, shape[3], bias_init_value)
        conv = tf.nn.conv2d(bottom, conv_filter, strides, 'SAME')
        result = tf.nn.bias_add(conv, biases)
        return result

    def fc_layer(self, bottom, name, output_shape, weight_init_std = 0.001, bias_init_value = 0.0, bias_decay = False):
        weights = self.get_weight(name, [bottom.get_shape()[1], output_shape], weight_init_std)
        biases = self.get_bias(name, [output_shape], bias_init_value, bias_decay)
        fc = tf.nn.bias_add(tf.matmul(bottom, weights), biases)
        return fc

    #leaky ReLU written as f1 * x + f2 * |x|, which equals x for x > 0 and
    #leak * x for x < 0; with constant = False the slope is a trainable variable
    def lrelu(self, x, leak = 0.1, name = 'lrelu', constant = True):
        if not constant:
            if self.data_dict.get(name) is not None:
                init = tf.constant_initializer(
                        value = self.data_dict[name], dtype = tf.float32)
            else:
                init = tf.constant_initializer(leak)
            x_shape = x.get_shape().as_list()
            x_shape[0] = 1
            with tf.variable_scope(name):
                leak = tf.get_variable(name = 'relu_params', initializer = init,
                                       shape = x_shape, dtype = tf.float32)
            self.varlist_relu.append(leak)
        f1 = 0.5 * (1 + leak)
        f2 = 0.5 * (1 - leak)
        return f1 * x + f2 * abs(x)

    def get_bias(self, name, shape, init_value = 0.0, weight_decay = False):
        if self.data_dict.get(name) is not None:
            init = tf.constant_initializer(
                    value = self.data_dict[name]['biases'], dtype = tf.float32)
        else:
            init = tf.constant_initializer(init_value)
            print('[WARNING] %s/bias has no pretrained value, using random init!' % name)
        with tf.variable_scope(name):
            var = tf.get_variable(name = 'bias', initializer = init,
                                  shape = shape, dtype = tf.float32)
        #only decay the bias when the caller requests it (see bias_decay in fc_layer)
        if weight_decay:
            decay = tf.multiply(tf.nn.l2_loss(var), self.weight_decay,
                                name = 'weight_decay')
            tf.add_to_collection('txt_net_weight_decay', decay)
        self.varlist.append(var)
        return var

    def get_weight(self, name, shape, init_std = 0.01):
        if self.data_dict.get(name) is not None:
            init = tf.constant_initializer(
                    value = self.data_dict[name]['weights'], dtype = tf.float32)
        else:
            init = tf.random_normal_initializer(mean = 0.0, stddev = init_std)
            print('[WARNING] %s/weights has no pretrained value, using random init!' % name)
        with tf.variable_scope(name):
            var = tf.get_variable(name = 'weights', initializer = init,
                                  shape = shape, dtype = tf.float32)
        weight_decay = tf.multiply(tf.nn.l2_loss(var), self.weight_decay,
                                   name = 'weight_decay')
        tf.add_to_collection('txt_net_weight_decay', weight_decay)
        self.varlist.append(var)
        return var
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
opencv-python  # provides the cv2 module
easydict==1.7
numpy==1.12.1
scipy==0.19.0
tensorflow-gpu==1.1.0
Pillow
pyx
--------------------------------------------------------------------------------
/setup.sh:
--------------------------------------------------------------------------------
#!/bin/sh
# ===========================
# Usage: ./setup.sh (model|data)?
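# With no argument, both the pre-trained models and the data are downloaded
# (this mirrors the case statement at the bottom of the script):
#   ./setup.sh         # models + data
#   ./setup.sh model   # only the pre-trained models (.npy)
#   ./setup.sh data    # only the data (.json)
# ===========================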

# use a quiet progress bar when this wget supports it
if wget --help | grep -q 'show-progress'; then
    WGET_FLAG="-q --show-progress"
else
    WGET_FLAG=""
fi

# create a tmp directory for the downloaded data
TMP_DIR="./tmp_download"
mkdir -p "${TMP_DIR}"

# download the pre-trained models
download_model()
{
    # directory for the models
    MODEL_TAR_BALL="${TMP_DIR}/pretrained_model.tar.gz"
    MODEL_DIR="${TMP_DIR}/pretrained_model"
    mkdir -p "${MODEL_DIR}"

    MODEL_URL="http://www.ytzhang.net/files/dbnet/tensorflow/dbnet-pretrained.tar.gz"
    echo "Downloading pre-trained models ..."
    wget ${WGET_FLAG} "${MODEL_URL}" -O "${MODEL_TAR_BALL}"
    echo "Uncompressing pre-trained models ..."
    tar -xzf "${MODEL_TAR_BALL}" -C "${MODEL_DIR}"

    # move the models to their default directories
    VGG_REGION_NET_DIR="./networks/image_feat_net/vgg16"
    RESNET_REGION_NET_DIR="./networks/image_feat_net/resnet101"
    TEXT_NET_DIR="./networks/text_feat_net"
    echo "Moving pre-trained image network model to ${VGG_REGION_NET_DIR} ..."
    mv "${MODEL_DIR}/vgg16_Region_Feat_Net.npy" "${VGG_REGION_NET_DIR}/Region_Feat_Net.npy"
    mv "${MODEL_DIR}/vgg16_frcnn_Region_Feat_Net.npy" "${VGG_REGION_NET_DIR}/frcnn_Region_Feat_Net.npy"
    echo "Moving pre-trained image network model to ${RESNET_REGION_NET_DIR} ..."
    mv "${MODEL_DIR}/resnet101_Region_Feat_Net.npy" "${RESNET_REGION_NET_DIR}/Region_Feat_Net.npy"
    mv "${MODEL_DIR}/resnet101_frcnn_Region_Feat_Net.npy" "${RESNET_REGION_NET_DIR}/frcnn_Region_Feat_Net.npy"
    echo "Moving pre-trained text network model to ${TEXT_NET_DIR} ..."
    mv "${MODEL_DIR}"/*Text*.npy "${TEXT_NET_DIR}"
}

# download the data
download_data()
{
    # directory for the data
    DATA_TAR_BALL="${TMP_DIR}/data.tar.gz"
    DATA_DIR="./data"
    mkdir -p "${DATA_DIR}"

    DATA_URL="http://www.ytzhang.net/files/dbnet/data/vg_v1_json.tar.gz"
    echo "Downloading data ..."
    wget ${WGET_FLAG} "${DATA_URL}" -O "${DATA_TAR_BALL}"
    echo "Uncompressing data ..."
    tar -xzf "${DATA_TAR_BALL}" -C "${DATA_DIR}"
}

# with no argument, download everything
if [ $# -eq 0 ]; then
    download_model
    download_data
else
    case $1 in
        "model") download_model
            ;;
        "data") download_data
            ;;
        *) echo "Usage: ./setup.sh [OPTION]"
            echo ""
            echo "No option will download both model and data."
            echo ""
            printf "OPTION:\n\tmodel: only download the pre-trained models (.npy)\n"
            printf "\tdata: only download the data (.json)\n"
            ;;
    esac
fi

# clear the tmp files
rm -rf "${TMP_DIR}"
--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------
import numpy as np
import os
import json
import scipy.io as sio
import cv2
import itertools
import operator
import time
import sys
import pdb
from utils import (get_scaled_im_tensor, get_scaled_roi,
                   get_txt_tensor, im2rid, rid2r)
from config import ENV_PATHS, DS_CONFIG
sys.path.append(ENV_PATHS.EDGE_BOX_RPN)

level1_im2p = json.load(open(ENV_PATHS.LEVEL1_TEST, 'r'))
level2_im2p = json.load(open(ENV_PATHS.LEVEL2_TEST, 'r'))

def get_edgeboxes_test(img_id, top_num):
    try:
        raw_boxes = sio.loadmat(os.path.join(ENV_PATHS.EDGEBOX_PATH,
                str(img_id) + '.mat'))['bbs'][0: top_num, :]
    except IOError:
        #no cached proposals for this image: compute them on the fly and
        #convert the returned windows to [x, y, w, h]
        import edge_boxes
        raw_boxes_ = edge_boxes.get_windows([img_id])[0][0: top_num, :]
        raw_boxes = np.zeros(raw_boxes_.shape)
        raw_boxes[:, 0] = raw_boxes_[:, 1]
        raw_boxes[:, 1] = raw_boxes_[:, 0]
        raw_boxes[:, 2] = raw_boxes_[:, 3] - raw_boxes_[:, 1] + 1
        raw_boxes[:, 3] = raw_boxes_[:, 2] - raw_boxes_[:, 0] + 1

    #named boxes (not edge_boxes) to avoid shadowing the module imported above
    boxes = np.zeros((0, 4))
    boxes = np.concatenate((boxes, raw_boxes[:, 0:4]))
    return boxes

# NMS referenced from http://www.pyimagesearch.com/2015/02/16/faster-non-maximum-suppression-python/
# each box is in the format [x1, y1, x2, y2, score]
def non_max_suppression(boxes, overlapThresh):
    # if there are no boxes, return an empty list
    if len(boxes) == 0:
        return []

    # if the bounding boxes are integers, convert them to floats --
    # this is important since we'll be doing a bunch of divisions
    if boxes.dtype.kind == "i":
        boxes = boxes.astype("float")

    # initialize the list of picked indexes
    pick = []

    # grab the coordinates of the bounding boxes
    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]
    scores = boxes[:, 4]

    # compute the area of the bounding boxes and sort the bounding
    # boxes by the score
    area = (x2 - x1 + 1) * (y2 - y1 + 1)
    idxs = np.argsort(scores)

    # keep looping while some indexes still remain in the indexes
    # list
    while len(idxs) > 0:
        # grab the last index in the indexes list and add the
        # index value to the list of picked indexes
        last = len(idxs) - 1
        i = idxs[last]
        pick.append(i)

        # find the largest (x, y) coordinates for the start of
        # the bounding box and the smallest (x, y) coordinates
        # for the end of the bounding box
        xx1 = np.maximum(x1[i], x1[idxs[:last]])
        yy1 = np.maximum(y1[i], y1[idxs[:last]])
        xx2 = np.minimum(x2[i], x2[idxs[:last]])
        yy2 = np.minimum(y2[i], y2[idxs[:last]])

        # compute the width and height of the intersection
        w = np.maximum(0, xx2 - xx1 + 1)
        h = np.maximum(0, yy2 - yy1 + 1)

        # compute the ratio of overlap (intersection over union)
        overlap = (w * h) / (area[i] + area[idxs[:last]] - w * h)

        # delete the picked index and all indexes whose overlap with it
        # exceeds the threshold
        idxs = np.delete(idxs, np.concatenate(([last],
                np.where(overlap > overlapThresh)[0])))

    # return only the bounding boxes that were picked
    return boxes[pick]

def get_pairs_test(img_id, level, edge_box_max, gt_box):
    # pairs are (region_id, t_id)
    pair_list = []
    region_ids = im2rid.get(str(img_id))
    if region_ids is None:
        region_ids = []
    edgebox_regions = get_edgeboxes_test(img_id, edge_box_max)
    edgebox_id = "edgebox_%s" % str(img_id)
    region_id = "region_%s" % str(img_id)

    # region ids are strings of the form <source>_<image id>_<counter>,
    # where <source> is either "edgebox" or "region" (ground truth)
    region_dict = {}
    counter = 0
    for i in range(edgebox_regions.shape[0]):
        e_id = edgebox_id + "_%d" % counter
        region_dict[e_id] = edgebox_regions[i, :]
        counter += 1

    if gt_box:
        phrase_ids = []
        counter = 0
        for rid in region_ids:
            r_info = rid2r[str(rid)]
            r_coord = [r_info['x'], r_info['y'], r_info['width'], r_info['height']]
            r_id = region_id + "_%d" % counter
            region_dict[r_id] = np.array(r_coord)
            # generate phrase ids
            phrase_ids.append(r_info['categ_id'])
            counter += 1
    else:
        # generate phrase ids
        phrase_ids = [rid2r[str(r)]['categ_id'] for r in region_ids]

    if level == 'level_1':
        phrase_ids = level1_im2p[str(img_id)]
    elif level == 'level_2':
        phrase_ids = level2_im2p[str(img_id)]
    elif level == 'vis':
        phrase_ids = [-1]
    elif level != 'level_0':
        raise ValueError('unknown level parameter: %s' % str(level))

    # generate pairs: every phrase is paired with every candidate region
    pair_list = [(r_id, t_id) for t_id in phrase_ids for r_id in region_dict]

    return pair_list, region_dict

def get_data(img_id, level, edge_box_max, gt_box, query_phrase = None):
    image_tensor, scale, shape = get_scaled_im_tensor([img_id],
                                                      DS_CONFIG.target_size,
                                                      DS_CONFIG.max_size)
    all_rois = np.zeros((0, 5))

    # start gathering data for the testing image
    pair_list, region_dict = get_pairs_test(img_id, level, edge_box_max, gt_box)
    rois_list = [pair[0] for pair in pair_list]
    phrases_list = [pair[1] for pair in pair_list]
    unique_rois_ids, inverse_region_ids = (
            np.unique(rois_list, return_inverse = True))
    test_rois = get_scaled_roi(unique_rois_ids, region_dict,
                               scale[0], shape[0], 0)
    all_rois = np.concatenate((all_rois, test_rois))

    unique_phrase_ids, inverse_phrase_ids = (
            np.unique(phrases_list, return_inverse = True))
    phrase_tensor = get_txt_tensor(unique_phrase_ids, query_phrase)

    return (pair_list, region_dict,
            {'raw_phrase': query_phrase,
             'images': image_tensor,
             'phrases': phrase_tensor,
             'rois': all_rois,
             'phrase_ids': inverse_phrase_ids,
             'roi_ids': inverse_region_ids,
             'labels': None,
             'loss_weights': None,
             'sources': None})

def test_output(img_id, phrase2r_dict, level, output_dir):
    os.makedirs("%s/tmp_output" % output_dir, exist_ok = True)
    f = open("%s/tmp_output/%s_%d.txt" % (output_dir, level, img_id), "w+")
    f.write(str(img_id) + ":")
    for t_id in phrase2r_dict:
        f.write("\n\t%s:" % t_id)
        # output the region information
        for region in phrase2r_dict[t_id]:
            # written in the order [y1, x1, y2, x2, score]
            f.write(" [%d, %d, %d, %d, %.6f]" %
                    (region[1], region[0], region[3], region[2], region[4]))
    f.write("\n")
    f.close()

def test(net, img_id, level, output_dir, top_num = 10, gt_box = False, query_phrase = None):
    if query_phrase is not None:
        assert(level == 'vis')
    t0 = time.time()
    pair_list, region_dict, data_dict = get_data(img_id, level, top_num, gt_box, query_phrase)
    net.set_input(data_dict)
    net.forward(False, False)
    scores = net.get_output()[1]
    scores = [s[0] for s in scores]
    t1 = time.time()
    print("running the network took %f s" % (t1 - t0))

    # build a region array per phrase for nms
    phrase2r_dict = {}
    combined_region_score = [pair_list[i] + (scores[i],)
                             for i in range(len(scores))]
    for key, group in itertools.groupby(combined_region_score,
                                        operator.itemgetter(1)):
        # [x, y, w, h, score]
        regions_info = np.array([np.append(region_dict[info[0]], info[2])
                                 for info in list(group)])
        # change from [x, y, w, h] to [x1, y1, x2, y2]
        regions_info[:, 2] += regions_info[:, 0] - 1
        regions_info[:, 3] += regions_info[:, 1] - 1

        # apply nms on the top-scoring regions
        regions_info = np.array(
                sorted(regions_info, key = lambda row: row[4])[::-1])
        regions_info_nms = non_max_suppression(regions_info, 0.3)
        phrase2r_dict[key] = regions_info_nms

    t2 = time.time()
    print("running the nms took %f s" % (t2 - t1))
    if query_phrase is None:
        test_output(img_id, phrase2r_dict, level, output_dir)
    print("FINISH TESTING %s" % str(img_id))
    return phrase2r_dict

--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
"""
Details:
    load the phrase json and track the images
    generate the ground-truth region & text phrase pair dict
    images need resizing
    region id --> (x, y, w, h)
"""
import numpy as np
import os.path as osp
import cv2
import string
import json
import re
from PIL import Image
import pyx
import time
import scipy.io as sio
import scipy.ndimage.interpolation as sni
from collections import defaultdict
from config import DS_CONFIG, ENV_PATHS

#all ids are integer type
t0 = time.clock()
print('loading raw annotation data ...')
raw_data = json.load(open(ENV_PATHS.RAW_DATA, 'r'))
im2rid = raw_data['im2rid']
rid2r = raw_data['rid2r']
tid2p = raw_data['tid2p']
valid_tids = [int(k) for k in list(tid2p.keys())]

meteor = np.array(json.load(open(ENV_PATHS.METEOR, 'r')))
vocab = [c for c in string.printable if c not in string.ascii_uppercase]
VGG_MEAN = [103.939, 116.779, 123.68]
phrase_freq_temp = json.load(open(ENV_PATHS.FREQUENCY, 'r'))['freq']
phrase_freq = np.zeros(len(phrase_freq_temp))
for _, tid in enumerate(valid_tids):
    phrase_freq[tid - 1] = phrase_freq_temp[tid - 1]
total_frequency = np.sum(phrase_freq)
phrase_prob = phrase_freq / total_frequency

# split
raw_split = json.load(open(ENV_PATHS.SPLIT, 'r'))
train_ids = raw_split['train']
test_ids = raw_split['test']
val_ids = raw_split['val']
print('raw annotation data loaded in %f s' % (time.clock() - t0))

#boxes are [x, y, width, height]
def IoU(region_current, object_current):
    #intersection rectangle
    x_left = max(region_current[0], object_current[0])
    y_left = max(region_current[1], object_current[1])
    x_right = min(region_current[0] + region_current[2],
                  object_current[0] + object_current[2])
    y_right = min(region_current[1] + region_current[3],
                  object_current[1] + object_current[3])

    if x_right <= x_left or y_right <= y_left:
        intersection = 0
    else:
        intersection = (x_right - x_left) * (y_right - y_left)
    union = (region_current[2] * region_current[3] +
             object_current[2] * object_current[3] - intersection)

    return 1.0 * intersection / union

def get_edgeboxes(img_id):
    raw_boxes = sio.loadmat(osp.join(ENV_PATHS.EDGEBOX_PATH, str(img_id) + '.mat'))['bbs']
    #keep the top-ranked proposals, then add a random sample of the rest
    chosen_boxes = np.zeros((0, 4))
    chosen_boxes = np.concatenate((chosen_boxes,
            raw_boxes[0: DS_CONFIG.edge_box_high_rank_num, 0:4]))
    rand_ids = (np.random.permutation(
            raw_boxes.shape[0] -
            DS_CONFIG.edge_box_high_rank_num)[0: DS_CONFIG.edge_box_random_num]
            + DS_CONFIG.edge_box_high_rank_num)
    chosen_boxes = np.concatenate((chosen_boxes, raw_boxes[rand_ids, 0:4]))
    return chosen_boxes

def get_label_same_img(img_id):
    #(region_id, t_id): label <0|1|2>, where 0 = negative, 1 = positive, and
    #2 = ambiguous (ambiguous pairs are excluded from the training pairs)
    pair2dict = {}
    ambiguous = {}
    region_ids = im2rid[str(img_id)]
    edgebox_regions = get_edgeboxes(img_id)

    #get a local dictionary for all regions
    local_region_dict = {}
    counter = 0
    for i in range(edgebox_regions.shape[0]):
        local_region_dict[counter] = edgebox_regions[i, :]
        counter += 1
    for rid in region_ids:
        r_info = rid2r[str(rid)]
        r_coord = [r_info['x'], r_info['y'], r_info['width'], r_info['height']]
        local_region_dict[counter] = np.array(r_coord)
        counter += 1

    #get labels based on IoU
    for r1 in local_region_dict:
        r1_coord = local_region_dict[r1]
        ambiguous[r1] = []
        for r2 in region_ids:
            t = rid2r[str(r2)]['categ_id']
            r2_info = rid2r[str(r2)]
            r2_coord = [r2_info['x'], r2_info['y'],
                        r2_info['width'], r2_info['height']]
            iou = IoU(r1_coord, r2_coord)
            if iou <= DS_CONFIG.thre_neg:
                #always keep the label of the maximum iou
                if (pair2dict.get((r1, t)) is not None and
                        (pair2dict[(r1, t)] == 1 or pair2dict[(r1, t)] == 2)):
                    continue
                pair2dict[(r1, t)] = 0
            elif iou >= DS_CONFIG.thre_pos:
                if t in ambiguous[r1]:
                    ambiguous[r1].remove(t)
                pair2dict[(r1, t)] = 1
            else:
                #always keep the label of the maximum iou
                if (pair2dict.get((r1, t)) is not None and
                        pair2dict[(r1, t)] == 1):
                    continue
                pair2dict[(r1, t)] = 2
                ambiguous[r1].append(t)

    return pair2dict, local_region_dict, ambiguous

def get_label_diff_img(img_id, ambiguous):
    pair2dict = {}
    region_ids = im2rid[str(img_id)]
    #randomly sample phrases that do not appear in the given image
    t_ids_in_image = [rid2r[str(r_id)]['categ_id'] for r_id in region_ids]
    rand_t_ids = []
    check_next_tid = 0
    temp_rand = (np.random.choice(len(phrase_prob),
            int(1.1 * DS_CONFIG.text_rand_sample_size),
            p = phrase_prob, replace = True) + 1)
    while (len(rand_t_ids) < DS_CONFIG.text_rand_sample_size
           and check_next_tid < len(temp_rand)):
        if temp_rand[check_next_tid] not in t_ids_in_image:
            rand_t_ids.append(temp_rand[check_next_tid])
        check_next_tid += 1
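    #temp_rand oversamples by 10% so that, after rejecting phrase ids that
    #already appear in this image, there are usually still
    #text_rand_sample_size ids left; if the pool is exhausted early, fewer
    #cross-image negatives are used for this image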

    for t_id in rand_t_ids:
        for r_id in ambiguous:
            current_ambiguous = ambiguous[r_id]
            pair2dict[(r_id, t_id)] = 0
            for a_id in current_ambiguous:
                #meteor is an upper triangular matrix of phrase similarities
                if max(t_id, a_id) - 1 >= meteor.shape[0]:
                    continue
                elif meteor[min(t_id, a_id) - 1,
                            max(t_id, a_id) - 1] > DS_CONFIG.meteor_thred:
                    pair2dict[(r_id, t_id)] = 2
                    break
    return pair2dict

def get_label_together(img_id):
    pair2dict_same, local_region_dict, ambiguous = get_label_same_img(img_id)
    pair2dict_diff = get_label_diff_img(img_id, ambiguous)
    data_book = []
    #(region_id, categ_id, label, loss_weights, category)
    for b_id, current_p2d in enumerate([pair2dict_same, pair2dict_diff]):
        for k in current_p2d:
            if current_p2d[k] == 1:
                data_book.append([k[0], k[1], 1, DS_CONFIG.pos_loss_weight, 1])
            elif current_p2d[k] == 0 and b_id == 0: #same-image negative
                data_book.append([k[0], k[1], 0, DS_CONFIG.neg_loss_weight, 0])
            elif current_p2d[k] == 0 and b_id == 1: #diff-image negative
                data_book.append([k[0], k[1], 0, DS_CONFIG.rest_loss_weight, 2])
    return np.array(data_book), local_region_dict

# input: a batch of image ids
def get_scaled_im_tensor(img_ids, target_size, max_size):
    images = []
    scales = []
    img_shapes = []
    max_w = -1
    max_h = -1
    # load each image
    for img_id in img_ids:
        im_path = osp.join(ENV_PATHS.IMAGE_PATH, str(img_id) + '.jpg')
        try:
            img = cv2.imread(im_path).astype('float')
        except AttributeError:
            # cv2.imread returns None for a missing file; fall back to
            # treating img_id itself as a path
            img = cv2.imread(img_id).astype('float')
        img_shapes.append([img.shape[1], img.shape[0]]) #(limit_x, limit_y)
        # calculate the scale: the short side is resized to target_size,
        # unless that would push the long side beyond max_size
        old_short = min(img.shape[0: 2])
        old_long = max(img.shape[0: 2])
        new_scale = 1.0 * target_size / old_short
        if old_long * new_scale > max_size:
            new_scale = 1.0 * max_size / old_long
        # subtract the mean from the image
        img[:, :, 0] = img[:, :, 0] - VGG_MEAN[0]
        img[:, :, 1] = img[:, :, 1] - VGG_MEAN[1]
        img[:, :, 2] = img[:, :, 2] - VGG_MEAN[2]
        # scale the image
        img = cv2.resize(img, None, fx = new_scale, fy = new_scale,
                         interpolation = cv2.INTER_LINEAR)
        images.append(img)
        scales.append([new_scale, new_scale])
        # track the max shape
        if img.shape[0] > max_h:
            max_h = img.shape[0]
        if img.shape[1] > max_w:
            max_w = img.shape[1]
    # zero-pad every image to the max size in the batch
    for idx, img in enumerate(images):
        resize_h = max_h - img.shape[0]
        resize_w = max_w - img.shape[1]
        images[idx] = cv2.copyMakeBorder(img, 0, resize_h, 0, resize_w,
                cv2.BORDER_CONSTANT, value = (0, 0, 0))

    return np.array(images), np.array(scales), np.array(img_shapes)

def get_txt_tensor(phrase_ids, phrases = None):
    if phrases is None:
        phrases = [tid2p[str(int(id))] for id in phrase_ids]
    else:
        assert(phrase_ids[0] == -1 and len(phrase_ids) == 1)
    tensor = np.zeros([len(phrases), 1,
                       DS_CONFIG.text_tensor_sequence_length, len(vocab)])
    for idx, line in enumerate(phrases):
        # strip non-ascii characters
        line = line.encode('ascii', errors = 'ignore')
        line = line.decode('utf-8')
        if line[-1] != '.':
            line = line + ' .'
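        # the remaining steps canonicalize the phrase: collapse repeated
        # spaces, lowercase, and keep only characters in the vocabulary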
        line = re.sub(' +', ' ', line)
        line = line.lower()
        line = [char for char in line if char in vocab]
        line = ''.join(line)
        #repeat the phrase until it reaches the fixed sequence length
        while len(line) < DS_CONFIG.text_tensor_sequence_length:
            line = line + ' ' + line
        #one-hot encode the first sequence_length characters
        for i in range(DS_CONFIG.text_tensor_sequence_length):
            tensor[idx, 0, i, vocab.index(line[i])] = 1
    return tensor

#scale: [x_scale, y_scale]
#shape: [limit_x, limit_y]
#return: [xmin, ymin, xmax, ymax]
#use this to decode a local roi dict
def get_scaled_roi(roi_ids, roi_dict, scale, shape, batch_idx, area_thred = 49):
    rois = []
    for idx in roi_ids:
        roi_coor = roi_dict[idx]
        #skip tiny rois
        if roi_coor[2] * roi_coor[3] < area_thred:
            continue
        #1-based [x, y, w, h] -> 0-based [xmin, ymin, xmax, ymax]
        temp_roi = [roi_coor[0] - 1, roi_coor[1] - 1,
                    roi_coor[0] + roi_coor[2] - 2, roi_coor[1] + roi_coor[3] - 2]
        rois.append([batch_idx, temp_roi[0] * scale[0],
                     temp_roi[1] * scale[1],
                     temp_roi[2] * scale[0],
                     temp_roi[3] * scale[1]])
    return np.array(rois)

#get all needed data from a batch of images
#image_tensor: [batch_size, width, height, 3]
#phrase_tensor: [num_phrases, 1, sequence_length, vocab_size]
#rois: [batch_idx, xmin, ymin, xmax, ymax]
#pair: [rois_idx, phrase_idx]
#labels|loss_weights: same length as pair
def get_data(img_ids):
    image_tensor, scales, shapes = get_scaled_im_tensor(img_ids, DS_CONFIG.target_size,
                                                        DS_CONFIG.max_size)
    all_labels = np.zeros((0,))
    all_sources = np.zeros((0,))
    all_loss_weights = np.zeros((0,))
    inverse_region_ids = np.zeros((0,))
    all_rois = np.zeros((0, 5))
    phrases_accumulate = np.zeros((0,))
    #unique roi indices are computed per image,
    #so they must be offset when used globally
    unique_roi_index_offset = 0
    for idx, img_id in enumerate(img_ids):
        current_data_book, current_region_dict = get_label_together(img_id)
        current_unique_rois_ids, current_inverse_ids = (
                np.unique(current_data_book[:, 0], return_inverse = True))
        current_rois = get_scaled_roi(current_unique_rois_ids,
                                      current_region_dict,
                                      scales[idx, :], shapes[idx, :], idx)
        all_rois = np.concatenate((all_rois, current_rois))
        inverse_region_ids = np.concatenate((inverse_region_ids,
                current_inverse_ids + unique_roi_index_offset))
        unique_roi_index_offset += current_rois.shape[0]
        #phrase ids are unique globally
        phrases_accumulate = np.concatenate((phrases_accumulate,
                current_data_book[:, 1]))
        all_labels = np.concatenate((all_labels, current_data_book[:, 2]))
        all_loss_weights = np.concatenate((all_loss_weights,
                current_data_book[:, 3]))
        all_sources = np.concatenate((all_sources, current_data_book[:, 4]))
    #get the phrase tensor
    unique_phrase_ids, inverse_phrase_ids = np.unique(phrases_accumulate,
                                                      return_inverse = True)
    phrase_tensor = get_txt_tensor(unique_phrase_ids)

    assert inverse_phrase_ids.shape[0] == inverse_region_ids.shape[0]
    assert inverse_phrase_ids.shape[0] == all_labels.shape[0]

    return {'image_ids': img_ids, #for tracking and debugging
            'unique_phrase_ids': unique_phrase_ids, #for tracking and debugging
            'images': image_tensor,
            'phrases': phrase_tensor,
            'rois': all_rois,
            'phrase_ids': inverse_phrase_ids,
            'roi_ids': inverse_region_ids,
            'labels': all_labels,
            'loss_weights': all_loss_weights,
            'sources': all_sources}

def visualize(im_idx, phrase2ranked, visual_num, phrase, save_path):
    #read in the image
    try:
        image = Image.open(osp.join(ENV_PATHS.IMAGE_PATH, str(im_idx) + '.jpg'))
    except IOError:
        #fall back to treating im_idx itself as a path
        image = Image.open(im_idx)
    im_w, im_h = image.size
    ratio = 0.3
    pyx.text.set(mode = "latex")
    pyx.text.preamble(r"\renewcommand{\familydefault}{\sfdefault}")
    canv = pyx.canvas.canvas()
    canv.insert(pyx.bitmap.bitmap(0, 0, image, width = ratio * im_w, height = ratio * im_h))
    assert(len(phrase2ranked) == 1)
    ranked = list(phrase2ranked.values())[0]
    for i in range(visual_num):
        (x1, y1, x2, y2, s) = ranked[i]
        w = int(x2 - x1)
        h = int(y2 - y1)
        #pyx uses a bottom-left origin, so the y coordinates are flipped
        canv.stroke(pyx.path.rect(ratio * x1, ratio * (im_h - y2), ratio * w, ratio * h),
                    [pyx.style.linewidth(1.0), pyx.color.rgb.red])
        #insert a score tab for each bbox
        pyx.unit.set(xscale = 3)
        tbox = pyx.text.text(ratio * x1, ratio * (im_h - y1), '[%f]:%s' % (s, phrase), [pyx.text.size.Huge])
        tpath = tbox.bbox().enlarged(3 * pyx.unit.x_pt).path()
        canv.draw(tpath, [pyx.deco.filled([pyx.color.cmyk.Yellow]), pyx.deco.stroked()])
        canv.insert(tbox)

    canv.writePDFfile(save_path)

if __name__ == '__main__':
    #utils.py defines no standalone entry point; it is imported by
    #main.py and test.py
    pass
--------------------------------------------------------------------------------
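
For reference, here is a minimal sketch (not a file in this repository) of how the standalone `non_max_suppression` helper in `test.py` behaves on synthetic boxes. Note that importing `test` runs its module-level `json.load` calls, so the annotation files configured in `config.py` must already be in place:

    import numpy as np
    from test import non_max_suppression

    # two heavily overlapping boxes and one distant box,
    # each in the format [x1, y1, x2, y2, score]
    boxes = np.array([[ 10.,  10.,  60.,  60., 0.9],
                      [ 12.,  12.,  62.,  62., 0.8],
                      [100., 100., 150., 150., 0.7]])

    # with an overlap threshold of 0.3, the lower-scoring overlapping
    # box (score 0.8) is suppressed; the 0.9 and 0.7 boxes survive
    kept = non_max_suppression(boxes, 0.3)
    print(kept)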