├── Dockerfile ├── LICENSE ├── README.md ├── calculate_similarity.py ├── datasets ├── __init__.py ├── activity_net.pickle ├── cc_web_video.pickle ├── evve.pickle └── fivr.pickle ├── evaluation.py ├── examples ├── video1.gif └── video2.gif ├── model ├── __init__.py ├── layers.py ├── nets │ ├── __init__.py │ ├── i3d.py │ ├── resnet_utils.py │ ├── resnet_v1.py │ └── vgg_preprocessing.py ├── similarity.py └── visil.py ├── requirements.txt └── video_similarity.png /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM tensorflow/tensorflow:1.15.0-gpu-py3 2 | 3 | WORKDIR / 4 | 5 | RUN apt-get update --fix-missing && \ 6 | apt-get clean && \ 7 | apt-get install -y libsm6 libxext6 libxrender-dev libgl1-mesa-glx git wget 8 | 9 | RUN git clone --depth 1 https://github.com/MKLab-ITI/visil /visil 10 | 11 | RUN wget http://ndd.iti.gr/visil/ckpt.zip 12 | 13 | RUN unzip ckpt.zip -d /visil 14 | 15 | RUN python -m pip install --upgrade pip 16 | 17 | RUN python -m pip install --upgrade numpy tqdm>=4.2 opencv-python>=3.1.0 18 | 19 | RUN python -m pip install --upgrade tensorflow-probability==0.7 dm-sonnet==1.25 20 | 21 | WORKDIR /visil 22 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. 
For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright 2019 Giorgos Kordopatis-Zilos. All rights reserved. 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ViSiL: Fine-grained Spatio-Temporal Video Similarity Learning 2 | This repository contains the Tensorflow implementation of the paper 3 | [ViSiL: Fine-grained Spatio-Temporal Video Similarity Learning](http://openaccess.thecvf.com/content_ICCV_2019/papers/Kordopatis-Zilos_ViSiL_Fine-Grained_Spatio-Temporal_Video_Similarity_Learning_ICCV_2019_paper.pdf). 
4 | It provides code for the calculation of similarities between the query and database videos given by the user. 5 | Also, it contains an evaluation script to reproduce the results of the paper. The video similarity calculation 6 | is achieved by applying a frame-to-frame function that respects the spatial within-frame structure of videos and 7 | a learned video-to-video similarity function that also considers the temporal structure of videos. 8 | 9 | The PyTorch implementation of ViSiL can be found [here](https://github.com/MKLab-ITI/visil/tree/pytorch). 10 | 11 | 12 | 13 | ## Prerequisites 14 | * Python 3 15 | * Tensorflow 1.xx (tested with 1.8-1.15) 16 | 17 | ## Getting started 18 | 19 | ### Installation 20 | 21 | * Clone this repo: 22 | ```bash 23 | git clone https://github.com/MKLab-ITI/visil 24 | cd visil 25 | ``` 26 | * You can install all the dependencies by running: 27 | ```bash 28 | pip install -r requirements.txt 29 | ``` 30 | 31 | * Download and unzip the pretrained model: 32 | ```bash 33 | wget http://ndd.iti.gr/visil/ckpt.zip 34 | unzip ckpt.zip 35 | ``` 36 | 37 | * If you want to use I3D as the backbone network (used for AVR in the paper), then install the following packages: 38 | ```bash 39 | # For tensorflow version >= 1.14 40 | pip install tensorflow-probability==0.7 dm-sonnet==1.25 41 | 42 | # For tensorflow version < 1.14 43 | pip install tensorflow-probability==0.6 dm-sonnet==1.23 44 | ``` 45 | 46 | ### Video similarity calculation 47 | * Create a file that contains the query videos. 48 | Each line of the file has to contain a video id and a path to the corresponding video file, 49 | separated by a tab character (\\t). Example: 50 | 51 | wrC_Uqk3juY queries/wrC_Uqk3juY.mp4 52 | k_NT43aJ_Jw queries/k_NT43aJ_Jw.mp4 53 | 2n30dbPBNKE queries/2n30dbPBNKE.mp4 54 | ... 55 | 56 | 57 | * Create a file with the same format for the database videos. 58 | 59 | * Run the following command to calculate the similarity between all the query and database videos: 60 | ```bash 61 | python calculate_similarity.py --query_file queries.txt --database_file database.txt --model_dir model/ 62 | ``` 63 | 64 | * For faster processing, you can load the query videos to the GPU memory by adding the flag `--load_queries` 65 | ```bash 66 | python calculate_similarity.py --query_file queries.txt --database_file database.txt --model_dir model/ --load_queries 67 | ``` 68 | 69 | * The calculated similarities are stored in the file given by the `--output_file` argument. The file is in JSON format and 70 | contains a dictionary with every query id as key, and as value another dictionary that maps the ids of the database 71 | videos to their similarities to the corresponding query. See the example below (a short example of reading this file is given in the appendix at the end of this README): 72 | ```json 73 | { 74 | "wrC_Uqk3juY": { 75 | "KQh6RCW_nAo": 0.716, 76 | "0q82oQa3upE": 0.300, 77 | ...}, 78 | "k_NT43aJ_Jw": { 79 | "-KuR8y1gjJQ": 1.0, 80 | "Xb19O5Iur44": 0.417, 81 | ...}, 82 | .... 83 | } 84 | ``` 85 | 86 | 87 | * Add the flag `--help` to display a detailed description of the arguments of the similarity calculation script 88 | 89 | ``` 90 | -q, --query_file QUERY_FILE Path to file that contains the query videos 91 | -d, --database_file DATABASE_FILE Path to file that contains the database videos 92 | -o, --output_file OUTPUT_FILE Name of the output file. Default: "results.json" 93 | --network NETWORK Backbone network used for feature extraction. 94 | Options: "resnet" or "i3d". Default: "resnet" 95 | --model_dir MODEL_DIR Path to the directory of the pretrained model.
96 | Default: "ckpt/resnet" 97 | -s, --similarity_function SIMILARITY_FUNCTION Function that will be used to calculate the 98 | similarity between query-candidate frames and 99 | videos. Options: "chamfer" or "symmetric_chamfer". 100 | Default: "chamfer" 101 | --batch_sz BATCH_SZ Number of frames contained in each batch during 102 | feature extraction. Default: 128 103 | --gpu_id GPU_ID Id of the GPU used. Default: 0 104 | -l, --load_queries Flag that indicates that the queries will be loaded to 105 | the GPU memory. 106 | --threads THREADS Number of threads used for video loading. Default: 8 107 | ``` 108 | 109 | ### Evaluation 110 | * We also provide code to reproduce the experiments in the paper. 111 | 112 | * First, download the videos of the dataset you want. The supported options are: 113 | * [CC_WEB_VIDEO](http://vireo.cs.cityu.edu.hk/webvideo/) - Near-Duplicate Video Retrieval 114 | * [FIVR-5K, FIVR-200K](http://ndd.iti.gr/fivr/) - Fine-grained Incident Video Retrieval 115 | * [EVVE](http://pascal.inrialpes.fr/data/evve/) - Event-based Video Retrieval 116 | * [ActivityNet](http://activity-net.org/) - Action Video Retrieval 117 | 118 | * Determine the pattern, based on the video id, with which the source videos are stored. For example, 119 | if all dataset videos are stored in a folder with the video id as filename and the extension `.mp4`, 120 | then the pattern is `{id}.mp4`. If each dataset video is stored in a separate folder named after its 121 | video id with filename `video.mp4`, then the pattern is `{id}/video.mp4`. 122 | * The code replaces the `{id}` string with the id of each video in the dataset. 123 | * Also, it supports Unix-style pathname pattern expansion. For example, if the video files have 124 | various extensions, then the pattern can be e.g. `{id}/video.*` 125 | * For FIVR-200K, EVVE, ActivityNet, the YouTube ids are used as the video ids 126 | * For CC_WEB_VIDEO, video ids derive from the number of the query set that the video belongs to, 127 | and the basename of the file. In particular, the video ids are in the form `<query_set>/<basename>`, e.g. `1/1_1_Y` 128 | 129 | * Run `evaluation.py`, providing the name of the evaluation dataset, the path to the video files, 130 | and the pattern with which the videos are stored: 131 | ```bash 132 | python evaluation.py --dataset FIVR-5K --video_dir /path/to/videos/ --pattern {id}/video.* --load_queries 133 | ``` 134 | 135 | ### Use ViSiL in your Python code 136 | 137 | Here is a toy example to run ViSiL on any data. 138 | 139 | ```python 140 | from model.visil import ViSiL 141 | from datasets import load_video 142 | 143 | # Load the two videos from the video files 144 | query_video = load_video('/path/to/query/video') 145 | target_video = load_video('/path/to/target/video') 146 | 147 | # Initialize ViSiL model and load pre-trained weights 148 | model = ViSiL('ckpt/resnet/') 149 | 150 | # Extract features of the two videos 151 | query_features = model.extract_features(query_video, batch_sz=32) 152 | target_features = model.extract_features(target_video, batch_sz=32) 153 | 154 | # Calculate similarity between the two videos 155 | similarity = model.calculate_video_similarity(query_features, target_features) 156 | ``` 157 | 158 | ## Docker 159 | Thanks to [@theycallmeloki](https://github.com/theycallmeloki) for providing a 160 | [Dockerfile](https://github.com/MKLab-ITI/visil/blob/master/Dockerfile) to set up a docker container for the repo. 161 | 162 | * First, build a docker image based on the Dockerfile: 163 | ```bash 164 | docker build -t visil:latest .
165 | ``` 166 | 167 | * Start a docker container based on the created docker image 168 | ```bash 169 | docker run -it --gpus all --name ViSiL visil:latest 170 | ``` 171 | 172 | ## Visualization 173 | To visualize similarity matrices and the ViSiL outputs, you may use 174 | [this Colab notebook](https://colab.research.google.com/drive/1XwkQpXrpyr7jjq3xCL7anASBjNfNCnkn). 175 | 176 | ## Citation 177 | If you use this code for your research, please consider citing our paper: 178 | ```bibtex 179 | @inproceedings{kordopatis2019visil, 180 | title={{ViSiL}: Fine-grained Spatio-Temporal Video Similarity Learning}, 181 | author={Kordopatis-Zilos, Giorgos and Papadopoulos, Symeon and Patras, Ioannis and Kompatsiaris, Ioannis}, 182 | booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision}, 183 | year={2019} 184 | } 185 | ``` 186 | ## Related Projects 187 | 188 | **[DnS](https://github.com/mever-team/distill-and-select)** - improved performance and better computational efficiency 189 | 190 | **[FIVR-200K](https://github.com/MKLab-ITI/FIVR-200K)** - download our FIVR-200K dataset 191 | 192 | ## License 193 | This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details 194 | 195 | ## Contact for further details about the project 196 | 197 | Giorgos Kordopatis-Zilos (georgekordopatis@iti.gr)
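## Appendix: reading the output file

As mentioned in the "Video similarity calculation" section above, `calculate_similarity.py` writes its results to a JSON file. The snippet below is a minimal sketch (not part of the repository scripts) of how this file can be loaded and turned into a ranked list of database videos per query; the file name `results.json` is the script's default, and the top-5 cutoff is an arbitrary choice for illustration.

```python
import json

# Load the similarities produced by calculate_similarity.py
# (assumes the default output file name "results.json")
with open('results.json') as f:
    similarities = json.load(f)

# For each query id, sort the database videos by decreasing similarity
# and keep the top-5 results (the cutoff is arbitrary, for illustration)
for query_id, sims in similarities.items():
    ranking = sorted(sims.items(), key=lambda x: x[1], reverse=True)[:5]
    print(query_id, ranking)
```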
198 | -------------------------------------------------------------------------------- /calculate_similarity.py: -------------------------------------------------------------------------------- 1 | import json 2 | import argparse 3 | import tensorflow as tf 4 | 5 | from tqdm import tqdm 6 | from model.visil import ViSiL 7 | from datasets import VideoGenerator 8 | 9 | if __name__ == '__main__': 10 | parser = argparse.ArgumentParser() 11 | parser.add_argument('-q', '--query_file', type=str, required=True, 12 | help='Path to file that contains the query videos') 13 | parser.add_argument('-d', '--database_file', type=str, required=True, 14 | help='Path to file that contains the database videos') 15 | parser.add_argument('-o', '--output_file', type=str, default='results.json', 16 | help='Name of the output file. Default: \"results.json\"') 17 | parser.add_argument('-n', '--network', type=str, default='resnet', 18 | help='Backbone network used for feature extraction. ' 19 | 'Options: \"resnet\" or \"i3d\". Default: \"resnet\"') 20 | parser.add_argument('-m', '--model_dir', type=str, default='ckpt/resnet', 21 | help='Path to the directory of the pretrained model. Default: \"ckpt/resnet\"') 22 | parser.add_argument('-s', '--similarity_function', type=str, default='chamfer', 23 | help='Function that will be used to calculate similarity ' 24 | 'between query-target frames and videos.' 25 | 'Options: \"chamfer\" or \"symmetric_chamfer\". Default: \"chamfer\"') 26 | parser.add_argument('-b', '--batch_sz', type=int, default=128, 27 | help='Number of frames contained in each batch during feature extraction. Default: 128') 28 | parser.add_argument('-g', '--gpu_id', type=int, default=0, 29 | help='Id of the GPU used. Default: 0') 30 | parser.add_argument('-l', '--load_queries', action='store_true', 31 | help='Flag that indicates that the queries will be loaded to the GPU memory.') 32 | parser.add_argument('-t', '--threads', type=int, default=8, 33 | help='Number of threads used for video loading. 
Default: 8') 34 | args = parser.parse_args() 35 | 36 | # Create a video generator for the queries 37 | enqueuer = tf.keras.utils.OrderedEnqueuer(VideoGenerator(args.query_file, all_frames='i3d' in args.network), 38 | use_multiprocessing=True, shuffle=False) 39 | enqueuer.start(workers=args.threads, max_queue_size=args.threads*2) 40 | generator = enqueuer.get() 41 | 42 | # Initialize ViSiL model 43 | model = ViSiL(args.model_dir, net=args.network, 44 | load_queries=args.load_queries, gpu_id=args.gpu_id, 45 | similarity_function=args.similarity_function, 46 | queries_number=len(enqueuer.sequence) if args.load_queries else None) 47 | 48 | # Extract features of the queries 49 | queries, queries_ids = [], [] 50 | pbar = tqdm(range(len(enqueuer.sequence))) 51 | for _ in pbar: 52 | frames, video_id = next(generator) 53 | features = model.extract_features(frames, args.batch_sz) 54 | queries.append(features) 55 | queries_ids.append(video_id) 56 | pbar.set_postfix(query_id=video_id) 57 | enqueuer.stop() 58 | model.set_queries(queries) 59 | 60 | # Create a video generator for the database video 61 | enqueuer = tf.keras.utils.OrderedEnqueuer(VideoGenerator(args.database_file, all_frames='i3d' in args.network), 62 | use_multiprocessing=True, shuffle=False) 63 | enqueuer.start(workers=args.threads, max_queue_size=args.threads*2) 64 | generator = enqueuer.get() 65 | 66 | # Calculate similarities between the queries and the database videos 67 | similarities = dict({query: dict() for query in queries_ids}) 68 | pbar = tqdm(range(len(enqueuer.sequence))) 69 | for _ in pbar: 70 | frames, video_id = next(generator) 71 | if frames.shape[0] > 1: 72 | features = model.extract_features(frames, args.batch_sz) 73 | sims = model.calculate_similarities_to_queries(features) 74 | for i, s in enumerate(sims): 75 | similarities[queries_ids[i]][video_id] = float(s) 76 | pbar.set_postfix(video_id=video_id) 77 | enqueuer.stop() 78 | 79 | # Save similarities to a json file 80 | with open(args.output_file, 'w') as f: 81 | json.dump(similarities, f, indent=1) 82 | -------------------------------------------------------------------------------- /datasets/__init__.py: -------------------------------------------------------------------------------- 1 | import os 2 | import cv2 3 | import glob 4 | import numpy as np 5 | import pickle as pk 6 | import tensorflow as tf 7 | 8 | 9 | def resize_frame(frame, desired_size): 10 | min_size = np.min(frame.shape[:2]) 11 | ratio = desired_size / float(min_size) 12 | frame = cv2.resize(frame, dsize=(0, 0), fx=ratio, fy=ratio, interpolation=cv2.INTER_CUBIC) 13 | return frame 14 | 15 | 16 | def center_crop(frame, desired_size): 17 | old_size = frame.shape[:2] 18 | top = int(np.maximum(0, (old_size[0] - desired_size)/2)) 19 | left = int(np.maximum(0, (old_size[1] - desired_size)/2)) 20 | return frame[top: top+desired_size, left: left+desired_size, :] 21 | 22 | 23 | def load_video(video, all_frames=False): 24 | cv2.setNumThreads(3) 25 | cap = cv2.VideoCapture(video) 26 | fps = cap.get(cv2.CAP_PROP_FPS) 27 | if fps > 144 or fps is None: 28 | fps = 25 29 | frames = [] 30 | count = 0 31 | while cap.isOpened(): 32 | ret = cap.grab() 33 | if int(count % round(fps)) == 0 or all_frames: 34 | ret, frame = cap.retrieve() 35 | if isinstance(frame, np.ndarray): 36 | frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) 37 | frames.append(center_crop(resize_frame(frame, 256), 256)) 38 | else: 39 | break 40 | count += 1 41 | cap.release() 42 | return np.array(frames) 43 | 44 | 45 | class 
VideoGenerator(tf.keras.utils.Sequence): 46 | def __init__(self, video_file, all_frames=False): 47 | super(VideoGenerator, self).__init__() 48 | self.videos = np.loadtxt(video_file, dtype=str) 49 | self.videos = np.expand_dims(self.videos, axis=0) if self.videos.ndim == 1 else self.videos 50 | self.all_frames = all_frames 51 | 52 | def __len__(self): 53 | return len(self.videos) 54 | 55 | def __getitem__(self, index): 56 | return load_video(self.videos[index][1], all_frames=self.all_frames), self.videos[index][0] 57 | 58 | 59 | class DatasetGenerator(tf.keras.utils.Sequence): 60 | def __init__(self, rootDir, videos, pattern, all_frames=False): 61 | super(DatasetGenerator, self).__init__() 62 | self.rootDir = rootDir 63 | self.videos = videos 64 | self.pattern = pattern 65 | self.all_frames = all_frames 66 | 67 | def __len__(self): 68 | return len(self.videos) 69 | 70 | def __getitem__(self, index): 71 | video = glob.glob(os.path.join(self.rootDir, self.pattern.replace('{id}', self.videos[index]))) 72 | if not len(video): 73 | print('[WARNING] Video not found: ', self.videos[index]) 74 | return np.array([]), None 75 | else: 76 | return load_video(video[0], all_frames=self.all_frames), self.videos[index] 77 | 78 | 79 | class CC_WEB_VIDEO(object): 80 | 81 | def __init__(self): 82 | with open('datasets/cc_web_video.pickle', 'rb') as f: 83 | dataset = pk.load(f) 84 | self.database = dataset['index'] 85 | self.queries = dataset['queries'] 86 | self.ground_truth = dataset['ground_truth'] 87 | self.excluded = dataset['excluded'] 88 | 89 | def get_queries(self): 90 | return self.queries 91 | 92 | def get_database(self): 93 | return list(map(str, self.database.keys())) 94 | 95 | def calculate_mAP(self, similarities, all_videos=False, clean=False, positive_labels='ESLMV'): 96 | mAP = 0.0 97 | for query_set, labels in enumerate(self.ground_truth): 98 | query_id = self.queries[query_set] 99 | i, ri, s = 0.0, 0.0, 0.0 100 | if query_id in similarities: 101 | res = similarities[query_id] 102 | for video_id in sorted(res.keys(), key=lambda x: res[x], reverse=True): 103 | video = self.database[video_id] 104 | if (all_videos or video in labels) and (not clean or video not in self.excluded[query_set]): 105 | ri += 1 106 | if video in labels and labels[video] in positive_labels: 107 | i += 1.0 108 | s += i / ri 109 | positives = np.sum([1.0 for k, v in labels.items() if 110 | v in positive_labels and (not clean or k not in self.excluded[query_set])]) 111 | mAP += s / positives 112 | return mAP / len(set(self.queries).intersection(similarities.keys())) 113 | 114 | def evaluate(self, similarities, all_db=None): 115 | if all_db is None: 116 | all_db = self.database 117 | 118 | print('=' * 5, 'CC_WEB_VIDEO Dataset', '=' * 5) 119 | not_found = len(set(self.queries) - similarities.keys()) 120 | if not_found > 0: 121 | print('[WARNING] {} queries are missing from the results and will be ignored'.format(not_found)) 122 | print('Queries: {} videos'.format(len(similarities))) 123 | print('Database: {} videos'.format(len(all_db))) 124 | 125 | print('-' * 25) 126 | print('All dataset') 127 | print('CC_WEB mAP: {:.4f}\nCC_WEB* mAP: {:.4f}\n'.format( 128 | self.calculate_mAP(similarities, all_videos=False, clean=False), 129 | self.calculate_mAP(similarities, all_videos=True, clean=False))) 130 | 131 | print('Clean dataset') 132 | print('CC_WEB mAP: {:.4f}\nCC_WEB* mAP: {:.4f}'.format( 133 | self.calculate_mAP(similarities, all_videos=False, clean=True), 134 | self.calculate_mAP(similarities, all_videos=True, clean=True))) 
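# NOTE: the evaluate() methods of the dataset classes in this module expect `similarities`
# to be a nested dictionary of the form {query_id: {database_video_id: similarity_score}},
# with higher scores meaning more similar, i.e. the structure built by
# calculate_similarity.py and evaluation.py.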
135 | 136 | 137 | class FIVR(object): 138 | 139 | def __init__(self, version='200k'): 140 | self.version = version 141 | with open('datasets/fivr.pickle', 'rb') as f: 142 | dataset = pk.load(f) 143 | self.annotation = dataset['annotation'] 144 | self.queries = dataset[self.version]['queries'] 145 | self.database = dataset[self.version]['database'] 146 | 147 | def get_queries(self): 148 | return self.queries 149 | 150 | def get_database(self): 151 | return list(self.database) 152 | 153 | def calculate_mAP(self, query, res, all_db, relevant_labels): 154 | gt_sets = self.annotation[query] 155 | query_gt = set(sum([gt_sets[label] for label in relevant_labels if label in gt_sets], [])) 156 | query_gt = query_gt.intersection(all_db) 157 | 158 | i, ri, s = 0.0, 0, 0.0 159 | for video in sorted(res.keys(), key=lambda x: res[x], reverse=True): 160 | if video != query and video in all_db: 161 | ri += 1 162 | if video in query_gt: 163 | i += 1.0 164 | s += i / ri 165 | return s / len(query_gt) 166 | 167 | def evaluate(self, similarities, all_db=None): 168 | if all_db is None: 169 | all_db = self.database 170 | 171 | DSVR, CSVR, ISVR = [], [], [] 172 | for query, res in similarities.items(): 173 | if query in self.queries: 174 | DSVR.append(self.calculate_mAP(query, res, all_db, relevant_labels=['ND', 'DS'])) 175 | CSVR.append(self.calculate_mAP(query, res, all_db, relevant_labels=['ND', 'DS', 'CS'])) 176 | ISVR.append(self.calculate_mAP(query, res, all_db, relevant_labels=['ND', 'DS', 'CS', 'IS'])) 177 | 178 | print('=' * 5, 'FIVR-{} Dataset'.format(self.version.upper()), '=' * 5) 179 | not_found = len(set(self.queries) - similarities.keys()) 180 | if not_found > 0: 181 | print('[WARNING] {} queries are missing from the results and will be ignored'.format(not_found)) 182 | 183 | print('Queries: {} videos'.format(len(similarities))) 184 | print('Database: {} videos'.format(len(all_db))) 185 | 186 | print('-' * 16) 187 | print('DSVR mAP: {:.4f}'.format(np.mean(DSVR))) 188 | print('CSVR mAP: {:.4f}'.format(np.mean(CSVR))) 189 | print('ISVR mAP: {:.4f}'.format(np.mean(ISVR))) 190 | 191 | 192 | class EVVE(object): 193 | 194 | def __init__(self): 195 | with open('datasets/evve.pickle', 'rb') as f: 196 | dataset = pk.load(f) 197 | self.events = dataset['annotation'] 198 | self.queries = dataset['queries'] 199 | self.database = dataset['database'] 200 | self.query_to_event = {qname: evname 201 | for evname, (queries, _, _) in self.events.items() 202 | for qname in queries} 203 | 204 | def get_queries(self): 205 | return list(self.queries) 206 | 207 | def get_database(self): 208 | return list(self.database) 209 | 210 | def score_ap_from_ranks_1(self, ranks, nres): 211 | """ Compute the average precision of one search. 212 | ranks = ordered list of ranks of true positives (best rank = 0) 213 | nres = total number of positives in dataset 214 | """ 215 | if nres == 0 or ranks == []: 216 | return 0.0 217 | 218 | ap = 0.0 219 | 220 | # accumulate trapezoids in PR-plot. 
All have an x-size of: 221 | recall_step = 1.0 / nres 222 | 223 | for ntp, rank in enumerate(ranks): 224 | # ntp = nb of true positives so far 225 | # rank = nb of retrieved items so far 226 | 227 | # y-size on left side of trapezoid: 228 | if rank == 0: 229 | precision_0 = 1.0 230 | else: 231 | precision_0 = ntp / float(rank) 232 | # y-size on right side of trapezoid: 233 | precision_1 = (ntp + 1) / float(rank + 1) 234 | ap += (precision_1 + precision_0) * recall_step / 2.0 235 | return ap 236 | 237 | def evaluate(self, similarities, all_db=None): 238 | results = {e: [] for e in self.events} 239 | if all_db is None: 240 | all_db = set(self.database).union(set(self.queries)) 241 | 242 | not_found = 0 243 | for query in self.queries: 244 | if query not in similarities: 245 | not_found += 1 246 | else: 247 | res = similarities[query] 248 | evname = self.query_to_event[query] 249 | _, pos, null = self.events[evname] 250 | if all_db: 251 | pos = pos.intersection(all_db) 252 | pos_ranks = [] 253 | 254 | ri, n_ext = 0.0, 0.0 255 | for ri, dbname in enumerate(sorted(res.keys(), key=lambda x: res[x], reverse=True)): 256 | if dbname in pos: 257 | pos_ranks.append(ri - n_ext) 258 | if dbname not in all_db: 259 | n_ext += 1 260 | 261 | ap = self.score_ap_from_ranks_1(pos_ranks, len(pos)) 262 | results[evname].append(ap) 263 | 264 | print('=' * 18, 'EVVE Dataset', '=' * 18) 265 | 266 | if not_found > 0: 267 | print('[WARNING] {} queries are missing from the results and will be ignored'.format(not_found)) 268 | print('Queries: {} videos'.format(len(similarities))) 269 | print('Database: {} videos\n'.format(len(all_db - set(self.queries)))) 270 | print('-' * 50) 271 | ap = [] 272 | for evname in sorted(self.events): 273 | queries, _, _ = self.events[evname] 274 | nq = len(queries.intersection(all_db)) 275 | ap.extend(results[evname]) 276 | print('{0: <36} '.format(evname), 'mAP = {:.4f}'.format(np.sum(results[evname]) / nq)) 277 | 278 | print('=' * 50) 279 | print('overall mAP = {:.4f}'.format(np.mean(ap))) 280 | 281 | 282 | class ActivityNet(object): 283 | 284 | def __init__(self): 285 | with open('datasets/activity_net.pickle', 'rb') as f: 286 | self.dataset = pk.load(f) 287 | 288 | def get_queries(self): 289 | return list(map(str, self.dataset.keys())) 290 | 291 | def get_database(self): 292 | return list(map(str, self.dataset.keys())) 293 | 294 | def calculate_AP(self, res, pos): 295 | i, ri, s = 0.0, 0.0, 0.0 296 | for ri, video in enumerate(sorted(res.keys(), key=lambda x: res[x], reverse=True)): 297 | if video in pos: 298 | i += 1.0 299 | s += i / (ri + 1.) 
300 | return s / len(pos) 301 | 302 | def evaluate(self, similarities, all_db=None): 303 | mAP, not_found = [], 0 304 | if all_db is None: 305 | all_db = set(self.get_database()) 306 | 307 | for query in self.dataset.keys(): 308 | if query not in similarities: 309 | not_found += 1 310 | else: 311 | pos = self.dataset[query].intersection(all_db) 312 | mAP += [self.calculate_AP(similarities[query], pos)] 313 | 314 | print('=' * 5, 'ActivityNet Dataset', '=' * 5) 315 | if not_found > 0: 316 | print('[WARNING] {} queries are missing from the results and will be ignored'.format(not_found)) 317 | print('Database: {} videos'.format(len(all_db))) 318 | 319 | print('-' * 16) 320 | print('mAP: {:.4f}'.format(np.mean(mAP))) 321 | -------------------------------------------------------------------------------- /datasets/activity_net.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MKLab-ITI/visil/0971e54fb8325fceb1bc9748ecbfe4c66e5dabd2/datasets/activity_net.pickle -------------------------------------------------------------------------------- /datasets/cc_web_video.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MKLab-ITI/visil/0971e54fb8325fceb1bc9748ecbfe4c66e5dabd2/datasets/cc_web_video.pickle -------------------------------------------------------------------------------- /datasets/evve.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MKLab-ITI/visil/0971e54fb8325fceb1bc9748ecbfe4c66e5dabd2/datasets/evve.pickle -------------------------------------------------------------------------------- /datasets/fivr.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MKLab-ITI/visil/0971e54fb8325fceb1bc9748ecbfe4c66e5dabd2/datasets/fivr.pickle -------------------------------------------------------------------------------- /evaluation.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import tensorflow as tf 3 | 4 | from tqdm import tqdm 5 | from model.visil import ViSiL 6 | from datasets import DatasetGenerator 7 | 8 | 9 | def query_vs_database(model, dataset, args): 10 | # Create a video generator for the queries 11 | enqueuer = tf.keras.utils.OrderedEnqueuer( 12 | DatasetGenerator(args.video_dir, dataset.get_queries(), args.pattern, all_frames='i3d' in args.network), 13 | use_multiprocessing=True, shuffle=False) 14 | enqueuer.start(workers=args.threads, max_queue_size=args.threads * 2) 15 | 16 | # Extract features of the queries 17 | all_db, queries, queries_ids = set(), [], [] 18 | pbar = tqdm(range(len(enqueuer.sequence))) 19 | for _ in pbar: 20 | frames, query_id = next(enqueuer.get()) 21 | if frames.shape[0] > 0: 22 | queries.append(model.extract_features(frames, batch_sz=25 if 'i3d' in args.network else args.batch_sz)) 23 | queries_ids.append(query_id) 24 | all_db.add(query_id) 25 | pbar.set_postfix(query_id=query_id) 26 | enqueuer.stop() 27 | model.set_queries(queries) 28 | 29 | # Create a video generator for the database video 30 | enqueuer = tf.keras.utils.OrderedEnqueuer( 31 | DatasetGenerator(args.video_dir, dataset.get_database(), args.pattern, all_frames='i3d' in args.network), 32 | use_multiprocessing=True, shuffle=False) 33 | enqueuer.start(workers=args.threads, max_queue_size=args.threads * 2) 34 | generator = enqueuer.get() 35 | 36 | # 
Calculate similarities between the queries and the database videos 37 | similarities = dict({query: dict() for query in queries_ids}) 38 | pbar = tqdm(range(len(enqueuer.sequence))) 39 | for _ in pbar: 40 | frames, video_id = next(generator) 41 | if frames.shape[0] > 1: 42 | features = model.extract_features(frames, batch_sz=25 if 'i3d' in args.network else args.batch_sz) 43 | sims = model.calculate_similarities_to_queries(features) 44 | all_db.add(video_id) 45 | for i, s in enumerate(sims): 46 | similarities[queries_ids[i]][video_id] = float(s) 47 | pbar.set_postfix(video_id=video_id) 48 | enqueuer.stop() 49 | 50 | dataset.evaluate(similarities, all_db) 51 | 52 | 53 | def all_vs_all(model, dataset, args): 54 | # Create a video generator for the dataset video 55 | enqueuer = tf.keras.utils.OrderedEnqueuer( 56 | DatasetGenerator(args.video_dir, dataset.get_queries(), args.pattern, all_frames='i3d' in args.network), 57 | use_multiprocessing=True, shuffle=False) 58 | enqueuer.start(workers=args.threads, max_queue_size=args.threads * 2) 59 | 60 | # Calculate similarities between all videos in the dataset 61 | all_db, similarities, features = set(), dict(), dict() 62 | pbar = tqdm(range(len(enqueuer.sequence))) 63 | for _ in pbar: 64 | frames, q = next(enqueuer.get()) 65 | if frames.shape[0] > 0: 66 | all_db.add(q) 67 | similarities[q] = dict() 68 | feat = model.extract_features(frames, batch_sz=25 if 'i3d' in args.network else args.batch_sz) 69 | for k, v in features.items(): 70 | if 'symmetric' in args.similarity_function: 71 | similarities[q][k] = similarities[k][q] = model.calculate_video_similarity(v, feat) 72 | else: 73 | similarities[k][q] = model.calculate_video_similarity(v, feat) 74 | similarities[q][k] = model.calculate_video_similarity(feat, v) 75 | features[q] = feat 76 | pbar.set_postfix(video_id=q, frames=frames.shape, features=feat.shape) 77 | enqueuer.stop() 78 | 79 | dataset.evaluate(similarities, all_db=all_db) 80 | 81 | 82 | if __name__ == '__main__': 83 | parser = argparse.ArgumentParser() 84 | parser.add_argument('-d', '--dataset', type=str, required=True, 85 | help='Name of evaluation dataset. Options: CC_WEB_ VIDEO, ' 86 | '\"FIVR-200K\", \"FIVR-5K\", \"EVVE\", \"ActivityNet\"') 87 | parser.add_argument('-v', '--video_dir', type=str, required=True, 88 | help='Path to file that contains the database videos') 89 | parser.add_argument('-p', '--pattern', type=str, required=True, 90 | help='Pattern that the videos are stored in the video directory, eg. \"{id}/video.*\" ' 91 | 'where the \"{id}\" is replaced with the video Id. Also, it supports ' 92 | 'Unix style pathname pattern expansion.') 93 | parser.add_argument('-n', '--network', type=str, default='resnet', 94 | help='Backbone network used for feature extraction. ' 95 | 'Options: \"resnet\" or \"i3d\". Default: \"resnet\"') 96 | parser.add_argument('-m', '--model_dir', type=str, default='ckpt/resnet', 97 | help='Path to the directory of the pretrained model. Default: \"ckpt/resnet\"') 98 | parser.add_argument('-s', '--similarity_function', type=str, default='chamfer', 99 | help='Function that will be used to calculate similarity' 100 | 'between query-candidate frames and videos.' 101 | 'Options: \"chamfer\" or \"symmetric_chamfer\". Default: \"chamfer\"') 102 | parser.add_argument('-b', '--batch_sz', type=int, default=128, 103 | help='Number of frames contained in each batch during feature extraction. Default: 128') 104 | parser.add_argument('-g', '--gpu_id', type=int, default=0, 105 | help='Id of the GPU used. 
Default: 0') 106 | parser.add_argument('-l', '--load_queries', action='store_true', 107 | help='Flag that indicates that the queries will be loaded to the GPU memory.') 108 | parser.add_argument('-t', '--threads', type=int, default=8, 109 | help='Number of threads used for video loading. Default: 8') 110 | args = parser.parse_args() 111 | 112 | if 'CC_WEB' in args.dataset: 113 | from datasets import CC_WEB_VIDEO 114 | dataset = CC_WEB_VIDEO() 115 | eval_function = query_vs_database 116 | elif 'FIVR' in args.dataset: 117 | from datasets import FIVR 118 | dataset = FIVR(version=args.dataset.split('-')[1].lower()) 119 | eval_function = query_vs_database 120 | elif 'EVVE' in args.dataset: 121 | from datasets import EVVE 122 | dataset = EVVE() 123 | eval_function = query_vs_database 124 | elif 'ActivityNet' in args.dataset: 125 | from datasets import ActivityNet 126 | dataset = ActivityNet() 127 | eval_function = all_vs_all 128 | else: 129 | raise Exception('[ERROR] Not supported evaluation dataset. ' 130 | 'Supported options: \"CC_WEB_ VIDEO\", \"FIVR-200K\", \"FIVR-5K\", \"EVVE\", \"ActivityNet\"') 131 | 132 | model = ViSiL(args.model_dir, net=args.network, 133 | load_queries=args.load_queries, gpu_id=args.gpu_id, 134 | similarity_function=args.similarity_function, 135 | queries_number=len(dataset.get_queries()) if args.load_queries else None) 136 | 137 | eval_function(model, dataset, args) 138 | -------------------------------------------------------------------------------- /examples/video1.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MKLab-ITI/visil/0971e54fb8325fceb1bc9748ecbfe4c66e5dabd2/examples/video1.gif -------------------------------------------------------------------------------- /examples/video2.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MKLab-ITI/visil/0971e54fb8325fceb1bc9748ecbfe4c66e5dabd2/examples/video2.gif -------------------------------------------------------------------------------- /model/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MKLab-ITI/visil/0971e54fb8325fceb1bc9748ecbfe4c66e5dabd2/model/__init__.py -------------------------------------------------------------------------------- /model/layers.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | 4 | 5 | class PCA_layer(object): 6 | 7 | def __init__(self, whitening=True, dims=None, net='resnet'): 8 | pca = np.load('ckpt/{}/pca.npz'.format(net)) 9 | with tf.variable_scope('PCA'): 10 | self.mean = tf.get_variable('mean_sift', 11 | initializer=pca['mean'], 12 | dtype=tf.float32, 13 | trainable=False) 14 | 15 | weights = pca['V'][:, :dims] 16 | if whitening: 17 | d = pca['d'][:dims] 18 | D = np.diag(1. 
/ np.sqrt(d)) 19 | weights = np.dot(D, weights.T).T 20 | 21 | self.weights = tf.get_variable('weights', 22 | initializer=weights, 23 | dtype=tf.float32, 24 | trainable=False) 25 | 26 | def __call__(self, logits): 27 | logits = logits - self.mean 28 | logits = tf.tensordot(logits, self.weights, axes=1) 29 | return logits 30 | 31 | 32 | class Attention_layer(object): 33 | 34 | def __init__(self, shape=3840): 35 | with tf.variable_scope('attention_layer'): 36 | self.context_vector = tf.get_variable('context_vector', shape=(shape, 1), 37 | dtype=tf.float32, trainable=False) 38 | 39 | def __call__(self, logits): 40 | weights = tf.tensordot(logits, self.context_vector, axes=1) / 2.0 + 0.5 41 | return tf.multiply(logits, weights), weights 42 | 43 | 44 | class Video_Comparator(object): 45 | 46 | def __init__(self): 47 | self.conv1 = tf.keras.layers.Conv2D(32, [3, 3], activation='relu') 48 | self.mpool1 = tf.keras.layers.MaxPool2D([2, 2], 2) 49 | self.conv2 = tf.keras.layers.Conv2D(64, [3, 3], activation='relu') 50 | self.mpool2 = tf.keras.layers.MaxPool2D([2, 2], 2) 51 | self.conv3 = tf.keras.layers.Conv2D(128, [3, 3], activation='relu') 52 | self.fconv = tf.keras.layers.Conv2D(1, [1, 1]) 53 | 54 | def __call__(self, sim_matrix): 55 | with tf.variable_scope('video_comparator'): 56 | sim = tf.reshape(sim_matrix, (1, tf.shape(sim_matrix)[0], tf.shape(sim_matrix)[1], 1)) 57 | sim = tf.pad(sim, [[0, 0], [1, 1], [1, 1], [0, 0]], 'SYMMETRIC') 58 | sim = self.conv1(sim) 59 | sim = self.mpool1(sim) 60 | sim = tf.pad(sim, [[0, 0], [1, 1], [1, 1], [0, 0]], 'SYMMETRIC') 61 | sim = self.conv2(sim) 62 | sim = self.mpool2(sim) 63 | sim = tf.pad(sim, [[0, 0], [1, 1], [1, 1], [0, 0]], 'SYMMETRIC') 64 | sim = self.conv3(sim) 65 | sim = self.fconv(sim) 66 | sim = tf.clip_by_value(sim, -1.0, 1.0) 67 | sim = tf.squeeze(sim, [0, 3]) 68 | return sim 69 | -------------------------------------------------------------------------------- /model/nets/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MKLab-ITI/visil/0971e54fb8325fceb1bc9748ecbfe4c66e5dabd2/model/nets/__init__.py -------------------------------------------------------------------------------- /model/nets/i3d.py: -------------------------------------------------------------------------------- 1 | # Copyright 2017 Google Inc. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================ 15 | """Inception-v1 Inflated 3D ConvNet used for Kinetics CVPR paper. 16 | 17 | The model is introduced in: 18 | 19 | Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset 20 | Joao Carreira, Andrew Zisserman 21 | https://arxiv.org/pdf/1705.07750v1.pdf. 
22 | """ 23 | 24 | from __future__ import absolute_import 25 | from __future__ import division 26 | from __future__ import print_function 27 | 28 | import sonnet as snt 29 | import tensorflow as tf 30 | 31 | 32 | class Unit3D(snt.AbstractModule): 33 | """Basic unit containing Conv3D + BatchNorm + non-linearity.""" 34 | 35 | def __init__(self, output_channels, 36 | kernel_shape=(1, 1, 1), 37 | stride=(1, 1, 1), 38 | activation_fn=tf.nn.relu, 39 | use_batch_norm=True, 40 | use_bias=False, 41 | name='unit_3d'): 42 | """Initializes Unit3D module.""" 43 | super(Unit3D, self).__init__(name=name) 44 | self._output_channels = output_channels 45 | self._kernel_shape = kernel_shape 46 | self._stride = stride 47 | self._use_batch_norm = use_batch_norm 48 | self._activation_fn = activation_fn 49 | self._use_bias = use_bias 50 | 51 | def _build(self, inputs, is_training): 52 | """Connects the module to inputs. 53 | 54 | Args: 55 | inputs: Inputs to the Unit3D component. 56 | is_training: whether to use training mode for snt.BatchNorm (boolean). 57 | 58 | Returns: 59 | Outputs from the module. 60 | """ 61 | net = snt.Conv3D(output_channels=self._output_channels, 62 | kernel_shape=self._kernel_shape, 63 | stride=self._stride, 64 | padding=snt.SAME, 65 | use_bias=self._use_bias)(inputs) 66 | if self._use_batch_norm: 67 | bn = snt.BatchNorm() 68 | net = bn(net, is_training=is_training, test_local_stats=False) 69 | if self._activation_fn is not None: 70 | net = self._activation_fn(net) 71 | return net 72 | 73 | 74 | class InceptionI3d(snt.AbstractModule): 75 | """Inception-v1 I3D architecture. 76 | 77 | The model is introduced in: 78 | 79 | Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset 80 | Joao Carreira, Andrew Zisserman 81 | https://arxiv.org/pdf/1705.07750v1.pdf. 82 | 83 | See also the Inception architecture, introduced in: 84 | 85 | Going deeper with convolutions 86 | Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, 87 | Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. 88 | http://arxiv.org/pdf/1409.4842v1.pdf. 89 | """ 90 | 91 | # Endpoints of the model in order. During construction, all the endpoints up 92 | # to a designated `final_endpoint` are returned in a dictionary as the 93 | # second return value. 94 | VALID_ENDPOINTS = ( 95 | 'Conv3d_1a_7x7', 96 | 'MaxPool3d_2a_3x3', 97 | 'Conv3d_2b_1x1', 98 | 'Conv3d_2c_3x3', 99 | 'MaxPool3d_3a_3x3', 100 | 'Mixed_3b', 101 | 'Mixed_3c', 102 | 'MaxPool3d_4a_3x3', 103 | 'Mixed_4b', 104 | 'Mixed_4c', 105 | 'Mixed_4d', 106 | 'Mixed_4e', 107 | 'Mixed_4f', 108 | 'MaxPool3d_5a_2x2', 109 | 'Mixed_5b', 110 | 'Mixed_5c', 111 | 'Logits', 112 | 'Predictions', 113 | ) 114 | 115 | def __init__(self, num_classes=400, spatial_squeeze=True, 116 | final_endpoint='Logits', name='inception_i3d'): 117 | """Initializes I3D model instance. 118 | 119 | Args: 120 | num_classes: The number of outputs in the logit layer (default 400, which 121 | matches the Kinetics dataset). 122 | spatial_squeeze: Whether to squeeze the spatial dimensions for the logits 123 | before returning (default True). 124 | final_endpoint: The model contains many possible endpoints. 125 | `final_endpoint` specifies the last endpoint for the model to be built 126 | up to. In addition to the output at `final_endpoint`, all the outputs 127 | at endpoints up to `final_endpoint` will also be returned, in a 128 | dictionary. `final_endpoint` must be one of 129 | InceptionI3d.VALID_ENDPOINTS (default 'Logits'). 130 | name: A string (optional). 
The name of this module. 131 | 132 | Raises: 133 | ValueError: if `final_endpoint` is not recognized. 134 | """ 135 | 136 | if final_endpoint not in self.VALID_ENDPOINTS: 137 | raise ValueError('Unknown final endpoint %s' % final_endpoint) 138 | 139 | super(InceptionI3d, self).__init__(name=name) 140 | self._num_classes = num_classes 141 | self._spatial_squeeze = spatial_squeeze 142 | self._final_endpoint = final_endpoint 143 | 144 | def _build(self, inputs, is_training, dropout_keep_prob=1.0): 145 | """Connects the model to inputs. 146 | 147 | Args: 148 | inputs: Inputs to the model, which should have dimensions 149 | `batch_size` x `num_frames` x 224 x 224 x `num_channels`. 150 | is_training: whether to use training mode for snt.BatchNorm (boolean). 151 | dropout_keep_prob: Probability for the tf.nn.dropout layer (float in 152 | [0, 1)). 153 | 154 | Returns: 155 | A tuple consisting of: 156 | 1. Network output at location `self._final_endpoint`. 157 | 2. Dictionary containing all endpoints up to `self._final_endpoint`, 158 | indexed by endpoint name. 159 | 160 | Raises: 161 | ValueError: if `self._final_endpoint` is not recognized. 162 | """ 163 | if self._final_endpoint not in self.VALID_ENDPOINTS: 164 | raise ValueError('Unknown final endpoint %s' % self._final_endpoint) 165 | 166 | net = inputs 167 | end_points = {} 168 | end_point = 'Conv3d_1a_7x7' 169 | net = Unit3D(output_channels=64, kernel_shape=[7, 7, 7], 170 | stride=[2, 2, 2], name=end_point)(net, is_training=is_training) 171 | end_points[end_point] = net 172 | if self._final_endpoint == end_point: return net, end_points 173 | end_point = 'MaxPool3d_2a_3x3' 174 | net = tf.nn.max_pool3d(net, ksize=[1, 1, 3, 3, 1], strides=[1, 1, 2, 2, 1], 175 | padding=snt.SAME, name=end_point) 176 | end_points[end_point] = net 177 | if self._final_endpoint == end_point: return net, end_points 178 | end_point = 'Conv3d_2b_1x1' 179 | net = Unit3D(output_channels=64, kernel_shape=[1, 1, 1], 180 | name=end_point)(net, is_training=is_training) 181 | end_points[end_point] = net 182 | if self._final_endpoint == end_point: return net, end_points 183 | end_point = 'Conv3d_2c_3x3' 184 | net = Unit3D(output_channels=192, kernel_shape=[3, 3, 3], 185 | name=end_point)(net, is_training=is_training) 186 | end_points[end_point] = net 187 | if self._final_endpoint == end_point: return net, end_points 188 | end_point = 'MaxPool3d_3a_3x3' 189 | net = tf.nn.max_pool3d(net, ksize=[1, 1, 3, 3, 1], strides=[1, 1, 2, 2, 1], 190 | padding=snt.SAME, name=end_point) 191 | end_points[end_point] = net 192 | if self._final_endpoint == end_point: return net, end_points 193 | 194 | end_point = 'Mixed_3b' 195 | with tf.variable_scope(end_point): 196 | with tf.variable_scope('Branch_0'): 197 | branch_0 = Unit3D(output_channels=64, kernel_shape=[1, 1, 1], 198 | name='Conv3d_0a_1x1')(net, is_training=is_training) 199 | with tf.variable_scope('Branch_1'): 200 | branch_1 = Unit3D(output_channels=96, kernel_shape=[1, 1, 1], 201 | name='Conv3d_0a_1x1')(net, is_training=is_training) 202 | branch_1 = Unit3D(output_channels=128, kernel_shape=[3, 3, 3], 203 | name='Conv3d_0b_3x3')(branch_1, 204 | is_training=is_training) 205 | with tf.variable_scope('Branch_2'): 206 | branch_2 = Unit3D(output_channels=16, kernel_shape=[1, 1, 1], 207 | name='Conv3d_0a_1x1')(net, is_training=is_training) 208 | branch_2 = Unit3D(output_channels=32, kernel_shape=[3, 3, 3], 209 | name='Conv3d_0b_3x3')(branch_2, 210 | is_training=is_training) 211 | with tf.variable_scope('Branch_3'): 212 | branch_3 = 
tf.nn.max_pool3d(net, ksize=[1, 3, 3, 3, 1], 213 | strides=[1, 1, 1, 1, 1], padding=snt.SAME, 214 | name='MaxPool3d_0a_3x3') 215 | branch_3 = Unit3D(output_channels=32, kernel_shape=[1, 1, 1], 216 | name='Conv3d_0b_1x1')(branch_3, 217 | is_training=is_training) 218 | 219 | net = tf.concat([branch_0, branch_1, branch_2, branch_3], 4) 220 | end_points[end_point] = net 221 | if self._final_endpoint == end_point: return net, end_points 222 | 223 | end_point = 'Mixed_3c' 224 | with tf.variable_scope(end_point): 225 | with tf.variable_scope('Branch_0'): 226 | branch_0 = Unit3D(output_channels=128, kernel_shape=[1, 1, 1], 227 | name='Conv3d_0a_1x1')(net, is_training=is_training) 228 | with tf.variable_scope('Branch_1'): 229 | branch_1 = Unit3D(output_channels=128, kernel_shape=[1, 1, 1], 230 | name='Conv3d_0a_1x1')(net, is_training=is_training) 231 | branch_1 = Unit3D(output_channels=192, kernel_shape=[3, 3, 3], 232 | name='Conv3d_0b_3x3')(branch_1, 233 | is_training=is_training) 234 | with tf.variable_scope('Branch_2'): 235 | branch_2 = Unit3D(output_channels=32, kernel_shape=[1, 1, 1], 236 | name='Conv3d_0a_1x1')(net, is_training=is_training) 237 | branch_2 = Unit3D(output_channels=96, kernel_shape=[3, 3, 3], 238 | name='Conv3d_0b_3x3')(branch_2, 239 | is_training=is_training) 240 | with tf.variable_scope('Branch_3'): 241 | branch_3 = tf.nn.max_pool3d(net, ksize=[1, 3, 3, 3, 1], 242 | strides=[1, 1, 1, 1, 1], padding=snt.SAME, 243 | name='MaxPool3d_0a_3x3') 244 | branch_3 = Unit3D(output_channels=64, kernel_shape=[1, 1, 1], 245 | name='Conv3d_0b_1x1')(branch_3, 246 | is_training=is_training) 247 | net = tf.concat([branch_0, branch_1, branch_2, branch_3], 4) 248 | end_points[end_point] = net 249 | if self._final_endpoint == end_point: return net, end_points 250 | 251 | end_point = 'MaxPool3d_4a_3x3' 252 | net = tf.nn.max_pool3d(net, ksize=[1, 3, 3, 3, 1], strides=[1, 2, 2, 2, 1], 253 | padding=snt.SAME, name=end_point) 254 | end_points[end_point] = net 255 | if self._final_endpoint == end_point: return net, end_points 256 | 257 | end_point = 'Mixed_4b' 258 | with tf.variable_scope(end_point): 259 | with tf.variable_scope('Branch_0'): 260 | branch_0 = Unit3D(output_channels=192, kernel_shape=[1, 1, 1], 261 | name='Conv3d_0a_1x1')(net, is_training=is_training) 262 | with tf.variable_scope('Branch_1'): 263 | branch_1 = Unit3D(output_channels=96, kernel_shape=[1, 1, 1], 264 | name='Conv3d_0a_1x1')(net, is_training=is_training) 265 | branch_1 = Unit3D(output_channels=208, kernel_shape=[3, 3, 3], 266 | name='Conv3d_0b_3x3')(branch_1, 267 | is_training=is_training) 268 | with tf.variable_scope('Branch_2'): 269 | branch_2 = Unit3D(output_channels=16, kernel_shape=[1, 1, 1], 270 | name='Conv3d_0a_1x1')(net, is_training=is_training) 271 | branch_2 = Unit3D(output_channels=48, kernel_shape=[3, 3, 3], 272 | name='Conv3d_0b_3x3')(branch_2, 273 | is_training=is_training) 274 | with tf.variable_scope('Branch_3'): 275 | branch_3 = tf.nn.max_pool3d(net, ksize=[1, 3, 3, 3, 1], 276 | strides=[1, 1, 1, 1, 1], padding=snt.SAME, 277 | name='MaxPool3d_0a_3x3') 278 | branch_3 = Unit3D(output_channels=64, kernel_shape=[1, 1, 1], 279 | name='Conv3d_0b_1x1')(branch_3, 280 | is_training=is_training) 281 | net = tf.concat([branch_0, branch_1, branch_2, branch_3], 4) 282 | end_points[end_point] = net 283 | if self._final_endpoint == end_point: return net, end_points 284 | 285 | end_point = 'Mixed_4c' 286 | with tf.variable_scope(end_point): 287 | with tf.variable_scope('Branch_0'): 288 | branch_0 = 
Unit3D(output_channels=160, kernel_shape=[1, 1, 1], 289 | name='Conv3d_0a_1x1')(net, is_training=is_training) 290 | with tf.variable_scope('Branch_1'): 291 | branch_1 = Unit3D(output_channels=112, kernel_shape=[1, 1, 1], 292 | name='Conv3d_0a_1x1')(net, is_training=is_training) 293 | branch_1 = Unit3D(output_channels=224, kernel_shape=[3, 3, 3], 294 | name='Conv3d_0b_3x3')(branch_1, 295 | is_training=is_training) 296 | with tf.variable_scope('Branch_2'): 297 | branch_2 = Unit3D(output_channels=24, kernel_shape=[1, 1, 1], 298 | name='Conv3d_0a_1x1')(net, is_training=is_training) 299 | branch_2 = Unit3D(output_channels=64, kernel_shape=[3, 3, 3], 300 | name='Conv3d_0b_3x3')(branch_2, 301 | is_training=is_training) 302 | with tf.variable_scope('Branch_3'): 303 | branch_3 = tf.nn.max_pool3d(net, ksize=[1, 3, 3, 3, 1], 304 | strides=[1, 1, 1, 1, 1], padding=snt.SAME, 305 | name='MaxPool3d_0a_3x3') 306 | branch_3 = Unit3D(output_channels=64, kernel_shape=[1, 1, 1], 307 | name='Conv3d_0b_1x1')(branch_3, 308 | is_training=is_training) 309 | net = tf.concat([branch_0, branch_1, branch_2, branch_3], 4) 310 | end_points[end_point] = net 311 | if self._final_endpoint == end_point: return net, end_points 312 | 313 | end_point = 'Mixed_4d' 314 | with tf.variable_scope(end_point): 315 | with tf.variable_scope('Branch_0'): 316 | branch_0 = Unit3D(output_channels=128, kernel_shape=[1, 1, 1], 317 | name='Conv3d_0a_1x1')(net, is_training=is_training) 318 | with tf.variable_scope('Branch_1'): 319 | branch_1 = Unit3D(output_channels=128, kernel_shape=[1, 1, 1], 320 | name='Conv3d_0a_1x1')(net, is_training=is_training) 321 | branch_1 = Unit3D(output_channels=256, kernel_shape=[3, 3, 3], 322 | name='Conv3d_0b_3x3')(branch_1, 323 | is_training=is_training) 324 | with tf.variable_scope('Branch_2'): 325 | branch_2 = Unit3D(output_channels=24, kernel_shape=[1, 1, 1], 326 | name='Conv3d_0a_1x1')(net, is_training=is_training) 327 | branch_2 = Unit3D(output_channels=64, kernel_shape=[3, 3, 3], 328 | name='Conv3d_0b_3x3')(branch_2, 329 | is_training=is_training) 330 | with tf.variable_scope('Branch_3'): 331 | branch_3 = tf.nn.max_pool3d(net, ksize=[1, 3, 3, 3, 1], 332 | strides=[1, 1, 1, 1, 1], padding=snt.SAME, 333 | name='MaxPool3d_0a_3x3') 334 | branch_3 = Unit3D(output_channels=64, kernel_shape=[1, 1, 1], 335 | name='Conv3d_0b_1x1')(branch_3, 336 | is_training=is_training) 337 | net = tf.concat([branch_0, branch_1, branch_2, branch_3], 4) 338 | end_points[end_point] = net 339 | if self._final_endpoint == end_point: return net, end_points 340 | 341 | end_point = 'Mixed_4e' 342 | with tf.variable_scope(end_point): 343 | with tf.variable_scope('Branch_0'): 344 | branch_0 = Unit3D(output_channels=112, kernel_shape=[1, 1, 1], 345 | name='Conv3d_0a_1x1')(net, is_training=is_training) 346 | with tf.variable_scope('Branch_1'): 347 | branch_1 = Unit3D(output_channels=144, kernel_shape=[1, 1, 1], 348 | name='Conv3d_0a_1x1')(net, is_training=is_training) 349 | branch_1 = Unit3D(output_channels=288, kernel_shape=[3, 3, 3], 350 | name='Conv3d_0b_3x3')(branch_1, 351 | is_training=is_training) 352 | with tf.variable_scope('Branch_2'): 353 | branch_2 = Unit3D(output_channels=32, kernel_shape=[1, 1, 1], 354 | name='Conv3d_0a_1x1')(net, is_training=is_training) 355 | branch_2 = Unit3D(output_channels=64, kernel_shape=[3, 3, 3], 356 | name='Conv3d_0b_3x3')(branch_2, 357 | is_training=is_training) 358 | with tf.variable_scope('Branch_3'): 359 | branch_3 = tf.nn.max_pool3d(net, ksize=[1, 3, 3, 3, 1], 360 | strides=[1, 1, 1, 1, 1], 
padding=snt.SAME, 361 | name='MaxPool3d_0a_3x3') 362 | branch_3 = Unit3D(output_channels=64, kernel_shape=[1, 1, 1], 363 | name='Conv3d_0b_1x1')(branch_3, 364 | is_training=is_training) 365 | net = tf.concat([branch_0, branch_1, branch_2, branch_3], 4) 366 | end_points[end_point] = net 367 | if self._final_endpoint == end_point: return net, end_points 368 | 369 | end_point = 'Mixed_4f' 370 | with tf.variable_scope(end_point): 371 | with tf.variable_scope('Branch_0'): 372 | branch_0 = Unit3D(output_channels=256, kernel_shape=[1, 1, 1], 373 | name='Conv3d_0a_1x1')(net, is_training=is_training) 374 | with tf.variable_scope('Branch_1'): 375 | branch_1 = Unit3D(output_channels=160, kernel_shape=[1, 1, 1], 376 | name='Conv3d_0a_1x1')(net, is_training=is_training) 377 | branch_1 = Unit3D(output_channels=320, kernel_shape=[3, 3, 3], 378 | name='Conv3d_0b_3x3')(branch_1, 379 | is_training=is_training) 380 | with tf.variable_scope('Branch_2'): 381 | branch_2 = Unit3D(output_channels=32, kernel_shape=[1, 1, 1], 382 | name='Conv3d_0a_1x1')(net, is_training=is_training) 383 | branch_2 = Unit3D(output_channels=128, kernel_shape=[3, 3, 3], 384 | name='Conv3d_0b_3x3')(branch_2, 385 | is_training=is_training) 386 | with tf.variable_scope('Branch_3'): 387 | branch_3 = tf.nn.max_pool3d(net, ksize=[1, 3, 3, 3, 1], 388 | strides=[1, 1, 1, 1, 1], padding=snt.SAME, 389 | name='MaxPool3d_0a_3x3') 390 | branch_3 = Unit3D(output_channels=128, kernel_shape=[1, 1, 1], 391 | name='Conv3d_0b_1x1')(branch_3, 392 | is_training=is_training) 393 | net = tf.concat([branch_0, branch_1, branch_2, branch_3], 4) 394 | end_points[end_point] = net 395 | if self._final_endpoint == end_point: return net, end_points 396 | 397 | end_point = 'MaxPool3d_5a_2x2' 398 | net = tf.nn.max_pool3d(net, ksize=[1, 2, 2, 2, 1], strides=[1, 2, 2, 2, 1], 399 | padding=snt.SAME, name=end_point) 400 | end_points[end_point] = net 401 | if self._final_endpoint == end_point: return net, end_points 402 | 403 | end_point = 'Mixed_5b' 404 | with tf.variable_scope(end_point): 405 | with tf.variable_scope('Branch_0'): 406 | branch_0 = Unit3D(output_channels=256, kernel_shape=[1, 1, 1], 407 | name='Conv3d_0a_1x1')(net, is_training=is_training) 408 | with tf.variable_scope('Branch_1'): 409 | branch_1 = Unit3D(output_channels=160, kernel_shape=[1, 1, 1], 410 | name='Conv3d_0a_1x1')(net, is_training=is_training) 411 | branch_1 = Unit3D(output_channels=320, kernel_shape=[3, 3, 3], 412 | name='Conv3d_0b_3x3')(branch_1, 413 | is_training=is_training) 414 | with tf.variable_scope('Branch_2'): 415 | branch_2 = Unit3D(output_channels=32, kernel_shape=[1, 1, 1], 416 | name='Conv3d_0a_1x1')(net, is_training=is_training) 417 | branch_2 = Unit3D(output_channels=128, kernel_shape=[3, 3, 3], 418 | name='Conv3d_0a_3x3')(branch_2, 419 | is_training=is_training) 420 | with tf.variable_scope('Branch_3'): 421 | branch_3 = tf.nn.max_pool3d(net, ksize=[1, 3, 3, 3, 1], 422 | strides=[1, 1, 1, 1, 1], padding=snt.SAME, 423 | name='MaxPool3d_0a_3x3') 424 | branch_3 = Unit3D(output_channels=128, kernel_shape=[1, 1, 1], 425 | name='Conv3d_0b_1x1')(branch_3, 426 | is_training=is_training) 427 | net = tf.concat([branch_0, branch_1, branch_2, branch_3], 4) 428 | end_points[end_point] = net 429 | if self._final_endpoint == end_point: return net, end_points 430 | 431 | end_point = 'Mixed_5c' 432 | with tf.variable_scope(end_point): 433 | with tf.variable_scope('Branch_0'): 434 | branch_0 = Unit3D(output_channels=384, kernel_shape=[1, 1, 1], 435 | name='Conv3d_0a_1x1')(net, 
is_training=is_training) 436 | with tf.variable_scope('Branch_1'): 437 | branch_1 = Unit3D(output_channels=192, kernel_shape=[1, 1, 1], 438 | name='Conv3d_0a_1x1')(net, is_training=is_training) 439 | branch_1 = Unit3D(output_channels=384, kernel_shape=[3, 3, 3], 440 | name='Conv3d_0b_3x3')(branch_1, 441 | is_training=is_training) 442 | with tf.variable_scope('Branch_2'): 443 | branch_2 = Unit3D(output_channels=48, kernel_shape=[1, 1, 1], 444 | name='Conv3d_0a_1x1')(net, is_training=is_training) 445 | branch_2 = Unit3D(output_channels=128, kernel_shape=[3, 3, 3], 446 | name='Conv3d_0b_3x3')(branch_2, 447 | is_training=is_training) 448 | with tf.variable_scope('Branch_3'): 449 | branch_3 = tf.nn.max_pool3d(net, ksize=[1, 3, 3, 3, 1], 450 | strides=[1, 1, 1, 1, 1], padding=snt.SAME, 451 | name='MaxPool3d_0a_3x3') 452 | branch_3 = Unit3D(output_channels=128, kernel_shape=[1, 1, 1], 453 | name='Conv3d_0b_1x1')(branch_3, 454 | is_training=is_training) 455 | net = tf.concat([branch_0, branch_1, branch_2, branch_3], 4) 456 | end_points[end_point] = net 457 | if self._final_endpoint == end_point: return net, end_points 458 | 459 | end_point = 'Logits' 460 | with tf.variable_scope(end_point): 461 | net = tf.nn.avg_pool3d(net, ksize=[1, 2, 7, 7, 1], 462 | strides=[1, 1, 1, 1, 1], padding=snt.VALID) 463 | net = tf.nn.dropout(net, dropout_keep_prob) 464 | logits = Unit3D(output_channels=self._num_classes, 465 | kernel_shape=[1, 1, 1], 466 | activation_fn=None, 467 | use_batch_norm=False, 468 | use_bias=True, 469 | name='Conv3d_0c_1x1')(net, is_training=is_training) 470 | if self._spatial_squeeze: 471 | logits = tf.squeeze(logits, [2, 3], name='SpatialSqueeze') 472 | 473 | if self._final_endpoint == end_point: return logits, end_points 474 | averaged_logits = tf.reduce_mean(logits, axis=1) 475 | end_points[end_point] = averaged_logits 476 | 477 | end_point = 'Predictions' 478 | predictions = tf.nn.softmax(averaged_logits) 479 | end_points[end_point] = predictions 480 | return predictions, end_points 481 | -------------------------------------------------------------------------------- /model/nets/resnet_utils.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """Contains building blocks for various versions of Residual Networks. 16 | 17 | Residual networks (ResNets) were proposed in: 18 | Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun 19 | Deep Residual Learning for Image Recognition. arXiv:1512.03385, 2015 20 | 21 | More variants were introduced in: 22 | Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun 23 | Identity Mappings in Deep Residual Networks. arXiv: 1603.05027, 2016 24 | 25 | We can obtain different ResNet variants by changing the network depth, width, 26 | and form of residual unit. 
This module implements the infrastructure for 27 | building them. Concrete ResNet units and full ResNet networks are implemented in 28 | the accompanying resnet_v1.py and resnet_v2.py modules. 29 | 30 | Compared to https://github.com/KaimingHe/deep-residual-networks, in the current 31 | implementation we subsample the output activations in the last residual unit of 32 | each block, instead of subsampling the input activations in the first residual 33 | unit of each block. The two implementations give identical results but our 34 | implementation is more memory efficient. 35 | """ 36 | from __future__ import absolute_import 37 | from __future__ import division 38 | from __future__ import print_function 39 | 40 | import collections 41 | import tensorflow as tf 42 | 43 | slim = tf.contrib.slim 44 | 45 | 46 | class Block(collections.namedtuple('Block', ['scope', 'unit_fn', 'args'])): 47 | """A named tuple describing a ResNet block. 48 | 49 | Its parts are: 50 | scope: The scope of the `Block`. 51 | unit_fn: The ResNet unit function which takes as input a `Tensor` and 52 | returns another `Tensor` with the output of the ResNet unit. 53 | args: A list of length equal to the number of units in the `Block`. The list 54 | contains one (depth, depth_bottleneck, stride) tuple for each unit in the 55 | block to serve as argument to unit_fn. 56 | """ 57 | 58 | 59 | def subsample(inputs, factor, scope=None): 60 | """Subsamples the input along the spatial dimensions. 61 | 62 | Args: 63 | inputs: A `Tensor` of size [batch, height_in, width_in, channels]. 64 | factor: The subsampling factor. 65 | scope: Optional variable_scope. 66 | 67 | Returns: 68 | output: A `Tensor` of size [batch, height_out, width_out, channels] with the 69 | input, either intact (if factor == 1) or subsampled (if factor > 1). 70 | """ 71 | if factor == 1: 72 | return inputs 73 | else: 74 | return slim.max_pool2d(inputs, [1, 1], stride=factor, scope=scope) 75 | 76 | 77 | def conv2d_same(inputs, num_outputs, kernel_size, stride, rate=1, scope=None): 78 | """Strided 2-D convolution with 'SAME' padding. 79 | 80 | When stride > 1, then we do explicit zero-padding, followed by conv2d with 81 | 'VALID' padding. 82 | 83 | Note that 84 | 85 | net = conv2d_same(inputs, num_outputs, 3, stride=stride) 86 | 87 | is equivalent to 88 | 89 | net = slim.conv2d(inputs, num_outputs, 3, stride=1, padding='SAME') 90 | net = subsample(net, factor=stride) 91 | 92 | whereas 93 | 94 | net = slim.conv2d(inputs, num_outputs, 3, stride=stride, padding='SAME') 95 | 96 | is different when the input's height or width is even, which is why we add the 97 | current function. For more details, see ResnetUtilsTest.testConv2DSameEven(). 98 | 99 | Args: 100 | inputs: A 4-D tensor of size [batch, height_in, width_in, channels]. 101 | num_outputs: An integer, the number of output filters. 102 | kernel_size: An int with the kernel_size of the filters. 103 | stride: An integer, the output stride. 104 | rate: An integer, rate for atrous convolution. 105 | scope: Scope. 106 | 107 | Returns: 108 | output: A 4-D tensor of size [batch, height_out, width_out, channels] with 109 | the convolution output. 
110 | """ 111 | if stride == 1: 112 | return slim.conv2d(inputs, num_outputs, kernel_size, stride=1, rate=rate, 113 | padding='SAME', scope=scope) 114 | else: 115 | kernel_size_effective = kernel_size + (kernel_size - 1) * (rate - 1) 116 | pad_total = kernel_size_effective - 1 117 | pad_beg = pad_total // 2 118 | pad_end = pad_total - pad_beg 119 | inputs = tf.pad(inputs, 120 | [[0, 0], [pad_beg, pad_end], [pad_beg, pad_end], [0, 0]]) 121 | return slim.conv2d(inputs, num_outputs, kernel_size, stride=stride, 122 | rate=rate, padding='VALID', scope=scope) 123 | 124 | 125 | @slim.add_arg_scope 126 | def stack_blocks_dense(net, blocks, output_stride=None, 127 | outputs_collections=None): 128 | """Stacks ResNet `Blocks` and controls output feature density. 129 | 130 | First, this function creates scopes for the ResNet in the form of 131 | 'block_name/unit_1', 'block_name/unit_2', etc. 132 | 133 | Second, this function allows the user to explicitly control the ResNet 134 | output_stride, which is the ratio of the input to output spatial resolution. 135 | This is useful for dense prediction tasks such as semantic segmentation or 136 | object detection. 137 | 138 | Most ResNets consist of 4 ResNet blocks and subsample the activations by a 139 | factor of 2 when transitioning between consecutive ResNet blocks. This results 140 | to a nominal ResNet output_stride equal to 8. If we set the output_stride to 141 | half the nominal network stride (e.g., output_stride=4), then we compute 142 | responses twice. 143 | 144 | Control of the output feature density is implemented by atrous convolution. 145 | 146 | Args: 147 | net: A `Tensor` of size [batch, height, width, channels]. 148 | blocks: A list of length equal to the number of ResNet `Blocks`. Each 149 | element is a ResNet `Block` object describing the units in the `Block`. 150 | output_stride: If `None`, then the output will be computed at the nominal 151 | network stride. If output_stride is not `None`, it specifies the requested 152 | ratio of input to output spatial resolution, which needs to be equal to 153 | the product of unit strides from the start up to some level of the ResNet. 154 | For example, if the ResNet employs units with strides 1, 2, 1, 3, 4, 1, 155 | then valid values for the output_stride are 1, 2, 6, 24 or None (which 156 | is equivalent to output_stride=24). 157 | outputs_collections: Collection to add the ResNet block outputs. 158 | 159 | Returns: 160 | net: Output tensor with stride equal to the specified output_stride. 161 | 162 | Raises: 163 | ValueError: If the target output_stride is not valid. 164 | """ 165 | # The current_stride variable keeps track of the effective stride of the 166 | # activations. This allows us to invoke atrous convolution whenever applying 167 | # the next residual unit would result in the activations having stride larger 168 | # than the target output_stride. 169 | current_stride = 1 170 | 171 | # The atrous convolution rate parameter. 
172 | rate = 1 173 | 174 | for block in blocks: 175 | with tf.variable_scope(block.scope, 'block', [net]) as sc: 176 | for i, unit in enumerate(block.args): 177 | if output_stride is not None and current_stride > output_stride: 178 | raise ValueError('The target output_stride cannot be reached.') 179 | 180 | with tf.variable_scope('unit_%d' % (i + 1), values=[net]): 181 | # If we have reached the target output_stride, then we need to employ 182 | # atrous convolution with stride=1 and multiply the atrous rate by the 183 | # current unit's stride for use in subsequent layers. 184 | if output_stride is not None and current_stride == output_stride: 185 | net = block.unit_fn(net, rate=rate, **dict(unit, stride=1)) 186 | rate *= unit.get('stride', 1) 187 | 188 | else: 189 | net = block.unit_fn(net, rate=1, **unit) 190 | current_stride *= unit.get('stride', 1) 191 | net = slim.utils.collect_named_outputs(outputs_collections, sc.name, net) 192 | 193 | if output_stride is not None and current_stride != output_stride: 194 | raise ValueError('The target output_stride cannot be reached.') 195 | 196 | return net 197 | 198 | 199 | def resnet_arg_scope(weight_decay=0.0001, 200 | batch_norm_decay=0.997, 201 | batch_norm_epsilon=1e-5, 202 | batch_norm_scale=True, 203 | activation_fn=tf.nn.relu, 204 | use_batch_norm=True): 205 | """Defines the default ResNet arg scope. 206 | 207 | TODO(gpapan): The batch-normalization related default values above are 208 | appropriate for use in conjunction with the reference ResNet models 209 | released at https://github.com/KaimingHe/deep-residual-networks. When 210 | training ResNets from scratch, they might need to be tuned. 211 | 212 | Args: 213 | weight_decay: The weight decay to use for regularizing the model. 214 | batch_norm_decay: The moving average decay when estimating layer activation 215 | statistics in batch normalization. 216 | batch_norm_epsilon: Small constant to prevent division by zero when 217 | normalizing activations by their variance in batch normalization. 218 | batch_norm_scale: If True, uses an explicit `gamma` multiplier to scale the 219 | activations in the batch normalization layer. 220 | activation_fn: The activation function which is used in ResNet. 221 | use_batch_norm: Whether or not to use batch normalization. 222 | 223 | Returns: 224 | An `arg_scope` to use for the resnet models. 225 | """ 226 | batch_norm_params = { 227 | 'decay': batch_norm_decay, 228 | 'epsilon': batch_norm_epsilon, 229 | 'scale': batch_norm_scale, 230 | 'updates_collections': tf.GraphKeys.UPDATE_OPS, 231 | 'fused': None, # Use fused batch norm if possible. 232 | } 233 | 234 | with slim.arg_scope( 235 | [slim.conv2d], 236 | weights_regularizer=slim.l2_regularizer(weight_decay), 237 | weights_initializer=slim.variance_scaling_initializer(), 238 | activation_fn=activation_fn, 239 | normalizer_fn=slim.batch_norm if use_batch_norm else None, 240 | normalizer_params=batch_norm_params): 241 | with slim.arg_scope([slim.batch_norm], **batch_norm_params): 242 | # The following implies padding='SAME' for pool1, which makes feature 243 | # alignment easier for dense prediction tasks. This is also used in 244 | # https://github.com/facebook/fb.resnet.torch. However the accompanying 245 | # code of 'Deep Residual Learning for Image Recognition' uses 246 | # padding='VALID' for pool1. You can switch to that choice by setting 247 | # slim.arg_scope([slim.max_pool2d], padding='VALID'). 
248 | with slim.arg_scope([slim.max_pool2d], padding='SAME') as arg_sc: 249 | return arg_sc 250 | -------------------------------------------------------------------------------- /model/nets/resnet_v1.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """Contains definitions for the original form of Residual Networks. 16 | 17 | The 'v1' residual networks (ResNets) implemented in this module were proposed 18 | by: 19 | [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun 20 | Deep Residual Learning for Image Recognition. arXiv:1512.03385 21 | 22 | Other variants were introduced in: 23 | [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun 24 | Identity Mappings in Deep Residual Networks. arXiv: 1603.05027 25 | 26 | The networks defined in this module utilize the bottleneck building block of 27 | [1] with projection shortcuts only for increasing depths. They employ batch 28 | normalization *after* every weight layer. This is the architecture used by 29 | MSRA in the Imagenet and MSCOCO 2016 competition models ResNet-101 and 30 | ResNet-152. See [2; Fig. 1a] for a comparison between the current 'v1' 31 | architecture and the alternative 'v2' architecture of [2] which uses batch 32 | normalization *before* every weight layer in the so-called full pre-activation 33 | units. 34 | 35 | Typical use: 36 | 37 | from tensorflow.contrib.slim.nets import resnet_v1 38 | 39 | ResNet-101 for image classification into 1000 classes: 40 | 41 | # inputs has shape [batch, 224, 224, 3] 42 | with slim.arg_scope(resnet_v1.resnet_arg_scope()): 43 | net, end_points = resnet_v1.resnet_v1_101(inputs, 1000, is_training=False) 44 | 45 | ResNet-101 for semantic segmentation into 21 classes: 46 | 47 | # inputs has shape [batch, 513, 513, 3] 48 | with slim.arg_scope(resnet_v1.resnet_arg_scope()): 49 | net, end_points = resnet_v1.resnet_v1_101(inputs, 50 | 21, 51 | is_training=False, 52 | global_pool=False, 53 | output_stride=16) 54 | """ 55 | from __future__ import absolute_import 56 | from __future__ import division 57 | from __future__ import print_function 58 | 59 | import tensorflow as tf 60 | 61 | from . import resnet_utils 62 | 63 | 64 | resnet_arg_scope = resnet_utils.resnet_arg_scope 65 | slim = tf.contrib.slim 66 | 67 | 68 | @slim.add_arg_scope 69 | def bottleneck(inputs, 70 | depth, 71 | depth_bottleneck, 72 | stride, 73 | rate=1, 74 | outputs_collections=None, 75 | scope=None, 76 | use_bounded_activations=False): 77 | """Bottleneck residual unit variant with BN after convolutions. 78 | 79 | This is the original residual unit proposed in [1]. See Fig. 1(a) of [2] for 80 | its definition. Note that we use here the bottleneck variant which has an 81 | extra bottleneck layer. 
82 | 83 | When putting together two consecutive ResNet blocks that use this unit, one 84 | should use stride = 2 in the last unit of the first block. 85 | 86 | Args: 87 | inputs: A tensor of size [batch, height, width, channels]. 88 | depth: The depth of the ResNet unit output. 89 | depth_bottleneck: The depth of the bottleneck layers. 90 | stride: The ResNet unit's stride. Determines the amount of downsampling of 91 | the units output compared to its input. 92 | rate: An integer, rate for atrous convolution. 93 | outputs_collections: Collection to add the ResNet unit output. 94 | scope: Optional variable_scope. 95 | use_bounded_activations: Whether or not to use bounded activations. Bounded 96 | activations better lend themselves to quantized inference. 97 | 98 | Returns: 99 | The ResNet unit's output. 100 | """ 101 | with tf.variable_scope(scope, 'bottleneck_v1', [inputs]) as sc: 102 | depth_in = slim.utils.last_dimension(inputs.get_shape(), min_rank=4) 103 | if depth == depth_in: 104 | shortcut = resnet_utils.subsample(inputs, stride, 'shortcut') 105 | else: 106 | shortcut = slim.conv2d( 107 | inputs, 108 | depth, [1, 1], 109 | stride=stride, 110 | activation_fn=tf.nn.relu6 if use_bounded_activations else None, 111 | scope='shortcut') 112 | 113 | residual = slim.conv2d(inputs, depth_bottleneck, [1, 1], stride=1, 114 | scope='conv1') 115 | residual = resnet_utils.conv2d_same(residual, depth_bottleneck, 3, stride, 116 | rate=rate, scope='conv2') 117 | residual = slim.conv2d(residual, depth, [1, 1], stride=1, 118 | activation_fn=None, scope='conv3') 119 | 120 | if use_bounded_activations: 121 | # Use clip_by_value to simulate bandpass activation. 122 | residual = tf.clip_by_value(residual, -6.0, 6.0) 123 | output = tf.nn.relu6(shortcut + residual) 124 | else: 125 | output = tf.nn.relu(shortcut + residual) 126 | 127 | return slim.utils.collect_named_outputs(outputs_collections, 128 | sc.name, 129 | output) 130 | 131 | 132 | def resnet_v1(inputs, 133 | blocks, 134 | num_classes=None, 135 | is_training=True, 136 | global_pool=True, 137 | output_stride=None, 138 | include_root_block=True, 139 | spatial_squeeze=True, 140 | reuse=None, 141 | scope=None): 142 | """Generator for v1 ResNet models. 143 | 144 | This function generates a family of ResNet v1 models. See the resnet_v1_*() 145 | methods for specific model instantiations, obtained by selecting different 146 | block instantiations that produce ResNets of various depths. 147 | 148 | Training for image classification on Imagenet is usually done with [224, 224] 149 | inputs, resulting in [7, 7] feature maps at the output of the last ResNet 150 | block for the ResNets defined in [1] that have nominal stride equal to 32. 151 | However, for dense prediction tasks we advise that one uses inputs with 152 | spatial dimensions that are multiples of 32 plus 1, e.g., [321, 321]. In 153 | this case the feature maps at the ResNet output will have spatial shape 154 | [(height - 1) / output_stride + 1, (width - 1) / output_stride + 1] 155 | and corners exactly aligned with the input image corners, which greatly 156 | facilitates alignment of the features to the image. Using as input [225, 225] 157 | images results in [8, 8] feature maps at the output of the last ResNet block. 158 | 159 | For dense prediction tasks, the ResNet needs to run in fully-convolutional 160 | (FCN) mode and global_pool needs to be set to False. 
The ResNets in [1, 2] all 161 | have nominal stride equal to 32 and a good choice in FCN mode is to use 162 | output_stride=16 in order to increase the density of the computed features at 163 | small computational and memory overhead, cf. http://arxiv.org/abs/1606.00915. 164 | 165 | Args: 166 | inputs: A tensor of size [batch, height_in, width_in, channels]. 167 | blocks: A list of length equal to the number of ResNet blocks. Each element 168 | is a resnet_utils.Block object describing the units in the block. 169 | num_classes: Number of predicted classes for classification tasks. 170 | If 0 or None, we return the features before the logit layer. 171 | is_training: whether batch_norm layers are in training mode. 172 | global_pool: If True, we perform global average pooling before computing the 173 | logits. Set to True for image classification, False for dense prediction. 174 | output_stride: If None, then the output will be computed at the nominal 175 | network stride. If output_stride is not None, it specifies the requested 176 | ratio of input to output spatial resolution. 177 | include_root_block: If True, include the initial convolution followed by 178 | max-pooling, if False excludes it. 179 | spatial_squeeze: if True, logits is of shape [B, C], if false logits is 180 | of shape [B, 1, 1, C], where B is batch_size and C is number of classes. 181 | To use this parameter, the input images must be smaller than 300x300 182 | pixels, in which case the output logit layer does not contain spatial 183 | information and can be removed. 184 | reuse: whether or not the network and its variables should be reused. To be 185 | able to reuse 'scope' must be given. 186 | scope: Optional variable_scope. 187 | 188 | Returns: 189 | net: A rank-4 tensor of size [batch, height_out, width_out, channels_out]. 190 | If global_pool is False, then height_out and width_out are reduced by a 191 | factor of output_stride compared to the respective height_in and width_in, 192 | else both height_out and width_out equal one. If num_classes is 0 or None, 193 | then net is the output of the last ResNet block, potentially after global 194 | average pooling. If num_classes a non-zero integer, net contains the 195 | pre-softmax activations. 196 | end_points: A dictionary from components of the network to the corresponding 197 | activation. 198 | 199 | Raises: 200 | ValueError: If the target output_stride is not valid. 201 | """ 202 | with tf.variable_scope(scope, 'resnet_v1', [inputs], reuse=reuse) as sc: 203 | end_points_collection = sc.original_name_scope + '_end_points' 204 | with slim.arg_scope([slim.conv2d, bottleneck, 205 | resnet_utils.stack_blocks_dense], 206 | outputs_collections=end_points_collection): 207 | with slim.arg_scope([slim.batch_norm], is_training=is_training): 208 | net = inputs 209 | if include_root_block: 210 | if output_stride is not None: 211 | if output_stride % 4 != 0: 212 | raise ValueError('The output_stride needs to be a multiple of 4.') 213 | output_stride /= 4 214 | net = resnet_utils.conv2d_same(net, 64, 7, stride=2, scope='conv1') 215 | net = slim.max_pool2d(net, [3, 3], stride=2, scope='pool1') 216 | net = resnet_utils.stack_blocks_dense(net, blocks, output_stride) 217 | # Convert end_points_collection into a dictionary of end_points. 218 | end_points = slim.utils.convert_collection_to_dict( 219 | end_points_collection) 220 | 221 | if global_pool: 222 | # Global average pooling. 
223 | net = tf.reduce_mean(net, [1, 2], name='pool5', keep_dims=True) 224 | end_points['global_pool'] = net 225 | if num_classes: 226 | net = slim.conv2d(net, num_classes, [1, 1], activation_fn=None, 227 | normalizer_fn=None, scope='logits') 228 | end_points[sc.name + '/logits'] = net 229 | if spatial_squeeze: 230 | net = tf.squeeze(net, [1, 2], name='SpatialSqueeze') 231 | end_points[sc.name + '/spatial_squeeze'] = net 232 | end_points['predictions'] = slim.softmax(net, scope='predictions') 233 | return net, end_points 234 | resnet_v1.default_image_size = 224 235 | 236 | 237 | def resnet_v1_block(scope, base_depth, num_units, stride): 238 | """Helper function for creating a resnet_v1 bottleneck block. 239 | 240 | Args: 241 | scope: The scope of the block. 242 | base_depth: The depth of the bottleneck layer for each unit. 243 | num_units: The number of units in the block. 244 | stride: The stride of the block, implemented as a stride in the last unit. 245 | All other units have stride=1. 246 | 247 | Returns: 248 | A resnet_v1 bottleneck block. 249 | """ 250 | return resnet_utils.Block(scope, bottleneck, [{ 251 | 'depth': base_depth * 4, 252 | 'depth_bottleneck': base_depth, 253 | 'stride': 1 254 | }] * (num_units - 1) + [{ 255 | 'depth': base_depth * 4, 256 | 'depth_bottleneck': base_depth, 257 | 'stride': stride 258 | }]) 259 | 260 | 261 | def resnet_v1_50(inputs, 262 | num_classes=None, 263 | is_training=True, 264 | global_pool=True, 265 | output_stride=None, 266 | spatial_squeeze=True, 267 | reuse=None, 268 | scope='resnet_v1_50'): 269 | """ResNet-50 model of [1]. See resnet_v1() for arg and return description.""" 270 | blocks = [ 271 | resnet_v1_block('block1', base_depth=64, num_units=3, stride=2), 272 | resnet_v1_block('block2', base_depth=128, num_units=4, stride=2), 273 | resnet_v1_block('block3', base_depth=256, num_units=6, stride=2), 274 | resnet_v1_block('block4', base_depth=512, num_units=3, stride=1), 275 | ] 276 | return resnet_v1(inputs, blocks, num_classes, is_training, 277 | global_pool=global_pool, output_stride=output_stride, 278 | include_root_block=True, spatial_squeeze=spatial_squeeze, 279 | reuse=reuse, scope=scope) 280 | resnet_v1_50.default_image_size = resnet_v1.default_image_size 281 | 282 | 283 | def resnet_v1_101(inputs, 284 | num_classes=None, 285 | is_training=True, 286 | global_pool=True, 287 | output_stride=None, 288 | spatial_squeeze=True, 289 | reuse=None, 290 | scope='resnet_v1_101'): 291 | """ResNet-101 model of [1]. See resnet_v1() for arg and return description.""" 292 | blocks = [ 293 | resnet_v1_block('block1', base_depth=64, num_units=3, stride=2), 294 | resnet_v1_block('block2', base_depth=128, num_units=4, stride=2), 295 | resnet_v1_block('block3', base_depth=256, num_units=23, stride=2), 296 | resnet_v1_block('block4', base_depth=512, num_units=3, stride=1), 297 | ] 298 | return resnet_v1(inputs, blocks, num_classes, is_training, 299 | global_pool=global_pool, output_stride=output_stride, 300 | include_root_block=True, spatial_squeeze=spatial_squeeze, 301 | reuse=reuse, scope=scope) 302 | resnet_v1_101.default_image_size = resnet_v1.default_image_size 303 | 304 | 305 | def resnet_v1_152(inputs, 306 | num_classes=None, 307 | is_training=True, 308 | global_pool=True, 309 | output_stride=None, 310 | spatial_squeeze=True, 311 | reuse=None, 312 | scope='resnet_v1_152'): 313 | """ResNet-152 model of [1]. 
See resnet_v1() for arg and return description.""" 314 | blocks = [ 315 | resnet_v1_block('block1', base_depth=64, num_units=3, stride=2), 316 | resnet_v1_block('block2', base_depth=128, num_units=8, stride=2), 317 | resnet_v1_block('block3', base_depth=256, num_units=36, stride=2), 318 | resnet_v1_block('block4', base_depth=512, num_units=3, stride=1), 319 | ] 320 | return resnet_v1(inputs, blocks, num_classes, is_training, 321 | global_pool=global_pool, output_stride=output_stride, 322 | include_root_block=True, spatial_squeeze=spatial_squeeze, 323 | reuse=reuse, scope=scope) 324 | resnet_v1_152.default_image_size = resnet_v1.default_image_size 325 | 326 | 327 | def resnet_v1_200(inputs, 328 | num_classes=None, 329 | is_training=True, 330 | global_pool=True, 331 | output_stride=None, 332 | spatial_squeeze=True, 333 | reuse=None, 334 | scope='resnet_v1_200'): 335 | """ResNet-200 model of [2]. See resnet_v1() for arg and return description.""" 336 | blocks = [ 337 | resnet_v1_block('block1', base_depth=64, num_units=3, stride=2), 338 | resnet_v1_block('block2', base_depth=128, num_units=24, stride=2), 339 | resnet_v1_block('block3', base_depth=256, num_units=36, stride=2), 340 | resnet_v1_block('block4', base_depth=512, num_units=3, stride=1), 341 | ] 342 | return resnet_v1(inputs, blocks, num_classes, is_training, 343 | global_pool=global_pool, output_stride=output_stride, 344 | include_root_block=True, spatial_squeeze=spatial_squeeze, 345 | reuse=reuse, scope=scope) 346 | resnet_v1_200.default_image_size = resnet_v1.default_image_size 347 | -------------------------------------------------------------------------------- /model/nets/vgg_preprocessing.py: -------------------------------------------------------------------------------- 1 | # Copyright 2016 The TensorFlow Authors. All Rights Reserved. 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # http://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # ============================================================================== 15 | """Provides utilities to preprocess images. 
16 | 17 | The preprocessing steps for VGG were introduced in the following technical 18 | report: 19 | 20 | Very Deep Convolutional Networks For Large-Scale Image Recognition 21 | Karen Simonyan and Andrew Zisserman 22 | arXiv technical report, 2015 23 | PDF: http://arxiv.org/pdf/1409.1556.pdf 24 | ILSVRC 2014 Slides: http://www.robots.ox.ac.uk/~karen/pdf/ILSVRC_2014.pdf 25 | CC-BY-4.0 26 | 27 | More information can be obtained from the VGG website: 28 | www.robots.ox.ac.uk/~vgg/research/very_deep/ 29 | """ 30 | 31 | from __future__ import absolute_import 32 | from __future__ import division 33 | from __future__ import print_function 34 | 35 | import tensorflow as tf 36 | 37 | slim = tf.contrib.slim 38 | 39 | _R_MEAN = 123.68 40 | _G_MEAN = 116.78 41 | _B_MEAN = 103.94 42 | 43 | _RESIZE_SIDE_MIN = 256 44 | _RESIZE_SIDE_MAX = 512 45 | 46 | 47 | def _crop(image, offset_height, offset_width, crop_height, crop_width): 48 | """Crops the given image using the provided offsets and sizes. 49 | 50 | Note that the method doesn't assume we know the input image size but it does 51 | assume we know the input image rank. 52 | 53 | Args: 54 | image: an image of shape [height, width, channels]. 55 | offset_height: a scalar tensor indicating the height offset. 56 | offset_width: a scalar tensor indicating the width offset. 57 | crop_height: the height of the cropped image. 58 | crop_width: the width of the cropped image. 59 | 60 | Returns: 61 | the cropped (and resized) image. 62 | 63 | Raises: 64 | InvalidArgumentError: if the rank is not 3 or if the image dimensions are 65 | less than the crop size. 66 | """ 67 | original_shape = tf.shape(image) 68 | 69 | rank_assertion = tf.Assert( 70 | tf.equal(tf.rank(image), 3), 71 | ['Rank of image must be equal to 3.']) 72 | with tf.control_dependencies([rank_assertion]): 73 | cropped_shape = tf.stack([crop_height, crop_width, original_shape[2]]) 74 | 75 | size_assertion = tf.Assert( 76 | tf.logical_and( 77 | tf.greater_equal(original_shape[0], crop_height), 78 | tf.greater_equal(original_shape[1], crop_width)), 79 | ['Crop size greater than the image size.']) 80 | 81 | offsets = tf.to_int32(tf.stack([offset_height, offset_width, 0])) 82 | 83 | # Use tf.slice instead of crop_to_bounding box as it accepts tensors to 84 | # define the crop size. 85 | with tf.control_dependencies([size_assertion]): 86 | image = tf.slice(image, offsets, cropped_shape) 87 | return tf.reshape(image, cropped_shape) 88 | 89 | 90 | def _random_crop(image_list, crop_height, crop_width): 91 | """Crops the given list of images. 92 | 93 | The function applies the same crop to each image in the list. This can be 94 | effectively applied when there are multiple image inputs of the same 95 | dimension such as: 96 | 97 | image, depths, normals = _random_crop([image, depths, normals], 120, 150) 98 | 99 | Args: 100 | image_list: a list of image tensors of the same dimension but possibly 101 | varying channel. 102 | crop_height: the new height. 103 | crop_width: the new width. 104 | 105 | Returns: 106 | the image_list with cropped images. 107 | 108 | Raises: 109 | ValueError: if there are multiple image inputs provided with different size 110 | or the images are smaller than the crop dimensions. 111 | """ 112 | if not image_list: 113 | raise ValueError('Empty image_list.') 114 | 115 | # Compute the rank assertions. 
116 | rank_assertions = [] 117 | for i in range(len(image_list)): 118 | image_rank = tf.rank(image_list[i]) 119 | rank_assert = tf.Assert( 120 | tf.equal(image_rank, 3), 121 | ['Wrong rank for tensor %s [expected] [actual]', 122 | image_list[i].name, 3, image_rank]) 123 | rank_assertions.append(rank_assert) 124 | 125 | with tf.control_dependencies([rank_assertions[0]]): 126 | image_shape = tf.shape(image_list[0]) 127 | image_height = image_shape[0] 128 | image_width = image_shape[1] 129 | crop_size_assert = tf.Assert( 130 | tf.logical_and( 131 | tf.greater_equal(image_height, crop_height), 132 | tf.greater_equal(image_width, crop_width)), 133 | ['Crop size greater than the image size.']) 134 | 135 | asserts = [rank_assertions[0], crop_size_assert] 136 | 137 | for i in range(1, len(image_list)): 138 | image = image_list[i] 139 | asserts.append(rank_assertions[i]) 140 | with tf.control_dependencies([rank_assertions[i]]): 141 | shape = tf.shape(image) 142 | height = shape[0] 143 | width = shape[1] 144 | 145 | height_assert = tf.Assert( 146 | tf.equal(height, image_height), 147 | ['Wrong height for tensor %s [expected][actual]', 148 | image.name, height, image_height]) 149 | width_assert = tf.Assert( 150 | tf.equal(width, image_width), 151 | ['Wrong width for tensor %s [expected][actual]', 152 | image.name, width, image_width]) 153 | asserts.extend([height_assert, width_assert]) 154 | 155 | # Create a random bounding box. 156 | # 157 | # Use tf.random_uniform and not numpy.random.rand as doing the former would 158 | # generate random numbers at graph eval time, unlike the latter which 159 | # generates random numbers at graph definition time. 160 | with tf.control_dependencies(asserts): 161 | max_offset_height = tf.reshape(image_height - crop_height + 1, []) 162 | with tf.control_dependencies(asserts): 163 | max_offset_width = tf.reshape(image_width - crop_width + 1, []) 164 | offset_height = tf.random_uniform( 165 | [], maxval=max_offset_height, dtype=tf.int32) 166 | offset_width = tf.random_uniform( 167 | [], maxval=max_offset_width, dtype=tf.int32) 168 | 169 | return [_crop(image, offset_height, offset_width, 170 | crop_height, crop_width) for image in image_list] 171 | 172 | 173 | def _central_crop(image_list, crop_height, crop_width): 174 | """Performs central crops of the given image list. 175 | 176 | Args: 177 | image_list: a list of image tensors of the same dimension but possibly 178 | varying channel. 179 | crop_height: the height of the image following the crop. 180 | crop_width: the width of the image following the crop. 181 | 182 | Returns: 183 | the list of cropped images. 184 | """ 185 | outputs = [] 186 | for image in image_list: 187 | image_height = tf.shape(image)[0] 188 | image_width = tf.shape(image)[1] 189 | 190 | offset_height = (image_height - crop_height) / 2 191 | offset_width = (image_width - crop_width) / 2 192 | 193 | outputs.append(_crop(image, offset_height, offset_width, 194 | crop_height, crop_width)) 195 | return outputs 196 | 197 | 198 | def _mean_image_subtraction(image, means): 199 | """Subtracts the given means from each image channel. 200 | 201 | For example: 202 | means = [123.68, 116.779, 103.939] 203 | image = _mean_image_subtraction(image, means) 204 | 205 | Note that the rank of `image` must be known. 206 | 207 | Args: 208 | image: a tensor of size [height, width, C]. 209 | means: a C-vector of values to subtract from each channel. 210 | 211 | Returns: 212 | the centered image. 
213 | 214 | Raises: 215 | ValueError: If the rank of `image` is unknown, if `image` has a rank other 216 | than three or if the number of channels in `image` doesn't match the 217 | number of values in `means`. 218 | """ 219 | if image.get_shape().ndims != 3: 220 | raise ValueError('Input must be of size [height, width, C>0]') 221 | num_channels = image.get_shape().as_list()[-1] 222 | if len(means) != num_channels: 223 | raise ValueError('len(means) must match the number of channels') 224 | 225 | channels = tf.split(axis=2, num_or_size_splits=num_channels, value=image) 226 | for i in range(num_channels): 227 | channels[i] -= means[i] 228 | return tf.concat(axis=2, values=channels) 229 | 230 | 231 | def _smallest_size_at_least(height, width, smallest_side): 232 | """Computes new shape with the smallest side equal to `smallest_side`. 233 | 234 | Computes new shape with the smallest side equal to `smallest_side` while 235 | preserving the original aspect ratio. 236 | 237 | Args: 238 | height: an int32 scalar tensor indicating the current height. 239 | width: an int32 scalar tensor indicating the current width. 240 | smallest_side: A python integer or scalar `Tensor` indicating the size of 241 | the smallest side after resize. 242 | 243 | Returns: 244 | new_height: an int32 scalar tensor indicating the new height. 245 | new_width: and int32 scalar tensor indicating the new width. 246 | """ 247 | smallest_side = tf.convert_to_tensor(smallest_side, dtype=tf.int32) 248 | 249 | height = tf.to_float(height) 250 | width = tf.to_float(width) 251 | smallest_side = tf.to_float(smallest_side) 252 | 253 | scale = tf.cond(tf.greater(height, width), 254 | lambda: smallest_side / width, 255 | lambda: smallest_side / height) 256 | new_height = tf.to_int32(tf.rint(height * scale)) 257 | new_width = tf.to_int32(tf.rint(width * scale)) 258 | return new_height, new_width 259 | 260 | 261 | def _aspect_preserving_resize(image, smallest_side): 262 | """Resize images preserving the original aspect ratio. 263 | 264 | Args: 265 | image: A 3-D image `Tensor`. 266 | smallest_side: A python integer or scalar `Tensor` indicating the size of 267 | the smallest side after resize. 268 | 269 | Returns: 270 | resized_image: A 3-D tensor containing the resized image. 271 | """ 272 | smallest_side = tf.convert_to_tensor(smallest_side, dtype=tf.int32) 273 | 274 | shape = tf.shape(image) 275 | height = shape[0] 276 | width = shape[1] 277 | new_height, new_width = _smallest_size_at_least(height, width, smallest_side) 278 | image = tf.expand_dims(image, 0) 279 | resized_image = tf.image.resize_bilinear(image, [new_height, new_width], 280 | align_corners=False) 281 | resized_image = tf.squeeze(resized_image) 282 | resized_image.set_shape([None, None, 3]) 283 | return resized_image 284 | 285 | 286 | def preprocess_for_train(image, 287 | output_height, 288 | output_width, 289 | resize_side_min=_RESIZE_SIDE_MIN, 290 | resize_side_max=_RESIZE_SIDE_MAX): 291 | """Preprocesses the given image for training. 292 | 293 | Note that the actual resizing scale is sampled from 294 | [`resize_size_min`, `resize_size_max`]. 295 | 296 | Args: 297 | image: A `Tensor` representing an image of arbitrary size. 298 | output_height: The height of the image after preprocessing. 299 | output_width: The width of the image after preprocessing. 300 | resize_side_min: The lower bound for the smallest side of the image for 301 | aspect-preserving resizing. 
302 | resize_side_max: The upper bound for the smallest side of the image for 303 | aspect-preserving resizing. 304 | 305 | Returns: 306 | A preprocessed image. 307 | """ 308 | resize_side = tf.random_uniform( 309 | [], minval=resize_side_min, maxval=resize_side_max+1, dtype=tf.int32) 310 | 311 | image = _aspect_preserving_resize(image, resize_side) 312 | image = _random_crop([image], output_height, output_width)[0] 313 | image.set_shape([output_height, output_width, 3]) 314 | image = tf.to_float(image) 315 | image = tf.image.random_flip_left_right(image) 316 | return _mean_image_subtraction(image, [_R_MEAN, _G_MEAN, _B_MEAN]) 317 | 318 | 319 | def preprocess_for_eval(image, output_height, output_width, resize_side): 320 | """Preprocesses the given image for evaluation. 321 | 322 | Args: 323 | image: A `Tensor` representing an image of arbitrary size. 324 | output_height: The height of the image after preprocessing. 325 | output_width: The width of the image after preprocessing. 326 | resize_side: The smallest side of the image for aspect-preserving resizing. 327 | 328 | Returns: 329 | A preprocessed image. 330 | """ 331 | image = _aspect_preserving_resize(image, resize_side) 332 | image = _central_crop([image], output_height, output_width)[0] 333 | image.set_shape([output_height, output_width, 3]) 334 | image = tf.to_float(image) 335 | return _mean_image_subtraction(image, [_R_MEAN, _G_MEAN, _B_MEAN]) 336 | 337 | 338 | def preprocess_image(image, output_height, output_width, is_training=False, 339 | resize_side_min=_RESIZE_SIDE_MIN, 340 | resize_side_max=_RESIZE_SIDE_MAX): 341 | """Preprocesses the given image. 342 | 343 | Args: 344 | image: A `Tensor` representing an image of arbitrary size. 345 | output_height: The height of the image after preprocessing. 346 | output_width: The width of the image after preprocessing. 347 | is_training: `True` if we're preprocessing the image for training and 348 | `False` otherwise. 349 | resize_side_min: The lower bound for the smallest side of the image for 350 | aspect-preserving resizing. If `is_training` is `False`, then this value 351 | is used for rescaling. 352 | resize_side_max: The upper bound for the smallest side of the image for 353 | aspect-preserving resizing. If `is_training` is `False`, this value is 354 | ignored. Otherwise, the resize side is sampled from 355 | [resize_size_min, resize_size_max]. 356 | 357 | Returns: 358 | A preprocessed image. 
359 | """ 360 | if is_training: 361 | return preprocess_for_train(image, output_height, output_width, 362 | resize_side_min, resize_side_max) 363 | else: 364 | return preprocess_for_eval(image, output_height, output_width, 365 | resize_side_min) 366 | -------------------------------------------------------------------------------- /model/similarity.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | def chamfer_similarity(sim, max_axis=1, mean_axis=0): 5 | sim = tf.reduce_max(sim, axis=max_axis, keepdims=True) 6 | sim = tf.reduce_mean(sim, axis=mean_axis, keepdims=True) 7 | return tf.squeeze(sim, [max_axis, mean_axis]) 8 | 9 | 10 | def symmetric_chamfer_similarity(sim, axes=[0, 1]): 11 | return (chamfer_similarity(sim, axes[0], axes[1]) + 12 | chamfer_similarity(sim, axes[1], axes[0])) / 2 13 | 14 | 15 | def triplet_loss(sim_pos, sim_neg, gamma=0.5): 16 | with tf.variable_scope('triplet_loss'): 17 | return tf.maximum(0., sim_neg - sim_pos + gamma) 18 | 19 | 20 | def similarity_regularization_loss(sim, lower_limit=-1., upper_limit=1.): 21 | with tf.variable_scope('similarity_regularization_loss'): 22 | return tf.reduce_sum(tf.abs(tf.minimum(.0, sim - lower_limit))) + \ 23 | tf.reduce_sum(tf.abs(tf.maximum(.0, sim - upper_limit))) 24 | -------------------------------------------------------------------------------- /model/visil.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | 4 | from .layers import PCA_layer, Attention_layer, Video_Comparator 5 | from .similarity import chamfer_similarity, symmetric_chamfer_similarity 6 | 7 | 8 | class ViSiL(object): 9 | 10 | def __init__(self, model_dir, net='resnet', load_queries=False, 11 | dims=None, whitening=True, attention=True, video_comparator=True, 12 | queries_number=None, gpu_id=0, similarity_function='chamfer'): 13 | 14 | self.net = net 15 | if self.net not in ['resnet', 'i3d']: 16 | raise Exception('[ERROR] Not supported backbone network: {}. ' 17 | 'Supported options: resnet or i3d'.format(self.net)) 18 | 19 | self.load_queries = load_queries 20 | if whitening or dims is not None: 21 | self.PCA = PCA_layer(dims=dims, whitening=whitening, net=self.net) 22 | if attention: 23 | if whitening and dims is None: 24 | self.att = Attention_layer(shape=400 if self.net == 'i3d' else 3840) 25 | else: 26 | print('[WARNING] Attention layer has been deactivated. ' 27 | 'It works only with Whitening layer of {} dimensions. '. 28 | format(400 if self.net == 'i3d' else 3840)) 29 | if video_comparator: 30 | if whitening and attention and dims is None: 31 | self.vid_comp = Video_Comparator() 32 | else: 33 | print('[WARNING] Video comparator has been deactivated. ' 34 | 'It works only with Whitening layer of {} dimensions ' 35 | 'and Attention layer. '.format(400 if self.net == 'i3d' else 3840)) 36 | 37 | if similarity_function == 'chamfer': 38 | self.f2f_sim = lambda x: chamfer_similarity(x, max_axis=2, mean_axis=1) 39 | self.v2v_sim = lambda x: chamfer_similarity(x, max_axis=1, mean_axis=0) 40 | elif similarity_function == 'symmetric_chamfer': 41 | self.f2f_sim = lambda x: symmetric_chamfer_similarity(x, axes=[1, 2]) 42 | self.v2v_sim = lambda x: symmetric_chamfer_similarity(x, axes=[0, 1]) 43 | else: 44 | raise Exception('[ERROR] Not implemented similarity function: {}. 
' 45 | 'Supported options: chamfer or symmetric_chamfer'.format(similarity_function)) 46 | 47 | self.frames = tf.placeholder(tf.uint8, shape=(None, None, None, 3), name='input') 48 | with tf.device('/cpu:0'): 49 | if self.net == 'resnet': 50 | processed_frames = self.preprocess_resnet(self.frames) 51 | elif self.net == 'i3d': 52 | processed_frames = self.preprocess_i3d(self.frames) 53 | 54 | with tf.device('/gpu:%i' % gpu_id): 55 | self.region_vectors = self.extract_region_vectors(processed_frames) 56 | if self.load_queries: 57 | print('[INFO] Queries will be loaded to the gpu') 58 | self.queries = [tf.Variable(np.zeros((1, 9, 3840)), dtype=tf.float32, 59 | validate_shape=False) for _ in range(queries_number)] 60 | self.target = tf.placeholder(tf.float32, [None, None, None], name='target') 61 | self.similarities = [] 62 | for q in self.queries: 63 | sim_matrix = self.frame_to_frame_similarity(q, self.target) 64 | similarity = self.video_to_video_similarity(sim_matrix) 65 | self.similarities.append(similarity) 66 | else: 67 | print('[INFO] Queries will NOT be loaded to the gpu') 68 | self.query = tf.placeholder(tf.float32, [None, None, None], name='query') 69 | self.target = tf.placeholder(tf.float32, [None, None, None], name='target') 70 | self.sim_matrix = self.frame_to_frame_similarity(self.query, self.target) 71 | self.similarity = self.video_to_video_similarity(self.sim_matrix) 72 | 73 | init = self.load_model(model_dir) 74 | config = tf.ConfigProto(allow_soft_placement=True) 75 | config.gpu_options.allow_growth = True 76 | self.sess = tf.Session(config=config) 77 | self.sess.run(init) 78 | 79 | def preprocess_resnet(self, video): 80 | from .nets import vgg_preprocessing 81 | video = tf.map_fn(lambda x: vgg_preprocessing.preprocess_for_eval(x, 224, 224, 256), video, 82 | parallel_iterations=100, dtype=tf.float32, swap_memory=True) 83 | return video 84 | 85 | def preprocess_i3d(self, video): 86 | video = tf.image.resize_with_crop_or_pad(video, 224, 224) 87 | video = tf.image.convert_image_dtype(video, dtype=tf.float32) 88 | video = tf.multiply(tf.subtract(video, 0.5), 2.0) 89 | video = tf.expand_dims(video, 0) 90 | return video 91 | 92 | def region_pooling(self, video): 93 | if self.net == 'resnet': 94 | from .nets import resnet_v1 95 | with tf.contrib.slim.arg_scope(resnet_v1.resnet_arg_scope()): 96 | _, network = resnet_v1.resnet_v1_50(video, num_classes=None, is_training=False) 97 | 98 | layers = [['resnet_v1_50/block1', 8], ['resnet_v1_50/block2', 4], 99 | ['resnet_v1_50/block3', 2], ['resnet_v1_50/block4', 2]] 100 | 101 | with tf.variable_scope('region_vectors'): 102 | features = [] 103 | for l, p in layers: 104 | logits = tf.nn.relu(network[l]) 105 | logits = tf.layers.max_pooling2d(logits, [np.floor(p+p/2), np.floor(p+p/2)], p, padding='VALID') 106 | logits = tf.nn.l2_normalize(logits, -1, epsilon=1e-15) 107 | features.append(logits) 108 | logits = tf.concat(features, axis=-1) 109 | logits = tf.nn.l2_normalize(logits, -1, epsilon=1e-15) 110 | logits = tf.reshape(logits, [tf.shape(logits)[0], -1, tf.shape(logits)[-1]]) 111 | elif self.net == 'i3d': 112 | from .nets import i3d 113 | with tf.variable_scope('RGB'): 114 | model = i3d.InceptionI3d(400, spatial_squeeze=True, final_endpoint='Logits') 115 | logits, _ = model(video, is_training=False, dropout_keep_prob=1.0) 116 | 117 | with tf.variable_scope('region_vectors'): 118 | logits = tf.nn.l2_normalize(tf.nn.relu(logits), -1, epsilon=1e-15) 119 | 120 | return tf.expand_dims(logits, axis=1) if len(logits.shape) < 3 else logits 121 | 
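# Note: for the 'resnet' backbone, region_pooling() max-pools and concatenates the
# activations of ResNet-50 blocks 1-4 (256 + 512 + 1024 + 2048 = 3840 channels per
# region vector), which matches the (1, 9, 3840) query variables and the
# 3840-dimensional Attention_layer configured in __init__; for the 'i3d' backbone
# the 400-dimensional 'Logits' endpoint is used instead.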
122 | def extract_region_vectors(self, video): 123 | logits = self.region_pooling(video) 124 | if hasattr(self, 'PCA'): 125 | logits = tf.nn.l2_normalize(self.PCA(logits), -1, epsilon=1e-15) 126 | if hasattr(self, 'att'): 127 | logits, weights = self.att(logits) 128 | return logits 129 | 130 | def frame_to_frame_similarity(self, query, target): 131 | tensor_dot = tf.tensordot(query, tf.transpose(target), axes=1) 132 | sim_matrix = self.f2f_sim(tensor_dot) 133 | return sim_matrix 134 | 135 | def video_to_video_similarity(self, sim): 136 | if hasattr(self, 'vid_comp'): 137 | sim = self.vid_comp(sim) 138 | self.visil_output = sim 139 | sim = self.v2v_sim(sim) 140 | return sim 141 | 142 | def load_model(self, model_path): 143 | previous_variables = [var_name for var_name, _ in tf.contrib.framework.list_variables(model_path)] 144 | restore_map = {variable.op.name: variable for variable in tf.global_variables() 145 | if variable.op.name in previous_variables and 'PCA' not in variable.op.name} # 146 | print('[INFO] {} layers loaded'.format(len(restore_map))) 147 | tf.contrib.framework.init_from_checkpoint(model_path, restore_map) 148 | tf_init = tf.global_variables_initializer() 149 | return tf_init 150 | 151 | def extract_features(self, frames, batch_sz): 152 | features = [] 153 | for b in range(frames.shape[0] // batch_sz + 1): 154 | batch = frames[b * batch_sz: (b+1) * batch_sz] 155 | if batch.shape[0] > 0: 156 | if batch.shape[0] >= batch_sz or self.net == 'resnet': 157 | features.append(self.sess.run(self.region_vectors, feed_dict={self.frames: batch})) 158 | features = np.concatenate(features, axis=0) 159 | while features.shape[0] < 4: 160 | features = np.concatenate([features, features], axis=0) 161 | return features 162 | 163 | def set_queries(self, queries): 164 | if self.load_queries: 165 | self.sess.run([tf.assign(self.queries[i], queries[i], validate_shape=False) for i in range(len(queries))]) 166 | else: 167 | self.queries = queries 168 | 169 | def add_query(self, query): 170 | if self.load_queries: 171 | raise Exception('[ERROR] Operation not permitted when queries are loaded to GPU.') 172 | else: 173 | self.queries.append(query) 174 | 175 | def calculate_similarities_to_queries(self, target): 176 | if self.load_queries: 177 | return self.sess.run(self.similarities, feed_dict={self.target: target}) 178 | else: 179 | return [self.calculate_video_similarity(q, target) for q in self.queries] 180 | 181 | def calculate_video_similarity(self, query, target): 182 | return self.sess.run(self.similarity, feed_dict={self.query: query, self.target: target}) 183 | 184 | def calculate_f2f_matrix(self, query, target): 185 | return self.sess.run(self.sim_matrix, feed_dict={self.query: query, self.target: target}) 186 | 187 | def calculate_visil_output(self, query, target): 188 | return self.sess.run(self.visil_output, feed_dict={self.query: query, self.target: target}) 189 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy>=1.9 2 | tqdm>=4.2 3 | opencv-python>=3.1 -------------------------------------------------------------------------------- /video_similarity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MKLab-ITI/visil/0971e54fb8325fceb1bc9748ecbfe4c66e5dabd2/video_similarity.png --------------------------------------------------------------------------------
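The chamfer_similarity() and symmetric_chamfer_similarity() helpers in model/similarity.py reduce a pairwise similarity tensor with a max over one axis followed by a mean over the other. For intuition, here is a minimal NumPy sketch of the same reduction applied to a plain 2-D frame-to-frame similarity matrix; the 3x4 matrix and its values are made up purely for illustration.

import numpy as np

# Hypothetical frame-to-frame similarities between a 3-frame query (rows)
# and a 4-frame target (columns).
sim = np.array([[0.9, 0.2, 0.1, 0.3],
                [0.4, 0.8, 0.5, 0.2],
                [0.1, 0.3, 0.7, 0.6]])

# Chamfer similarity: best-matching target frame per query frame, averaged
# over query frames (mirrors chamfer_similarity(sim, max_axis=1, mean_axis=0)).
chamfer = sim.max(axis=1).mean()            # (0.9 + 0.8 + 0.7) / 3 = 0.8

# Symmetric variant: average the reduction in both directions
# (mirrors symmetric_chamfer_similarity(sim, axes=[0, 1])).
symmetric = (sim.max(axis=1).mean() + sim.max(axis=0).mean()) / 2   # 0.775

print(chamfer, symmetric)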
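A minimal usage sketch for the ViSiL class in model/visil.py, scoring one query video against one target video. The checkpoint path, frame shapes, and the random frames standing in for decoded videos are assumptions for illustration, not values taken from this repository.

import numpy as np
from model.visil import ViSiL

# Hypothetical inputs: decoded video frames as uint8 arrays of shape
# [num_frames, height, width, 3]; random data stands in for real videos here.
query_frames = np.random.randint(0, 255, (64, 256, 340, 3), dtype=np.uint8)
target_frames = np.random.randint(0, 255, (128, 256, 340, 3), dtype=np.uint8)

# Path to a pre-trained ViSiL checkpoint directory (assumed to exist locally).
model = ViSiL('ckpt/resnet/', net='resnet', load_queries=False)

# Extract region vectors for both videos, then score the pair.
query_features = model.extract_features(query_frames, batch_sz=32)
target_features = model.extract_features(target_frames, batch_sz=32)
score = model.calculate_video_similarity(query_features, target_features)
print('ViSiL similarity: {:.4f}'.format(score))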