├── .Rhistory
├── .gitignore
├── Papers
│   ├── img
│   │   ├── A1-1.png
│   │   ├── A1-2.png
│   │   ├── A1-3a.png
│   │   ├── A1-3b.png
│   │   ├── A1-4.png
│   │   ├── A1-5.png
│   │   ├── A1-6.png
│   │   ├── A1-7.png
│   │   ├── A1-8.png
│   │   ├── A2-1.png
│   │   ├── A2-2.png
│   │   ├── A2-3.png
│   │   ├── A2-4.png
│   │   ├── A2-5.png
│   │   ├── A2-6.png
│   │   ├── B1-1.png
│   │   ├── B1-2a.png
│   │   ├── B1-2b.png
│   │   ├── B1-3.png
│   │   ├── B1-4.png
│   │   ├── B1-5.png
│   │   ├── B1-6.png
│   │   ├── C1-1.png
│   │   ├── C1-2.png
│   │   ├── C1-3.png
│   │   ├── C1-4.png
│   │   ├── C1-5.png
│   │   ├── C1-6.png
│   │   └── C1-7.png
│   ├── PaperA1.md
│   ├── PaperA2.md
│   ├── PaperB1.md
│   └── PaperC1.md
├── LICENSE
└── readme.md
/.Rhistory:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | _site/
2 | .sass-cache/
3 | .jekyll-metadata
4 | .DS_Store
5 | Gemfile.lock
6 |
--------------------------------------------------------------------------------
/Papers/img/A1-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/A1-1.png
--------------------------------------------------------------------------------
/Papers/img/A1-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/A1-2.png
--------------------------------------------------------------------------------
/Papers/img/A1-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/A1-4.png
--------------------------------------------------------------------------------
/Papers/img/A1-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/A1-5.png
--------------------------------------------------------------------------------
/Papers/img/A1-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/A1-6.png
--------------------------------------------------------------------------------
/Papers/img/A1-7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/A1-7.png
--------------------------------------------------------------------------------
/Papers/img/A1-8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/A1-8.png
--------------------------------------------------------------------------------
/Papers/img/A2-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/A2-1.png
--------------------------------------------------------------------------------
/Papers/img/A2-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/A2-2.png
--------------------------------------------------------------------------------
/Papers/img/A2-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/A2-3.png
--------------------------------------------------------------------------------
/Papers/img/A2-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/A2-4.png
--------------------------------------------------------------------------------
/Papers/img/A2-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/A2-5.png
--------------------------------------------------------------------------------
/Papers/img/A2-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/A2-6.png
--------------------------------------------------------------------------------
/Papers/img/B1-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/B1-1.png
--------------------------------------------------------------------------------
/Papers/img/B1-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/B1-3.png
--------------------------------------------------------------------------------
/Papers/img/B1-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/B1-4.png
--------------------------------------------------------------------------------
/Papers/img/B1-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/B1-5.png
--------------------------------------------------------------------------------
/Papers/img/B1-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/B1-6.png
--------------------------------------------------------------------------------
/Papers/img/C1-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/C1-1.png
--------------------------------------------------------------------------------
/Papers/img/C1-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/C1-2.png
--------------------------------------------------------------------------------
/Papers/img/C1-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/C1-3.png
--------------------------------------------------------------------------------
/Papers/img/C1-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/C1-4.png
--------------------------------------------------------------------------------
/Papers/img/C1-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/C1-5.png
--------------------------------------------------------------------------------
/Papers/img/C1-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/C1-6.png
--------------------------------------------------------------------------------
/Papers/img/C1-7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/C1-7.png
--------------------------------------------------------------------------------
/Papers/img/A1-3a.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/A1-3a.png
--------------------------------------------------------------------------------
/Papers/img/A1-3b.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/A1-3b.png
--------------------------------------------------------------------------------
/Papers/img/B1-2a.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/B1-2a.png
--------------------------------------------------------------------------------
/Papers/img/B1-2b.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BardOfCodes/DRL_in_CV_Papers/HEAD/Papers/img/B1-2b.png
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 Aditya Ganeshan
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/Papers/PaperA2.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: A2
3 | type: paper
4 | post-link: Papers/PaperA2.html
5 | layout: post
6 | title: Hierarchical Object Detection with Deep Reinforcement Learning.
7 | category: object detection
8 | author: Bellver, Miriam, et al
9 | conference: NIPS 2016
10 | link: https://arxiv.org/abs/1611.03718
11 | ---
12 |
13 | #### Problem Statement
14 | The aim is to localize objects in scenes, a.k.a. Object Detection. The key idea is to focus on the parts of the image that contain richer information and to zoom in on them. An RL agent is trained such that, given an image window, it is capable of deciding where to focus attention among five different predefined region candidates (smaller windows). The agent's performance is evaluated on the Pascal VOC 2007 dataset.
15 |
16 | #### Proposed Model Outcome
17 | This is the proposed model, where the actions are non-overlapping in the first case and overlapping in the second.
18 |
19 |
20 |
21 | #### RL Components
22 | The problem has been modeled as a Markov Decision Process. Formally, the MDP has a set of actions A, a set of states S, and a reward function R.
23 | * Actions
24 | There are two types of possible actions: movement actions that imply a change in the current observed region, and the terminal action to indicate that the object is found and that the search has ended.
25 |
26 | * State Space
27 | The state is composed of the descriptor of the current region and a memory vector. The memory vector of the state captures the last 4 actions that the agent has already performed in the search for an object. As the agent is learning a refinement of a bounding box, a memory vector that encodes the state of this refinement procedure is useful to stabilize the search trajectories.
28 |
29 | * Rewards
30 | Let b be the box of an observable region, and g the ground truth box for a target object. The reward function $R_{a}(s,s')$ is granted to the agent when it chooses the action a to move from state s to $s'$. Each state s has an associated box b that contains the attended region. Then, the reward is as follows (a rough sketch is given after this list):
31 |
32 |
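The reward equations appear only as figures in the original write-up, so they are not reproduced above. As a rough reconstruction (the exact constants $\eta$ and $\tau$ used by Bellver et al. are not restated here and should be checked against the paper), the movement reward follows the sign of the change in IoU with the ground truth, and the terminal (trigger) reward is thresholded:

$$
R_a(s, s') = \operatorname{sign}\big(\mathrm{IoU}(b', g) - \mathrm{IoU}(b, g)\big), \qquad
R_t(s, s') =
\begin{cases}
+\eta & \text{if } \mathrm{IoU}(b, g) \ge \tau \\
-\eta & \text{otherwise}
\end{cases}
$$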
33 | #### Network Architecture
34 | They use a Q-network that takes as input the state representation discussed.
35 |
36 |
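The exact layer sizes are given in the paper; the snippet below is only a minimal sketch of a Q-network of this kind, assuming a fixed-size visual descriptor concatenated with a memory vector of the last 4 one-hot actions and 6 outputs (5 candidate subregions plus the terminal action). It is illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

N_ACTIONS = 6                          # 5 predefined subregions + 1 terminal action
DESC_DIM = 1024                        # assumed size of the visual descriptor
STATE_DIM = DESC_DIM + 4 * N_ACTIONS   # descriptor + memory of the last 4 actions

# Two fully connected hidden layers, one Q-value per action at the output.
q_network = nn.Sequential(
    nn.Linear(STATE_DIM, 1024),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(1024, N_ACTIONS),
)

state = torch.randn(1, STATE_DIM)      # dummy state vector for illustration
q_values = q_network(state)            # shape: (1, N_ACTIONS)
action = q_values.argmax(dim=1)        # greedy action selection at test time
```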
37 | #### Qualitative Results
38 |
39 |
40 | #### Why use Reinforcement Learning here?
41 | One gets curious as to why we should employ RL if the results are at most equivalent to SOTA. A histogram of the number of regions analyzed by the agent is shown. It is observed that most objects are already found with a single step, which means that the object occupies most of the image. With fewer than 3 steps, almost all detectable objects are approximated.
42 |
43 |
44 | Note: All images have been taken from the Paper.
45 |
46 |
--------------------------------------------------------------------------------
/Papers/PaperA1.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: A1
3 | type: paper
4 | post-link: Papers/PaperA1.html
5 | layout: post
6 | title: Active object localization with deep reinforcement learning.
7 | category: object detection
8 | author: Caicedo, Juan C., and Svetlana Lazebnik
9 | conference: ICCV 2015
10 | link: https://arxiv.org/abs/1511.06015
11 | ---
12 |
13 | #### Problem Statement
14 | The aim is to localize objects in scenes, a.k.a. Object Detection. To do so, a class-specific active detection model is presented. An RL agent is trained to deform a bounding box using simple transformation actions, with the goal of determining the most specific location of target objects by following top-down reasoning. The agent's performance is evaluated on the Pascal VOC 2007 dataset.
15 |
16 | #### Proposed Model Outcome
17 |
18 |
19 | #### RL Components
20 | The problem has been modeled as a Markov Decision Process. Formally, the MDP has a set of actions A, a set of states S, and a reward function R.
21 | * Actions
22 | The set of actions A is composed of eight transformations that can be applied to the box and one action to terminate the search process. The set of actions are as follows:
23 |
24 |
25 | * State Space
26 | The state representation is a tuple (o, h), where o is a feature vector of the observed region, and h is a vector with the history of taken actions.
27 |
28 | * Rewards
29 | Let b be the box of an observable region, and g the ground truth box for a target object. The reward function $R_{a}(s,s')$ is granted to the agent when it chooses the action a to move from state s to $s'$. Each state s has an associated box b that contains the attended region. Then, the reward is as follows (a rough sketch is given after this list):
30 |
31 | The trigger has a different reward scheme because it leads to a terminal state that does not change the box, and thus, the differential of IoU will always be zero for this action. The reward for the trigger is a thresholding function of IoU as follows:
32 |
33 |
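Both reward equations appear only as figures in the original write-up. As a rough sketch of the two schemes described above, the movement reward is the sign of the IoU improvement, and the trigger reward $R_\omega$ is thresholded (the constants $\eta = 3$ and $\tau = 0.6$ are believed to be the values used by Caicedo and Lazebnik, but treat them as assumptions and check the paper):

$$
R_a(s, s') = \operatorname{sign}\big(\mathrm{IoU}(b', g) - \mathrm{IoU}(b, g)\big), \qquad
R_\omega(s, s') =
\begin{cases}
+\eta & \text{if } \mathrm{IoU}(b, g) \ge \tau \\
-\eta & \text{otherwise}
\end{cases}
$$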
34 | #### Network Architecture
35 | They use a Q-network that takes as input the state representation discussed.
36 |
37 |
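The Q-network is trained with deep Q-learning using experience replay and an ε-greedy policy. The snippet below is only a sketch of the Bellman-target update for a sampled batch, not the authors' training code; the discount factor, optimizer, and replay-buffer details are assumptions.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.9  # assumed discount factor

def dqn_update(q_network, optimizer, batch):
    """One deep Q-learning step on a batch of (state, action, reward, next_state, done)."""
    state, action, reward, next_state, done = batch

    # Q-value of the action actually taken in each sampled state.
    q_taken = q_network(state).gather(1, action.unsqueeze(1)).squeeze(1)

    # Bellman target: reward plus discounted value of the best next action
    # (zero for terminal transitions, i.e. after the trigger action).
    with torch.no_grad():
        next_best = q_network(next_state).max(dim=1).values
        target = reward + GAMMA * next_best * (1.0 - done)

    loss = F.mse_loss(q_taken, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```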
38 | #### Qualitative Results
39 |
40 |
41 |
42 | #### Quantitative Results
43 | Comparison with other SOTA methods.
44 |
45 |
46 | #### Why use Reinforcement Learning here?
47 | One gets curious as to why we should employ RL if the results are at most equivalent to SOTA. It is shown that agents guided by the proposed model are able to localize a single instance of an object after analyzing only between 11 and 25 regions in an image, and they obtain the best detection results among systems that do not use object proposals for object localization.
48 |
49 | The figure here shows the distribution of detections explicitly marked by the agent as a function of the number of actions required to reach the object. For each action, one region in the image needs to be processed. Most detections can be obtained with about 11 actions only.
50 |
51 | Note: All images have been taken from the Paper.
52 |
53 |
--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
1 | # Deep Reinforcement Learning in Computer Vision Papers
2 |
3 | In recent years, while the use of Computer Vision techniques/models for solving Reinforcement Learning tasks (such as games) has burgeoned, the opposite flow, of using techniques/models from Reinforcement Learning to solve problems in Computer Vision, has also been seen.
4 |
5 | The goal is to understand this penetration of RL into many of the applications of Computer Vision through research publications in leading conferences.
6 |
7 | ## Index of Papers
8 |
9 | ### [A] Object Detection
10 | 1). Caicedo, Juan C., and Svetlana Lazebnik. "Active object localization with deep reinforcement learning." Proceedings of the IEEE International Conference on Computer Vision. 2015.
11 | 2). Bellver, Miriam, et al. "Hierarchical object detection with deep reinforcement learning." arXiv preprint arXiv:1611.03718 (2016).
12 |
13 | ### [B] Action Detection
14 | 1). Huang, Jingjia, et al. "A Self-Adaptive Proposal Model for Temporal Action Detection based on Reinforcement Learning." arXiv preprint arXiv:1706.07251 (2017).
15 | 2). Yeung, Serena, et al. "End-to-end learning of action detection from frame glimpses in videos." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
16 |
17 | ### [C] Visual Tracking
18 | 1). Yun, Sangdoo, Jongwon Choi, Youngjoon Yoo, Kimin Yun, and Jin Young Choi. "Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
19 | 2). Zhang, Da, Hamid Maei, Xin Wang, and Yuan-Fang Wang. "Deep Reinforcement Learning for Visual Object Tracking in Videos." arXiv preprint arXiv:1701.08936 (2017).
20 | 3). Xiang, Yu, Alexandre Alahi, and Silvio Savarese. "Learning to track: Online multi-object tracking by decision making." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4705-4713. 2015.
21 |
22 | ### [D] Pose-Estimation and View-Planning Problem
23 | 1). Krull, Alexander, et al. "PoseAgent: Budget-Constrained 6D Object Pose Estimation via Reinforcement Learning." arXiv preprint arXiv:1612.03779 (2016).
24 | 2). Kaba, Mustafa Devrim, Mustafa Gokhan Uzunbas, and Ser Nam Lim. "A Reinforcement Learning Approach to the View Planning Problem." arXiv preprint arXiv:1610.06204 (2016).
25 |
26 | ### [E] Natural Language Problems: Dialog Generation
27 | 1). Jason D. Williams, Kavosh Asadi, Geoffrey Zweig. Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. ACL 2017.
28 | 2). Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, Li Deng. End-to-End Reinforcement Learning of Dialogue Agents for Information Access. arXiv:1609.00777.
29 | 3). Jason D. Williams, Geoffrey Zweig. End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning. arXiv:1606.01269.
30 |
31 | ### [F] Natural Language Problems: Information Extraction
32 | 1). Karthik Narasimhan, Adam Yala, Regina Barzilay. Improving Information Extraction by Acquiring External Evidence with Reinforcement Learning. EMNLP 2016.
33 | 2). Karthik Narasimhan, Tejas Kulkarni and Regina Barzilay. Language Understanding for Text-based Games using Deep Reinforcement Learning. EMNLP 2015.
34 | 3). S.R.K. Branavan, H. Chen, L. Zettlemoyer and R. Barzilay. Reinforcement Learning for Mapping Instructions to Actions. ACL 2009.
35 |
36 | ### [G] Image Captioning
37 | 1). Ren, Zhou, Xiaoyu Wang, Ning Zhang, Xutao Lv, and Li-Jia Li. "Deep Reinforcement Learning-based Image Captioning with Embedding Reward." arXiv preprint arXiv:1704.03899 (2017).
38 |
39 |
40 |
--------------------------------------------------------------------------------
/Papers/PaperC1.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: C1
3 | type: paper
4 | post-link: Papers/PaperC1.html
5 | layout: post
6 | title: Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning.
7 | category: visual tracking
8 | author: Yun, Sangdoo, Jongwon Choi, Youngjoon Yoo, Kimin Yun, and Jin Young Choi
9 | conference: CVPR 2017
10 | link: https://www.researchgate.net/publication/319164402_Action-Decision_Networks_for_Visual_Tracking_with_Deep_Reinforcement_Learning
11 | ---
12 |
13 | #### Problem Statement
14 | This paper proposes a novel tracker which is controlled by sequentially pursuing actions learned by deep reinforcement learning. In contrast to existing trackers using deep networks, the proposed tracker is designed to achieve light computation as well as satisfactory tracking accuracy in both location and scale. The deep network that controls the actions is pre-trained on various training sequences and fine-tuned during tracking for online adaptation to target and background changes. Through evaluation on the OTB dataset, the proposed tracker is validated to achieve competitive performance while being three times faster than state-of-the-art, deep network–based trackers.
15 |
16 | #### Proposed Model Outcome
17 |
18 |
19 | #### RL Components
20 | The problem has been modeled as a Markov Decision Process. Formally, the MDP has a set of actions A, a set of states S, and a reward function R.
21 | * Actions
22 | The action space A consists of eleven types of actions, including translation moves, scale changes, and a stopping action. The translation moves
23 | include the four directional moves {left, right, up, down} together with their two-times-larger counterparts. The scale changes are of two types, {scale up, scale down}, which maintain the aspect ratio of the tracking target. Each action is encoded as an 11-dimensional one-hot vector.
24 |
25 |
26 | * State Space
27 | The state is defined as a tuple (p, d), where p denotes the image patch within the bounding box (simply called the "patch" in the following)
28 | and d represents the dynamics of actions, denoted by a vector (called the "action dynamics vector" in the following) containing the previous k actions.
29 |
30 | * Rewards
31 | The reward function is defined as r(s), since the agent obtains the reward from the state s regardless of the action a. The reward $r(s_{t})$ remains zero while the MDP iterates within a frame. At the termination step T, i.e. when $a_T$ is the 'stop' action, $r(s_T)$ is assigned as follows (a rough sketch is given after this list):
32 |
33 |
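The terminal-reward equation is shown only as a figure in the original write-up. As a rough sketch, it is a thresholded IoU check between the final tracked box $b_T$ and the ground truth $g$; the threshold $\tau$ is the fixed IoU threshold chosen in the paper and is an assumption here:

$$
r(s_T) =
\begin{cases}
+1 & \text{if } \mathrm{IoU}(b_T, g) \ge \tau \\
-1 & \text{otherwise}
\end{cases}
$$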
34 | #### Network Architecture
35 |
36 |
37 | #### Qualitative Results
38 | Some qualitative results are as follows.
39 |
40 |
41 |
42 | #### Quantitative Results
43 | Comparison with other SOTA methods.
44 |
45 |
46 | #### Why use Reinforcement Learning here?
47 | Reinforcement learning makes the use of partially labeled data possible, which could greatly benefit practical applications. According to the
48 | evaluation results, the proposed tracker achieves state-of-the-art performance at 3 fps, which is three times faster than existing deep network-based trackers adopting a tracking-by-detection strategy. Furthermore, the fast version of the proposed tracker achieves real-time speed (15 fps) with an accuracy that outperforms state-of-the-art real-time trackers.
49 |
50 | Note: All images have been taken from the Paper.
51 |
52 |
--------------------------------------------------------------------------------
/Papers/PaperB1.md:
--------------------------------------------------------------------------------
1 | ---
2 | id: B1
3 | type: paper
4 | post-link: Papers/PaperB1.html
5 | layout: post
6 | title: A Self-Adaptive Proposal Model for Temporal Action Detection based on Reinforcement Learning.
7 | category: action detection
8 | author: Jingjia Huang, Nannan Li, Tao Zhang, Ge Li
9 | conference: arXiv
10 | link: https://arxiv.org/abs/1706.07251
11 | ---
12 |
13 | #### Problem Statement
14 | The aim of the paper is to perform the Action Detection task, i.e. to determine whether an action occurs somewhere in a video, and also when it occurs. The paper proposes a class-specific action detection model that learns to continuously adjust the current region to cover the groundtruth more precisely in a self-adaptive way. This is achieved by applying a sequence of transformations to a temporal window that is initially placed in the video at random and finally finds and covers as much of the action region as possible. The sequence of transformations is decided by an agent that analyzes the content of the currently attended region and selects the next best action according to a learned policy, which is trained via reinforcement learning based on the Deep Q-Learning algorithm. The results are computed on the THUMOS'14 dataset.
15 |
16 | #### Proposed Model Outcome
17 |
18 |
19 | #### RL Components
20 | The problem has been modeled as a Markov Decision Process. Formally, the MDP has a set of actions A, a set of states S, and a reward function R.
21 | * Actions
22 | The set of actions A can be divided into two categories: one group for transformations of the temporal window, such as "move left", "move right" and "expand left", and the remaining one for terminating the search, "trigger". The transformation group includes regular actions that comprise translation and scaling, plus one irregular action. The regular actions vary the current window in terms of position or time span around the attended region, such as "move left", "expand left" or "shrink", and are adopted by the agent to increase the intersection with the
23 | groundtruth that overlaps with the current window. The irregular action, namely "jump", translates the window to a new position away from the current site, to prevent the agent from getting trapped around the present location when there is no motion occurring nearby.
24 |
25 |
26 | * State Space
27 | The state of the MDP is the concatenation of two components: the representation of the current window and the history of taken actions. To describe the motion within the current window, the feature extracted from the C3D CNN model, which is pretrained on Sports-1M and finetuned on UCF-101, is utilized as the representation.
28 |
29 | * Rewards
30 | The reward function R(s, a) provides feedback to the agent when it performs action a at the current state s: it rewards actions that improve the motion-localization accuracy and penalizes actions that reduce it. The quality of the localization is evaluated via the simple yet indicative measure of Intersection over Union (IoU) between the current attended temporal window and the groundtruth.
31 |
32 | The trigger has a different reward scheme because it leads to a terminal state that does not change the window, and thus the differential of IoU will always be zero for this action. The reward for the trigger is a thresholding function of the IoU (a rough sketch is given after this list):
33 |
34 |
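The reward equations appear only as figures in the original write-up. As a rough sketch consistent with the description above (the constants $\eta$ and $\tau$ used by Huang et al. are assumptions and should be checked against the paper), the regular-action reward follows the sign of the IoU change and the trigger reward is thresholded:

$$
R(s, a) = \operatorname{sign}\big(\mathrm{IoU}(w', g) - \mathrm{IoU}(w, g)\big), \qquad
R_{\text{trigger}}(s) =
\begin{cases}
+\eta & \text{if } \mathrm{IoU}(w, g) \ge \tau \\
-\eta & \text{otherwise}
\end{cases}
$$

Here $w$ and $w'$ denote the temporal windows before and after the action, and $g$ the groundtruth segment.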
35 | #### Regressor Network
36 | Inspired by Fast R-CNN, where a regression network is incorporated to revise the position deviation between the predicted result and the groundtruth, a regression model is also introduced to refine the motion proposals. The regression channel accepts a 4096-dimensional feature vector as input and outputs two coordinate offsets, one for the starting moment and one for the end moment. Unlike spatial bounding-box regression, in which coordinate scaling is needed due to varying camera-projection perspectives, the original temporal coordinates (i.e. frame
37 | numbers) are used directly for the offset calculation, leveraging the unified frame rate among the video clips in the experiments. A Q-network that takes the state representation discussed above as input is again used to select actions (a sketch of the regression channel is given below).
38 |
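The following is a minimal sketch of the kind of regression channel described above, not the authors' implementation: a single linear layer maps the 4096-dimensional window feature to two frame-number offsets, which are added to the proposal's start and end frames.

```python
import torch
import torch.nn as nn

# Regression channel: 4096-d window feature -> (start offset, end offset) in frames.
temporal_regressor = nn.Linear(4096, 2)

def refine_proposal(feature: torch.Tensor, start_frame: float, end_frame: float):
    """Apply the predicted frame-number offsets to a temporal proposal."""
    offsets = temporal_regressor(feature)          # shape: (2,)
    refined_start = start_frame + offsets[0].item()
    refined_end = end_frame + offsets[1].item()
    return refined_start, refined_end

# Dummy usage: a random 4096-d feature for a proposal spanning frames 120-300.
feature = torch.randn(4096)
print(refine_proposal(feature, 120, 300))
```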
39 | #### Qualitative Results
40 | Some qualitative results are as follows.
41 |
42 |
43 | #### Quantitative Results
44 | Comparison with other SOTA methods.
45 |
46 |
47 |
48 | #### Why use Reinforcement Learning here?
49 | One gets curious as to why we should employ RL if the results are at most equivalent to SOTA. It is shown that agents guided by the proposed model are able to localize a single instance of an action after analyzing only a few frames of a video. The run-time behaviour of the method depends on the DQN's performance: a well-trained DQN agent will concentrate on the ground truth within a couple of steps once it perceives the action segment, and it can also accelerate the exploration of the video with the "jump" action. Besides, the choice of the scalar α is an important factor that influences the run-time performance. A large α makes the agent take only a brief glance over the video in most cases, but also results in coarse proposals. As a trade-off, α = 0.2 is used during the training and testing phases.
50 |
51 | Note: All images have been taken from the Paper.
52 |
53 |
--------------------------------------------------------------------------------