├── images
    ├── svms.png
    ├── combining_regions.png
    ├── feature_extractor.png
    ├── object_detection_pic.png
    └── Fast_R_CNN_architecture.png
├── README.md
├── Fast_R_CNN.md
├── Faster_R_CNN.md
└── R_CNN.md


/images/svms.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hatemr/image-analysis/master/images/svms.png

--------------------------------------------------------------------------------
/images/combining_regions.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hatemr/image-analysis/master/images/combining_regions.png

--------------------------------------------------------------------------------
/images/feature_extractor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hatemr/image-analysis/master/images/feature_extractor.png

--------------------------------------------------------------------------------
/images/object_detection_pic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hatemr/image-analysis/master/images/object_detection_pic.png

--------------------------------------------------------------------------------
/images/Fast_R_CNN_architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hatemr/image-analysis/master/images/Fast_R_CNN_architecture.png

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# image-analysis
These are my notes on object detection algorithms and other image-related models.

# Object Detection algorithms

1. R-CNN [[notes]](R_CNN.md)
2. Fast R-CNN [[notes]](Fast_R_CNN.md)
3.
Faster R-CNN and YOLO [[notes]](Faster_R_CNN.md)

--------------------------------------------------------------------------------
/Fast_R_CNN.md:
--------------------------------------------------------------------------------
These notes describe the Fast R-CNN model for object detection. These are my notes from many online sources, referenced at the bottom. Thank you to their authors for their awesome explanations!

# Fast R-CNN
Fast R-CNN improves over its predecessor, R-CNN, in both training and inference speed by creating an end-to-end model which can be trained with back-propagation.

![object detection pic](images/object_detection_pic.png)

__Task__: Locate objects in an image (object detection)

__Problems with R-CNN__:
1. __Slow__: You had to calculate a feature vector for each of the proposal regions (about 2000 per image).
2. __Hard to train__: R-CNN had three parts (CNN, SVM, and Bounding Box Regressor, though the last was optional), making training difficult.
3. __Large memory requirement__: You had to save the feature vector (4096-dim) for every region proposal. This takes a lot of memory.
4. __Selective search algorithm doesn't learn anything__: It's a fixed algorithm, so it can't learn and improve the model.

## Enter Fast R-CNN
To make it run faster, the author combined all three parts into one architecture, which can be trained end-to-end.

1. Turn the whole image into a feature map, using a CNN.
2. For each region proposal, identify the corresponding region of the feature map and pool it into a fixed-size grid using an RoI (Region of Interest) pooling layer.
3. Flatten this fixed-size _region proposal feature map_. Now it's a feature vector, with fixed size.
4. Feed the RoI feature vector into a fully connected neural network that has two outputs:
    1. __Softmax classification__, to decide which object class we found.
    2.
__Bounding Box Regressor__, which outputs the bounding-box coordinates for each object class.

![Fast R CNN architecture](images/Fast_R_CNN_architecture.png)

## Results
Fast R-CNN trains the VGG16 network 9 times faster than R-CNN. Inference is 213 times faster, and it achieves a higher mAP!

## Reason
Fast R-CNN is faster than R-CNN because you don't have to feed each of the 2000 region proposals through the CNN. Instead, the whole image is fed into the CNN only once, and every proposal reuses the resulting feature map.

## Summary
The important thing is that Fast R-CNN is one end-to-end system that we can train with back-propagation. Training and inference are both much faster than with R-CNN.


## References
* [Part 2: Fast R-CNN](https://towardsdatascience.com/part-2-fast-r-cnn-object-detection-7303e1988464)
* [R-CNN, Fast R-CNN, Faster R-CNN, YOLO — Object Detection Algorithms](https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e)
* __Fast R-CNN paper__: Fast R-CNN ([https://arxiv.org/pdf/1504.08083.pdf](https://arxiv.org/pdf/1504.08083.pdf))

--------------------------------------------------------------------------------
/Faster_R_CNN.md:
--------------------------------------------------------------------------------
These notes describe the Faster R-CNN and YOLO models for object detection. These are my notes from many online sources, referenced at the bottom. Thank you to their authors for their awesome explanations!

# Faster R-CNN
Faster R-CNN improves over its predecessor, Fast R-CNN, in speed by replacing the slow region proposal step with a network that learns the proposals, giving an end-to-end model which can be trained with back-propagation.

![object detection pic](images/object_detection_pic.png)

__Task__: Locate objects in an image (object detection)

__Problem with Fast R-CNN__:
* Selective search for finding the proposal regions is slow.
So the authors designed an object detection algorithm that eliminates selective search and lets the network learn the region proposals.

## Enter Faster R-CNN
Similar to Fast R-CNN:
1. The image is fed into a CNN, which outputs a convolutional feature map.

Different:

2. Instead of using selective search to identify the region proposals, a separate network (the Region Proposal Network) predicts the region proposals from the feature map.
3. The predicted region proposals are reshaped using an RoI pooling layer.
4. They are then fed into a classifier to classify the image within the proposed region and predict the offset values for the bounding boxes (same as Fast R-CNN).

## YOLO - You Only Look Once
All of the R-CNN algorithms use regions to localize the object within the image. The network does not look at the complete image. The YOLO algorithm is much different. In YOLO, a single convolutional network predicts the bounding boxes and the class probabilities for these boxes.

1. Split the image into an S-by-S grid. Within each grid cell, we take m bounding boxes (chosen how?)
2. For each bounding box, the network outputs a class probability and offset values for the bounding box.
3. Bounding boxes with a class probability above a threshold value are selected and used to locate the object within the image.

YOLO is __much__ faster. But it struggles with small objects within the image, like a flock of birds, due to spatial constraints of the algorithm.

## YOLOv3
1. Image divided into


Predicts an objectness score for each bounding box using logistic regression. This should be 1 if the bounding-box prior overlaps a ground truth object by more than any other bounding-box prior.
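
As a rough sketch of the objectness rule above (my own illustration, not code from the YOLOv3 paper): give the target 1 to whichever prior overlaps the ground-truth box most, and 0 to the rest. The prior sizes, the ground-truth size, and the helper names `shape_iou` and `objectness_targets` are made up for this example, and I assume the priors are compared to the ground-truth box by shape only, as if they shared the same center.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share the same center, given as (width, height)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def objectness_targets(priors, truth_wh):
    """Return one 0/1 target per prior: 1 only for the best-overlapping prior."""
    overlaps = [shape_iou(prior, truth_wh) for prior in priors]
    best = max(range(len(priors)), key=overlaps.__getitem__)
    return [1 if i == best else 0 for i in range(len(priors))]


# Hypothetical prior shapes (width, height) and one ground-truth box shape.
priors = [(10, 13), (30, 61), (116, 90)]
print(objectness_targets(priors, truth_wh=(35, 55)))
```

Here the middle prior is closest in shape to the ground-truth box, so it is the one that should be trained towards an objectness of 1.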

## References
* [R-CNN, Fast R-CNN, Faster R-CNN, YOLO — Object Detection Algorithms](https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e)
* __Faster R-CNN paper__: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks ([https://arxiv.org/pdf/1506.01497.pdf](https://arxiv.org/pdf/1506.01497.pdf))
* __You Only Look Once paper__: You Only Look Once: Unified, Real-Time Object Detection ([https://arxiv.org/pdf/1506.02640v5.pdf](https://arxiv.org/pdf/1506.02640v5.pdf))
* __YOLOv3 paper__: YOLOv3: An Incremental Improvement ([https://pjreddie.com/media/files/papers/YOLOv3.pdf](https://pjreddie.com/media/files/papers/YOLOv3.pdf))
* [Understanding YOLO](https://hackernoon.com/understanding-yolo-f5a74bbc7967)

--------------------------------------------------------------------------------
/R_CNN.md:
--------------------------------------------------------------------------------

# R-CNN
These notes describe the R-CNN model for object detection. These are my notes from many online sources, referenced at the bottom. Thank you to their authors for their awesome explanations!

![object detection pic](images/object_detection_pic.png)

__Task__: Locate objects in an image (object detection)



__Approach__: Slide different-sized rectangles across the image, and check each window for an object.
* Problem: gives too many windows
* Solution: Check only a good subset of the possible regions (proposals). Need a good algorithm to choose the region proposals.

## Region proposals
R-CNN is agnostic to the region proposal method; you can use any algorithm you like (one common algorithm is the selective search algorithm).

This creates about 2000 proposal regions; still a lot, but far fewer than the brute-force sliding window approach.
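
To get a feel for why the sliding-window approach gives too many windows, here is a small counting sketch. It is only my own illustration: the image size, window sizes, stride, and the helper name `count_windows` are all assumptions chosen for the example.

```python
def count_windows(image_w, image_h, window_sizes, stride):
    """Count sliding-window positions summed over all window sizes."""
    total = 0
    for win_w, win_h in window_sizes:
        if win_w > image_w or win_h > image_h:
            continue  # this window does not fit inside the image
        cols = (image_w - win_w) // stride + 1
        rows = (image_h - win_h) // stride + 1
        total += cols * rows
    return total


# Hypothetical setup: a 500x375 image and only five window shapes.
sizes = [(64, 64), (128, 128), (128, 256), (256, 128), (256, 256)]
print(count_windows(500, 375, sizes, stride=8))  # coarse stride
print(count_windows(500, 375, sizes, stride=1))  # dense stride
```

Even with a handful of window shapes, the count grows quickly as the stride shrinks or more shapes are added, which is why a fixed set of proposals is much cheaper to process.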

## CNN (AlexNet)
We represent each proposal region as a feature vector in a lower-dimensional space using a CNN. The authors use AlexNet as the feature extractor, giving a 4096-dimensional vector.

To do this, AlexNet was trained on a classification task. After training, they removed the last softmax layer, leaving the 4096-dimensional layer as the output.

#### Remember:
The proposal regions will be different shapes, many of them small. So we need to __resize__ every region proposal into a same-sized square before feeding it into the CNN (AlexNet).

![feature extractor](/images/feature_extractor.png)

## SVM
Now we have regions as 4096-dimensional feature vectors, and we need to classify whether each region contains the object or not. For classification, we use SVMs.

We use one SVM for each object class. That means we have n outputs (ranging from 0 to 1) for each region, where n is the number of possible objects we want to detect. The outputs represent our confidence that the object is present.

Note that we must get the feature vectors from the CNN (AlexNet) before training the SVMs. Then, we train the SVMs on the 4096-dim inputs as a typical supervised learning problem.

![svms.png](/images/svms.png)

## Bounding Boxes
The model also predicts four _offset_ values. These _offset values_ help re-center the bounding box (e.g. if the region contains a face, but the face has been cut in half, then the offset values shift the region to capture more of the face).

## Summary so far
1. Create region proposals
2. Create a feature vector for each proposal using the CNN
3. Classify each proposal with the object-specific SVMs

## Combine proposal regions
For each object class independently, combine regions that overlap and take the higher score of the two (or more) overlapping regions.

Then, keep only regions which have a score higher than 0.5.
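
The combining step can be sketched as a greedy, per-class pass over the scored regions. This is a minimal illustration under my own assumptions, not the paper's code: boxes are `(x1, y1, x2, y2)` tuples, two regions count as overlapping when their intersection-over-union exceeds `overlap_thresh` (a value I picked for the example), and the names `iou` and `combine_regions` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def combine_regions(boxes, scores, overlap_thresh=0.5, score_thresh=0.5):
    """For one class: keep the best-scoring box of each overlapping group,
    then keep only the survivors whose score is above score_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep box i only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[j]) <= overlap_thresh for j in kept):
            kept.append(i)
    return [i for i in kept if scores[i] > score_thresh]


# Made-up regions and scores for a single object class.
boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 260, 260), (300, 50, 350, 100)]
scores = [0.9, 0.8, 0.7, 0.3]
print(combine_regions(boxes, scores))  # indices of the regions that survive
```

The first two boxes overlap heavily, so only the higher-scoring one survives; the last box is dropped by the 0.5 score cut-off. You would run this once per object class.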

![combine proposal region](images/combining_regions.png)

## Final note
It was useful to pre-train the CNN using a large amount of data, then tune the network for the actual task (object detection).

## References
* [Part 1: R-CNN (Object Detection)](https://towardsdatascience.com/r-cnn-3a9beddfd55a)
* __R-CNN paper__: Rich feature hierarchies for accurate object detection and semantic segmentation ([https://arxiv.org/pdf/1311.2524.pdf](https://arxiv.org/pdf/1311.2524.pdf))

--------------------------------------------------------------------------------