├── Applications ├── Image Localization │ └── README.md ├── Instance Segmentation │ ├── MaskRCNN │ │ └── README.md │ ├── PANet │ │ └── README.md │ └── README.md ├── Object Classification │ └── README.md ├── Object Detection │ ├── HOG │ │ └── README.md │ ├── Metrics │ │ └── README.md │ ├── RCNN │ │ ├── Fast-RCNN │ │ │ └── README.md │ │ ├── Faster-RCNN │ │ │ └── README.md │ │ ├── R-CNN │ │ │ └── README.md │ │ ├── README.md │ │ ├── Region Proposal Network │ │ │ └── README.md │ │ └── Selective Search │ │ │ └── README.md │ ├── README.md │ ├── Retinanet │ │ └── README.md │ ├── SPP_net │ │ └── README.md │ ├── SSD │ │ └── README.md │ └── YOLO │ │ ├── README.md │ │ ├── YOLO V1 │ │ └── README.md │ │ ├── YOLO V2 and YOLO 9000 │ │ └── README.md │ │ ├── YOLO V3 │ │ └── README.md │ │ └── YOLO v4 and v5 │ │ └── README.md ├── Object Recognition │ └── README.md └── Semantic Segmentation │ ├── DeepLab Family │ └── README.md │ ├── DenseNet based FCN │ └── README.md │ ├── Fully Convolutional Network │ └── README.md │ ├── MaskRCNN │ └── README.md │ ├── PSPNet and Fast FCN │ └── README.md │ ├── README.md │ └── U-Net and Variants │ └── README.md ├── Computer vision in Python ├── OpenCV │ └── README.md ├── Other Libraries │ └── README.md ├── README.md └── Tensorflow and Keras │ └── README.md ├── Convolutional Neural Netwoks ├── Components and Working │ └── README.md ├── MNIST_dataset__prediction.ipynb ├── README.md └── Types of Convolutions │ └── README.md ├── Fundamentals ├── Data Augmentation │ └── README.md ├── Digital Image │ └── README.md ├── Edge Detectors │ ├── Edge detecting Filters.ipynb │ └── README.md ├── Filters │ ├── Blurring Kernels.ipynb │ ├── README.md │ └── Sharpening Kernels.ipynb ├── Noisy image handling │ └── README.md └── Operations │ ├── Global Operations │ └── README.md │ ├── Local Operations │ └── README.md │ ├── Pointwise Operations │ ├── Image_transformation_4.ipynb │ ├── Image_transformations_3.ipynb │ ├── README.md │ ├── image_transformation_1.ipynb │ └── image_transformation_2.ipynb │ └── README.md ├── Generative Networks ├── Generative Adversarial Networks │ └── README.md └── README.md ├── Image Feature Extractors ├── Feature Pyramid Network │ ├── Feature_pyramid_network.ipynb │ └── README.md └── README.md ├── LICENSE ├── README.md ├── Real life Implementations ├── 3D model generation │ └── README.md ├── Image based search │ └── README.md ├── Intelligent Automobiles │ └── README.md ├── Medical Imaging │ └── README.md ├── Optical Character Recognition │ └── README.md ├── Recognition and Biometrics │ └── README.md └── Survelliance │ └── README.md ├── Special Networks ├── README.md └── Siamese Network │ └── README.md ├── Transfer Learning ├── DenseNet │ └── README.md ├── EfficientNet │ └── README.md ├── Inception │ └── README.md ├── Mobile_net │ └── README.md ├── README.md ├── Resnet │ └── README.md ├── Transfer_Learning_v1.ipynb ├── Transfer_Learning_v2.ipynb ├── Transfer_Learning_v3.ipynb ├── VGG │ └── README.md └── Xception │ └── README.md ├── Unsupervised Learning └── README.md └── Vision.png /Applications/Image Localization/README.md: -------------------------------------------------------------------------------- 1 | ### Object Localization. 2 | 3 | The computer is tasked to provide a locality that gives an idea about where the object is actually present in the image alongside the classification task. The algorithms used for image locaization are same as those used for object detections. 4 | 5 | Please refer to object detection. 
6 | -------------------------------------------------------------------------------- /Applications/Instance Segmentation/MaskRCNN/README.md: -------------------------------------------------------------------------------- 1 | ### MASK-RCNN 2 | 3 | Mask-RCNN is a extended version of Faster-RCNN designed in order to achieve instance segmentation. It was proposed by Kaiming He from Facebook AI Research in the paper "Mask R-CNN". The authors added a third mask prediction branch to the pre-existing Class and bounding box predictors in faster RCNN. The authors have used a FCN or fully convolutional networks on the obtained feature maps after the ROI pooling layers, in order to learn the mask predictions. 4 | 5 | ![Arch](https://miro.medium.com/proxy/1*IWWOPIYLqqF9i_gXPmBk3g.png) 6 | 7 | The above image shows the official architecture of Mask RCNN. Similar to Faster RCNN, Mask RCNN also uses Region Proposal Network (RPN) in order to propose the region of Interests that may contain objects. After obtaining the ROI proposals, in faster RCNN, the region proposals are sent through ROI-Pooling layers, in order to transform the varying sized region proposal into fixed sized feature maps, for predictions. The Mask-RCNN authors replaced the ROI pooling layers into ROI-Align layers. The results of the ROIAlign layer are sent to the heads for detection. 8 | 9 | #### Back-bone and Head Structure details: 10 | 11 | ![Back_Head](https://miro.medium.com/max/2286/1*IV6q-xtvxdLTg1h9Jk5lGg.png) 12 | 13 | Faster RCNN uses ResNet as a the back-bone, while the Mask RCNN uses a total of 3 back-bones and compare their corresponding performances. The head structures also changes with the architectural back bone. The above image shows the backbones and their head structure. The proposed backbones are ResNet, ResNeXt, and ResNet with FPN. If we do not use FPN the authors have used two layers of convolutions before splitting else if FPN is used, the network is directly split into two heads. One head for classification and bounding box and the second head for the mask. 14 | 15 | ![Over all architecture](https://miro.medium.com/max/533/1*Tm5ia-bzsVq3lw8TJt08YA.png) 16 | 17 | The overall architecture of Mask-RCNN. 18 | 19 | #### ROI-Align layer 20 | 21 | As we have seen in case of Faster R-CNN, we use ROI-pooling to represent the varying sized ROIs in a fixed size. The ROI pooling layer divides the ROIs into fixed number of divisions, and it picks the max value from each division, so we obtain a fixed size resultant feature maps. Now, for this conversion from a h x w image to H x W image, we take h/H x w/W to be a division. But, h/H and w/W may not be integers, so we need to floor the value down, as the index can't be float values. So, every time we do this we lose some data. This process is called Quantization. In Faster RCNN, this happens twice, once when the ROIs are mapped on the actual images, and secondly in the ROI pooling layers. This does not cause much trouble in object detection, but in case of instance segmentation, this causes a significant misalignment, due to the pixelwise predictions and loss of information. To prevent the loss of information, the Mask-RCNN introduced ROI-Align. The authors used bilinear interpolations to create 4 points with interpolated values for each divisions and avoid quantizations. The max of these 4 values is taken as the representing value for the division. 22 | 23 | #### Loss 24 | 25 | The Loss function used is again an extension of the loss function used in faster-RCNN. 
The Faster RCNN contains two components in its loss function, while the Mask RCNN adds the third component for the Mask predictions. 26 | 27 | ![loss](https://images1.programmersought.com/60/12/12c7259bbd34305df9f55905957b7174.png) 28 | 29 | The Loss Lcls and Lbox is same as faster-RCNN. Lmask is responsible for training the masks. For K classes, M\*M size feature map is taken, and we obtain pixel wise predictions, so total K\*M\*M predictions for k classes. Sigmoid activation function is used in the last layer of the network, for predictions, and an average binary cross-entropy is used as loss function. The binary cross-entropy is calculated for all the K classes seperately and then their average is taken. If the class label is the i-th class during training, then only the loss of the i-th mask output is calculated as Lmask, and the other K-1 mask outputs contribute 0 to Lmask. 30 | 31 | References: 32 | 33 | 1. https://arxiv.org/pdf/1703.06870.pdf 34 | 2. https://medium.com/analytics-vidhya/review-mask-r-cnn-instance-segmentation-human-pose-estimation-61080a93bf4#:~:text=Mask%20R%2DCNN%20is%20one,based%20on%20Mask%20R%2DCNN. 35 | 3. https://www.programmersought.com/article/61614366875/ 36 | 37 | ROI Align: 38 | 1. https://towardsdatascience.com/understanding-region-of-interest-part-2-roi-align-and-roi-warp-f795196fc193 39 | 2. https://towardsdatascience.com/understanding-region-of-interest-part-1-roi-pooling-e4f5dd65bb44 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | -------------------------------------------------------------------------------- /Applications/Instance Segmentation/PANet/README.md: -------------------------------------------------------------------------------- 1 | ### Path Aggregation Network (PANet) 2 | 3 | PANet was first proposed by The Chinese University of Hong Kong, Peking University, SenseTime Research, and YouTu Lab, Tencent through the paper "Path Aggregation Network for Instance Segmentation". PANet has been used in several areas like in YOLO v4 for object detection. The authors introduced three chief ideas: 4 | 5 | 1. Bottom-up path augmentation for shortening the information path between the lower and higher feature map layers. 6 | 2. Adaptive feature pooling to obtain the best features after pooling from any feature level 7 | 3. Fully connected fusion layer to train the model parameters and get the spatial information in a better way at the same time to improve the mask prediction. 8 | 9 | The authors extended their work on the basic principals of Feature Pyramid Network or FPN proposed initially by Facebook AI Research (FAIR). 10 | 11 | #### Architecture 12 | 13 | ![arch](https://miro.medium.com/max/2778/1*dyFC1OPnAZGTxnaQzNrCrw.png) 14 | 15 | The above image represents the archiecture proposed by the authors. The (a) portion of the above image, shows the Feature Pyramid network. The (b) part of the image shows the augmented bottom-up path which serves as the shortcut connection. The authors used ResNets as the backbone network of the FPN. Now, as we know in FPN, we obtain a several feature maps, of different sizes based on the level in the pyramid network. We apply convolutions on the feature maps P2 to P5 which gives us the learnable features in case of normal FPN. The authors argue that as there can be more than 100 layers in the backbone network and so the complexity of the path makes it impossible for the lower level features to maintain the finer details as we move to the layers of the top down path, which is depicted by the red dotted line. 
The authors propose a augmented bottom-up path with clean lateral connection so that the inital low level finer features are not faded. As we can see and the authors pointed out, there will hardly be 10 layers between the shortcut P2 and the high level feature map N5, which is depicted by the green arrow. 16 | 17 | {N2, N3, N4, N5} denote the newly generated feature maps corresponding to {P2, P3, P4, P5}. N2 is simply P2 without any processing. Each building block takes a higher resolution feature map Ni and a coarser map Pi+1 through lateral connection and generates the new feature map Ni+1. Each feature map Ni first goes through a 3×3 convolutional layer with stride 2 to reduce the spatial size.Then each element of feature map Pi+1 and the down-sampled map are added through lateral connection. The fused feature map is then processed by another 3×3 convolutional layer to generate Ni+1 for following sub-networks, as mentioned by the paper. 18 | 19 | ![connect](https://miro.medium.com/max/946/1*5_6lv-DjlBREha7ZSXsQPw.png) 20 | 21 | #### Adaptive Feature Pooling 22 | 23 | Normally, for pyramidal feature extraction, where we extract different features from different levels, we mostly consider that the smaller object and finer local details come from the lower level feature maps, and the larger objects are mostly detected from the upper layers or the smaller feature maps. The authors proved this theory may not be correct. The correlations between the feature size and the feature map level may penalize the performance. So, the authors wanted to let the model choose the best fit feature level for a particular object, using an element wise sum or max fusion operations. This part is represented in (c) portion. 24 | 25 | First, for each proposal, we map them to different feature levels, as denoted by dark grey regions in the first figure (b). ROIAlign is used to pool feature grids from each level. Then a fusion operation (element-wise max or sum) is utilized to fuse feature grids from different levels. 26 | 27 | #### Fully connected fusion 28 | 29 | ![fusion](https://miro.medium.com/max/1482/1*ofrdo6nQNC74OiAHW03_jw.png) 30 | 31 | It was observed by the authors that using fully connected layers helped to learn the features in a much better way compared to FCN with sigmoid or softmax activations, as they are location sensetive since there are differrnt parameters corresponding to different locations in the input, while FCN preserves the spatial structure. PANet uses information from both of these layers to provide a more diverse view for the network while making decision as shown in the above diagram. 32 | 33 | References: 34 | 35 | 1. https://arxiv.org/pdf/1803.01534.pdf 36 | 2. https://becominghuman.ai/reading-panet-path-aggregation-network-1st-place-in-coco-2017-challenge-instance-segmentation-fe4c985cad1b 37 | 3. https://medium.com/analytics-vidhya/improving-instance-segmentation-using-path-aggregation-network-a89588f3d630 38 | 39 | 40 | 41 | 42 | 43 | 44 | -------------------------------------------------------------------------------- /Applications/Instance Segmentation/README.md: -------------------------------------------------------------------------------- 1 | ### Instance Segmentaton 2 | 3 | In Instance segmentation, the computer separately labels each instance of a class alongside the pixel-wise classification. 
4 | 5 | ![Image](https://miro.medium.com/max/3072/1*c2oWLUCnyMGsC3ySbd6QNQ.jpeg) 6 | -------------------------------------------------------------------------------- /Applications/Object Classification/README.md: -------------------------------------------------------------------------------- 1 | ### Image Classification 2 | 3 | Image Classification is one of the fundamental tasks of computer vision. In these types of tasks, the computer is expected to provide a label to images. The task is the most basic challenges in the computer vision field. 4 | 5 | Image Classification task typically needs a fine-grained feature extraction, and a much larger global context, to generate the features. The task can be solved by both traditional Machine Learning algorithms or Black Box deep learning algorithms. 6 | 7 | For ML algorithms like KNN, SVM and ensemble based learrning, we need to use the traditional Image feature Extractors we talked about in the Feature Extraction portion, 8 | 9 | Currently, in most cases, Deep Learning networks are used, owing to the fact that they extract the best features automatically and set their filters accordingly. 10 | 11 | In order to use DNN for Image classification, we can either build a custom modelling or can use any of the discussed pretrained models discussed using transfer learning. 12 | 13 | We can also use deep feature exrtractors like Feature Pyramid Networks and tune the model according to the task at hand to obtain good features and thus improve the predictive results. 14 | -------------------------------------------------------------------------------- /Applications/Object Detection/HOG/README.md: -------------------------------------------------------------------------------- 1 | ### Histogram of Gradients (HOG) 2 | 3 | Histogram of Gradients or HOG is one of the most used traditional feature descriptors, for the object detection tasks. 4 | 5 | #### Gradients: 6 | Gradients are the rate of change of pixel values. Now, image is a 2D function f(x,y). So, to find the rate of change of pixel values, we need to calculate the derivative for f(x,y), so we will need to calculate the partial derivative of the function, wrt x and y. The derivative can be calculated using laplacian. 7 | 8 | ∇f(x,y)= \[gx gy] = \[∂f/∂x ∂f/∂y] = \[f(x+1,y)−f(x−1,y) f(x,y+1)−f(x,y−1)] 9 | 10 | where, 11 | 12 | ![loc](https://lilianweng.github.io/lil-log/assets/images/image-gradient-vector-pixel-location.png) 13 | 14 | It is given as the above as image is a discrete function on saving on a digital device. 15 | 16 | Magnitude of the gradient = Sqrt ( gx^2 + gy^2) 17 | Direction of the gradient = arctan ( gy / gx ) 18 | 19 | Now, in the original scenerio, it is much easier to use the given kernels in order to calculate the gradients. 1st filter can be convoluted to get ∂f/∂x and 2nd to get ∂f/∂y. 20 | 21 | ![grad](https://learnopencv.com/wp-content/uploads/2016/11/gradient-kernels.jpg) 22 | 23 | They can also be calculated using standard filters like Sobel and Prewitt filters. These filters can also be used to find gradients. 24 | 25 | ![HOG](https://learnopencv.com/wp-content/uploads/2016/12/hog-cell-gradients.png) 26 | 27 | The gradients are a very good indicator of boundaries and sharp edges. The magnitude of gradient sparks if we cross a boundary. Otherwise, gradients removes the parts where the gradient is almost constant. 
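To make the gradient computation above concrete, here is a minimal sketch (not part of the original write-up) that computes the per-pixel gradient magnitude and direction of a grayscale image using OpenCV's Sobel operator with the simple \[-1, 0, 1] kernels; the file name `image.jpg` is only a placeholder.

```python
import cv2
import numpy as np

# Read the image as grayscale; "image.jpg" is a placeholder path.
img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Horizontal and vertical gradients (gx = df/dx, gy = df/dy).
# ksize=1 uses the plain [-1, 0, 1] kernels shown above, with no smoothing.
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=1)
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=1)

# Magnitude and direction of the gradient at every pixel.
magnitude = np.sqrt(gx ** 2 + gy ** 2)
direction = np.degrees(np.arctan2(gy, gx))  # in (-180, 180]

print(magnitude.shape, direction.shape)
```

For the full descriptor described next, libraries such as scikit-image also expose a ready-made `skimage.feature.hog` function that performs the histogram binning and block normalization steps.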
28 | 29 | #### HOG 30 | 31 | In order to create HOG, we take the entire image, and correspondingly the matrix of magnitude of gradients and the matrix of gradient directions. Next we divide the image into several 8x8 pixel cells In each cell, the magnitude values of these 64 cells are binned and cumulatively added into 9 buckets of unsigned direction. 32 | 33 | For better robustness, if the direction of the gradient vector of a pixel lays between two buckets, its magnitude does not all go into the closer one but proportionally split between two. For example, if a pixel’s gradient vector has magnitude 8 and degree 15, it is between two buckets for degree 0 and 20 and we would assign 2 to bucket 0 and 6 to bucket 20. 34 | 35 | ![dist](https://lilianweng.github.io/lil-log/assets/images/HOG-histogram-creation.png) 36 | 37 | Using the above method HOG is found. 38 | 39 | Next, we create 2x2 window of such 8x8 pixel cells. So, each window contains 4 such 8x8 pixel cells. We have previously seen, from 8x8 pixel cells, we create 9x1 Histogram vector. So, from the 2x2 window we can concatenate the 4 9x1 vector, to create a 36x1 vector. This is done in order to normalize. Normally, if we use thw values without normalization, it would be very effective to light, which we do not want. Normalizing a larger vector of 36x1 proves to be more robust than a normalizing a 9x1 vector, so we adapt the sliding window. 40 | 41 | Now, initially, the method was performed for human detection, so the authors focused on the image patches of size 64 x 128, and then conducted the whole experiment. 42 | 43 | So, considering the size of the patch, we calculate how many 16x16 blocks can be there. If there are m patches horizontally and n patches vertically, for the final vector we send the concatenated vector of size 36x1 from all the windows. The total size resulting to: (m x n x 36 , 1). 44 | 45 | This vector we send to predictive algorithms like SVM, for object detection. 46 | 47 | Now, this method can be applied to any patch of aspect ratio 1:2 like 100 x 200, 105 x 210 and so on, which can be resized to 64 x 128. As far as the size 8x8 cell goes, it is due to the initial problem setting. The authors found that for a image of dimension 64 x 128, 8x8 size cells was based suited for detecting features like faces, etc. 48 | 49 | 50 | References: 51 | 52 | 1. http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf 53 | 2. https://learnopencv.com/histogram-of-oriented-gradients/ 54 | 3. https://lilianweng.github.io/lil-log/2017/10/29/object-recognition-for-dummies-part-1.html 55 | 56 | 57 | 58 | -------------------------------------------------------------------------------- /Applications/Object Detection/Metrics/README.md: -------------------------------------------------------------------------------- 1 | ### Metrics 2 | 3 | In order to evaluate object detection algorithm, we use a different set of metrics. Normal, metrics of Accuracy, Precision and Recall won't work in this scenaerios as we basically need to evaluate both the classification and the localization capacity of the model. 4 | 5 | #### IOU 6 | 7 | The primary concept we want to know, is the idea of **IOU** or **Intersection Over Union**. It denotes the overlap between the predicted bounding box and the actual bounding box around the object in the image. 
So, it deals with how correctly our model has localized the object. 8 | 9 | It is given by: 10 | 11 | IOU = (Intersection of the predicted and actual bounding boxes) / (Union of the predicted and actual bounding boxes) 12 | 13 | ![IOU](https://miro.medium.com/max/683/1*6B58Ohs9t7sRjYISbYZs-Q.png) 14 | 15 | The above image depicts the IOU in three cases. 16 | 17 | #### Confidence Score 18 | 19 | For every bounding box predicted by the object detection model, the model provides a confidence score. The confidence score depicts the confidence of the model on the fact that an object is present in the corresponding bounding box. This confidence score is used in **Non-Max Suppression**, i.e., when an object is predicted by multiple bounding boxes, the box with the highest confidence score is kept and the other bounding boxes are removed. 20 | 21 | #### False Positive, True Positive and False Negative 22 | 23 | True Positive is where the IOU between the predicted bounding box and the actual bounding box is above a given threshold T. So, if the threshold is 0.5 and the IOU is 0.6, we consider it a True Positive. 24 | 25 | False Positive is where the IOU between the predicted bounding box and the actual bounding box is below a given threshold T. So, if the threshold is 0.5 and the IOU is 0.4, we consider it a False Positive. 26 | 27 | False Negative is the condition where there is an object with a ground-truth bounding box but the model fails to detect it. 28 | 29 | True Negative covers all the other pixels which do not contain an object and for which our model made no prediction either, so it is insignificant in this scenario. 30 | 31 | #### Precision and Recall 32 | 33 | Precision depicts how many of the objects detected by the model are actually present in the ground truth. 34 | 35 | Given by: TP / (TP + FP) 36 | 37 | Recall depicts how many of the objects actually present in the ground truth are detected by the model. 38 | 39 | Given by: TP / (TP + FN) 40 | 41 | #### Precision-Recall Curve 42 | 43 | Precision and Recall always have a tradeoff between themselves. During binary classification, the predictive model actually outputs a confidence score for the object belonging to each class. For example, say for an object, a model predicts 0.55 confidence that the object belongs to class 1 and 0.45 that it belongs to class 0. Here, we consider a threshold *alpha*, normally 0.5. If the confidence score is above the threshold, the prediction is taken as class 1, else class 0. So, if we lower the value of the threshold, more objects will be predicted as class 1 and False Negatives will decrease, but the model will start classifying objects as class 1 with lower confidence, so False Positives will increase, and vice versa. 44 | 45 | So, if we sweep the value of alpha we will generate a Precision-Recall curve, which is often a very good estimator of model performance, as we need our model to maintain good precision even as we increase the recall. 46 | 47 | #### Average Precision 48 | 49 | Average Precision is given by the area under the Precision-Recall curve. We just compute the total area under the curve using an integration. Given by: 50 | 51 | ![Formula](https://miro.medium.com/max/672/1*bs56Iu0mzjAP-QAM4XAAhw.png) 52 | 53 | 54 | Now, often, the precision-recall curve is zig-zag and is not monotonically decreasing, due to the stochastic nature of black-box models, but in order to evaluate the area under the curve, we want to make the curve monotonically decreasing as shown in the image. For this, we can use two methods.
55 | 56 | ![zz](https://miro.medium.com/max/443/1*7oGMy1fbcWw_jQttX0MNdw.png) 57 | 58 | 1. All points interpolation 59 | 2. 11 point interpolation. 60 | 61 | Normally, we use the 11 point interpolation, In this point, we pick 11 equally spaced recall values, namely, 0.0, 0.1,.....1.0. We consider the precision against these points. If suddenly precision drops, we consider the maximum of the values on the right hand, to keep the curve monotonically non-increasing. We just take the Average of the precision of the 11 corresponding recall values. 62 | 63 | ![11](https://miro.medium.com/max/463/1*4lx2r2HUx0wRbQOjWMcAZA.png) 64 | 65 | #### Mean Average Precision 66 | 67 | Often, we do not detect only one type of object using one model. We detect multiple objects. For every object we get an Average Precision, the Mean Average Precision (mAP) is the mean of AP for all the objects. For models, which detect only one object, mAP=AP. 68 | 69 | #### Variations of mAP: 70 | 71 | Several Papers used several version of mAP. Some use 0.5 mAP with 0.5 as threshold for IOU. Similarly, some use 0.75 mAP, others 0.95 mAP. Others have also used Large mAP, i.e, the performance is measured based only on larger objects. Similarly, we have samll mAP. 72 | 73 | References: 74 | 75 | 1. https://github.com/rafaelpadilla/Object-Detection-Metrics 76 | 2. https://blog.zenggyu.com/en/post/2018-12-16/an-introduction-to-evaluation-metrics-for-object-detection/ 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | -------------------------------------------------------------------------------- /Applications/Object Detection/RCNN/Fast-RCNN/README.md: -------------------------------------------------------------------------------- 1 | ### Fast R-CNN 2 | 3 | R-CNN algorithms had a number of issues: 4 | 1. It had to propose 2000 candidates using selective search and train the neural network seperately on the regions. and then generate feature maps seperately, so it was very time consuming 5 | 2. The R-CNN algorithm needed to train three different algorithms CNN for obtaining the feature maps, SVM for the classification and a Regression model for finding the bounding boxes. 6 | 7 | The author of R-CNN, Ross Girshick propoposed, a new and modified network in order to solve the above problems with the R-CNN network. 8 | 9 | #### Architecture: 10 | 11 | ![arch](https://lilianweng.github.io/lil-log/assets/images/fast-RCNN.png) 12 | 13 | The above image shows the architecture of the Fast R-CNN algorithm. This algorithm also uses selective search algorithms alike R-CNN (discussed earlier). The Fast R-CNN consists of a CNN (usually pre-trained on the ImageNet classification task) with its final pooling layer replaced by an “ROI pooling” layer and its final FC layer is replaced by two branches — a (K + 1) category softmax layer branch and a category-specific bounding box regression branch. 14 | 15 | #### Algorithm: 16 | 17 | 1. In the initial step, the image is passed through a basic pre-trained backbone network like VGG with a stride of 16 just like the R-CNN network. The initial size of the image is 1000 x 600 as was in the case of R-CNN also. The last layer of the backbone is 512. So, the size of the feature map is 1000/16 x 600/16 x 512 = 60 x 40 x 512. 18 | 2. Similar to the RCNN, the projection of the Region of Interests from the actual image to the feature maps is done by projecting the centre of the actual image to a pixel on the feature map. 19 | 3. 
Next, similar to RCNN, we use a Selective Search algorithm to find the target region of interest, but this time on the feature maps. The selective search algorithms generate about 2000 ROI or target regions. 20 | 4. Now, to feed the Connected layers, the ROI feature maps must be equal in size. To solve this problem, we use a specially designed max pooling layer called the ROI pooling layer. This layer was introduced as a modification from the SPP-Net. It works by dividing the whole image into a N number of sub-division or parts, and we use max-pooling in each of the parts, so, we get N values, one from each part. The size of the output is therefore equal to the number of parts we want to make, irrespective of our original input size. If we want to obtain a final image of dimension H x W, we just create H x W divisions. If our input image is of dimension h x w, and we need to convert it to H x W, each of the parts need to be of size (h/H x w/W), and we apply Max pooling on each part, finally obtaining a HxW image. In this H and W equals to 7. So, as output of ROI pooling layer we obtain an output of dimension of 7x7x512. 21 | 22 | ![ROI](https://miro.medium.com/max/362/1*qhN4EKjJO1hxKfpADOkcaw.png) 23 | 24 | 5. We pass the feature map from the ROI pooling layer into a set of fully connected layers, after flattening. Each of the FC has 4096 nodes. 25 | 6. Then finally the output from the FC layers are fed to a classification softmax layer with N+1 nodes, the extra node is for background class and a Regression branch for the bounding boxes output. 26 | 27 | ![Final](https://miro.medium.com/max/770/1*jYDMaYeH-TrcoofDqCdxug.jpeg) 28 | 29 | #### Training 30 | 31 | For training mini batch is used. The batch consists of 2 images having 64 regions each. 25% of the ROIs are object proposals that have at least 0.5 IoU with a ground-truth bounding box of a foreground class. These would be positive for that particular class and would be labelled with the appropriate u = 1 to K. The remaining ROIs are sampled from proposals that have an IoU with groundtruth bounding boxes between \[0.1, 0.5). These ROIs are labelled as belonging to the class u = 0 (background class). 32 | 33 | 34 | #### Loss Function 35 | 36 | The loss function used here is a multitask loss given by: 37 | 38 | ![loss](https://miro.medium.com/max/481/1*R9hDmwuDDnP_LVIUWsuhsQ.png) 39 | 40 | The loss is very similar to RCNN. The first part Lcls is the classification Log Loss for each class and the Second part is a Smooth L1 regression loss as given by the equaltion. The second part is only applicable for the object classes, i.e, from u>0, as u=0 is the background class, the loss is not applicable. 41 | 42 | References: 43 | 44 | 1. https://arxiv.org/pdf/1504.08083.pdf 45 | 2. https://towardsdatascience.com/fast-r-cnn-for-object-detection-a-technical-summary-a0ff94faa022 46 | 3. https://medium.datadriveninvestor.com/review-on-fast-rcnn-202c9eadd23b 47 | 4. https://deepsense.ai/region-of-interest-pooling-explained/ 48 | 49 | 50 | 51 | 52 | -------------------------------------------------------------------------------- /Applications/Object Detection/RCNN/Faster-RCNN/README.md: -------------------------------------------------------------------------------- 1 | ### Faster RCNN 2 | 3 | The Fast RCNN worked fine, but in real time environments, we needed a faster algorithm, which can detect objects very quickly. The speed of the Fast RCNN was capped due to selective search algorithm which served as the botteleneck. 
The Faster RCNN was proposed by Shaoqing Ren and Ross Girshick, where the authors replaced the Selective Search algorithm with a network called RPN or Region Proposal Network. They let the network learn to find out the best target regions which may contain the objects in the image. 4 | 5 | #### Architecture 6 | 7 | ![Arch](https://miro.medium.com/max/770/1*pSnVmJCyQIRKHDPt3cfnXA.png) 8 | 9 | The algorithm introduces the Region Proposal Network which is discussed earlier. The RPN selects the target regions which are passed on to the ROI pooling layers, where the classifiers and the bounding box regressors predict the class of the images and the bounding boxes. 10 | 11 | #### Algorithm 12 | 13 | The image is passed through the convolutional pre-trained backbone and the feature maps are obtained. The image dimensions and the backbone network is same as Fast RCNN. The RPN gives the proposals as previously discussed which are sent to the RoI pooling layer. The operations, loss and training further are similar to that of Fast RCNN. 14 | 15 | #### Full Structure 16 | 17 | ![Struct](https://miro.medium.com/max/770/1*tTqg3W165itg-LVRFxHJfA.jpeg) 18 | 19 | References: 20 | 21 | 1. https://arxiv.org/pdf/1506.01497.pdf 22 | 2. https://towardsdatascience.com/faster-r-cnn-for-object-detection-a-technical-summary-474c5b857b46 23 | 3. https://tryolabs.com/blog/2018/01/18/faster-r-cnn-down-the-rabbit-hole-of-modern-object-detection/ 24 | 25 | -------------------------------------------------------------------------------- /Applications/Object Detection/RCNN/R-CNN/README.md: -------------------------------------------------------------------------------- 1 | ### Regional Convolutional Neural Networks (R-CNN) 2 | 3 | R-CNN was one of the first major approaches to solve the problem of object detecttion. The approach was first proposed by Ross Girshick of UC Berkeley in the paper titled "Rich feature hierarchies for accurate object detection and semantic segmentation". We previouly had image classification algorithms which can handle only one object at a time, but as we have already seen in case of object detection, we need to detect multiple objects in an image, we needed something which can give us the regions where objects may be found and then pass these regions to get the object classifications. 4 | 5 | To solve the above mentioned problem Girshick proposed R-CNN. 6 | 7 | #### Architecture 8 | 9 | ![arch](https://lilianweng.github.io/lil-log/assets/images/RCNN.png) 10 | 11 | The above image depicts the proposed architecture. The architecture proposes the use of a selective search algorithm for the proposal of the target regions that may contain objects in the image. 12 | 13 | #### Algorithm: 14 | 15 | 1. Initially, A selective search algorithm is applied on the input image. Selective search algorithm has been previously discussed. 16 | 2. The Selective search algorithm provides 2000 region proposals for every image. 17 | 3. Then we use a pre-trained Convolutional Neural Network on these 2000 regions to obtain the feature maps for the classifying Network. 18 | 4. The pre-trained nework is "AlexNet". The pre-trained network is fine tuned further to best fit the domain of the dataset. 19 | 5. A final classification softmax layer with (N+1) classification nodes, N for each target class and 1 for the background. 20 | 6. As we know CNN needs a fixed shape, so the 2000 extracted regions are warped in a fixed size and sent as input to the network for training. 21 | 7. 
For the target labels, every region, which has an IOU overlap of 0.5 with the actual ground truth is taken as positive sample, else taken as negative sample. 22 | 8. For more robust training strategies like **Non-Max Suppression** and **Hard Negative Mining** have been used along with the use of a low learning rate and mini batches. 23 | 9. After training the model, the classification softmax layers are removed, in order to generate just the feature maps. 24 | 10. After generating the feature maps, the feature maps are flattened to generate 4096-d vectors, which are sent to N seperate binary SVMs, one for each class. 25 | 11. For training the SVMs, the proposed regions with IOU >=0.3 are taken as positive samples. 26 | 12. To reduce the localization errors, a regression model is trained to correct the predicted detection window on bounding box correction offset using CNN features. 27 | 28 | ![Model](https://miro.medium.com/max/612/1*NX5yYTi-eQjP0pMWs3UbUg.png) 29 | 30 | The bounding box coordinates are predicted by an regression model. So, its training is done using some custom multitask loss function given by: 31 | 32 | ![Loss](https://miro.medium.com/max/2142/1*PUJocoWsZIHZltbQeXsqaw.png) 33 | 34 | The first equation shows how the predicted and the ground truth box are predicted. (px,py) give the coordinates to the center of the box and (pw,ph) give width and hieght. The equation 2 and 3 show the transformations between the Ground truth to prediction and vice versa transformations. 35 | 36 | Fourth equation is the actual regression loss functtion, which is a MSE or least square error, with a regularization term (L1) to prevent the weights w from attaining very large velue and overfit. The importance factor *lambda* is selected using cross-validation. For the ground truth, in case of regression, we consider boxes with IOU overlap >= 0.6. 37 | 38 | 39 | References: 40 | 41 | 1. https://arxiv.org/pdf/1311.2524.pdf 42 | 2. https://towardsdatascience.com/r-cnn-for-object-detection-a-technical-summary-9e7bfa8a557c 43 | 44 | 45 | -------------------------------------------------------------------------------- /Applications/Object Detection/RCNN/README.md: -------------------------------------------------------------------------------- 1 | ### The RCNN Family 2 | 3 | The RCNN family consists of 4 algorithms, namely: RCNN, Fast RCNN, Faster RCNN, and Mask RCNN. They are classifiction based 2 step object detection processes. 4 | 5 | RCNN families has some dependencies like Selective Search Algorithm in RCNN and Fast RCNN and Region Proposal Algorithm for Faster RCNN and Mask RCNN. The dependencies are seperately discussed. 6 | -------------------------------------------------------------------------------- /Applications/Object Detection/RCNN/Region Proposal Network/README.md: -------------------------------------------------------------------------------- 1 | ### Region Proposal Network 2 | 3 | In the RCNN and fast RCNN networks, Selective Search algorithm was used to obtain region proposals. But, the algorithms were very slow as it implemented a segmentation underneath. So, in the faster RCNN networks, the authors replaced the Selective Search algorithm with a much faster RPN network. 4 | 5 | The RPN itself is a neural network based architecture which is responsible for obtaining the target portions of the image which may contain object. These portions of the images are sent to the classifier in order to classify the objects in the image. 
The algorithm is much more effective and faster than the sliding window and Selective Search algorithms. 6 | 7 | #### Architecture and Working 8 | 9 | ![Arch](https://miro.medium.com/max/770/1*S_-8lv4zP3W8IVfGP6_MHw.jpeg) 10 | 11 | The diagram above shows the architecture of the RPN network. 12 | 13 | It initially passes the input image through a basic backbone block like the VGG networks, which identifies the basic features of the image. The backbone network generates a feature map of the image for the RPN network to identify the targets. The backbone network stride is kept equal to 16, i.e., if two pixels are 16 pixels apart in the main image, those 16 pixels will be represented by 2 corresponding pixels on the feature map. 14 | 15 | After the backbone layer, we obtain an M x N sized feature map. The center of each area of the original image is represented by the coordinates of a pixel on this feature map. Now, one element of concern is that the objects in the image can be of any size and shape. To solve this, the idea of anchor boxes was introduced. 16 | 17 | ![Anchor](https://miro.medium.com/max/401/1*muU7TGhE7ke1nqnHQbbdKg.jpeg) 18 | 19 | In order to detect the objects of different sizes and shapes we use bounding boxes (called anchor boxes, which actually give the area of interest) of different sizes and aspect ratios. The authors used a total of 3 different sizes, and for each size, a total of 3 different aspect ratios, which gives a total of 9 anchor boxes. Normally, we have an input image of 1000 x 600 size which is reduced to 1000/16 x 600/16 after passing through the backbone. We use anchor boxes of sizes 128 x 128, 256 x 256 and 512 x 512. 20 | 21 | Initially, we use a sliding window protocol to generate the anchor boxes, so we obtain a large number of anchor boxes. Primarily we remove all the cross-boundary anchor boxes. 22 | 23 | For each anchor box left after this primary removal, a classification score is obtained from the network. The RPN uses a two-class classification. It predicts whether an anchor box contains an object or not. It does not specify which type of object. The classification provides us with a confidence score. If multiple anchor boxes indicate a particular set of pixels, we keep only the anchor with the maximum confidence score, rejecting the others. This is called **Non-max suppression**. 24 | 25 | The model also predicts bounding box coordinates (tx,ty,th,tw), which give the position of the area of interest. (tx,ty) is the center of the area and (tw,th) represent the width and height of the box. These values are predicted using regression. 26 | 27 | So, finally, we get a number of bounding boxes with their confidence scores and coordinates. 28 | 29 | #### Training 30 | 31 | We consider the predicted bounding boxes with IOU > 0.7 with the actual bounding boxes as positive samples, and the bounding boxes with IOU < 0.3 are considered as negative samples. This is done to balance the number of positive and negative samples. 128 positive and 128 negative samples are randomly selected to form the batch, and training of the model is done. The training of the RPN uses a multi-task loss function, as it is a combination of classification and regression losses. It is given by: 32 | 33 | ![Loss](https://miro.medium.com/max/678/1*mq0jKvp4I1o1ecHBTzQLHA.png) 34 | 35 | Here i is the index of the anchor in the mini-batch. The classification loss L𝒸ₗₛ(pᵢ, pᵢ*) is the log loss over two classes (object vs not object).
pᵢ is the output score from the classification branch for anchor i, and pᵢ* is the groundtruth label (1 or 0). 36 | The regression loss Lᵣₑ(tᵢ, tᵢ*) is activated only if the anchor actually contains an object i.e., the groundtruth pᵢ* is 1. The term tᵢ is the output prediction of the regression layer and consists of 4 variables \[tₓ, tᵧ, tw, tₕ] It is given by: 37 | 38 | ![reg](https://miro.medium.com/max/769/1*s6uMkokhzyGoMLSXLMm9Ug.png) 39 | 40 | Here x, y, w, and h correspond to the (x, y) coordinates of the box centre and the height h and width w of the box. xₐ, x* stand for the coordinates of the anchor box and its corresponding groundtruth bounding box. 41 | 42 | The *lambda* parameter is used to stabilize the balance of the two losses. Normally the value is kept equal to 10. 43 | 44 | ![out](https://miro.medium.com/max/375/1*JDQw0RwmnIKeRABw3ZDI7Q.png) 45 | 46 | So, finally we flatten the anchor boxes as a 256-d 1D vector and send to the dense layers from where we get an output of 2 x k confidence scores from the classification layer and a 4 x k coordinates from the regression layers, for k chosen anchor boxes. 47 | 48 | ##### Hard Negative Mining 49 | 50 | It is a strategy which are used in most of such algorithms, in order to balance the number of positive and negative samples. When initially we obtain a lot of anchor boxes, its very easy to get the Positive samples, using the overlap with the original bounding boxes. But, for the negative samples, we have a huge number of anchor boxes, which if all are considered will cause a huge imbalance. To solve this we do not consider the boxes, which cover absolutely dark backgrounds or have 0 overlap with the actual boxes. We consider such boxes, which cover interesting points with some variations in the feature maps but actually are not object, to make our model more robust. 51 | 52 | References: 53 | 54 | 1. https://www.geeksforgeeks.org/faster-r-cnn-ml/ 55 | 2. https://medium.com/@nabil.madali/demystifying-region-proposal-network-rpn-faa5a8fb8fce 56 | 3. https://towardsdatascience.com/faster-r-cnn-for-object-detection-a-technical-summary-474c5b857b46 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | -------------------------------------------------------------------------------- /Applications/Object Detection/RCNN/Selective Search/README.md: -------------------------------------------------------------------------------- 1 | ### Selective Search 2 | 3 | Before we move into the details of selective search, we have to look at another algorithm **Felzenszwalb’s Algorithm**, which is a image segmentation algorithm, on which the selective algorithm is built. 4 | 5 | 6 | #### Felzenszwalb’s Algorithm based Image Segmentation 7 | 8 | In case of image classification here is only one object, but for object detection and image segmentation we have more than one objects in an image. For multiple object detection we need to identify the regions that potentially contains an object. To solve this, a graph based segmenting method was proposed in order to mark and segment the similar portions of the image. 9 | 10 | We use a undirected graph G=(V,E) to represent an input image. One vertex vi ∈ V represents one pixel. One edge e=(vi,vj) ∈ E connects two vertices vi and vj. Its associated weight w(vi,vj) measures the dissimilarity between vi and vj. The dissimilarity can be quantified in dimensions like color, location, intensity, etc. The higher the weight, the less similar two pixels are. 
A segmentation solution S is a partition of V into multiple connected components, {C}. Intuitively similar pixels should belong to the same components while dissimilar ones are assigned to different components. 11 | 12 | The graphs are created based on Nearest Neighbour policy. Each pixel is a point in the feature space (x, y, r, g, b), in which (x, y) is the pixel location and (r, g, b) is the color values in RGB. The weight is the Euclidean distance between two pixels’ feature vectors. 13 | 14 | We need to know 2 distance metrics to create the graph. 15 | 16 | 1. Internal distance: Int(C)=max(w(e)), where e ∈ MST(C,E) MST is the minimum spanning tree of the components. A component C can still remain connected even when we have removed all the edges with weights < Int(C). 17 | 18 | 2. Difference between two components: Dif(C1,C2) = min ( w(vi,vj)). Dif(C1,C2) = ∞ if there is no edge in-between. 19 | 20 | 3. Minimum internal difference: MInt(C1,C2)= min (Int(C1)+τ(C1), Int(C2)+τ(C2) ), where τ(C)=k/|C| helps make sure we have a meaningful threshold for the difference between components. With a higher k, it is more likely to result in larger components. 21 | 22 | If the weights of two vertices belonging two different components C1,C2 is less than MInt(C1,C2), we merge the two components. 23 | 24 | Algorithm: 25 | 26 | 1. Edges are sorted by weight in ascending order, labeled as e1,e2,…,em. 27 | 2. Initially, each pixel stays in its own component, so we start with n components 28 | 3. Repeat for k=1,…,m: 29 | 4. The segmentation snapshot at the step k is denoted as Sk. 30 | 5. We take the k-th edge in the order, ek=(vi,vj). 31 | 6. If vi and vj belong to the same component, do nothing and thus Sk=Sk−1. 32 | 7. If vi and vj belong to two different components C1 and C2 as in the segmentation Sk−1, we want to merge them into one if w(vi,vj) ≤ MInt(C1,C2); otherwise do nothing. 33 | 34 | 35 | ![Image_ex](https://lilianweng.github.io/lil-log/assets/images/image-segmentation-indoor.png) 36 | 37 | #### Selective Search Algorithm: 38 | 39 | Selective Search algorithm is used to generate target regions which can contain the objects in the image. It initially uses the Felzenszwalb’s Algorithm based Image Segmentation for producing the similar regions. 40 | 41 | 1. At the initialization stage, apply Felzenszwalb and Huttenlocher’s graph-based image segmentation algorithm to create regions to start with. 42 | 43 | 2. Use a greedy algorithm to iteratively group regions together: 44 | ->First the similarities between all neighbouring regions are calculated. 45 | ->The two most similar regions are grouped together, and new similarities are calculated between the resulting region and its neighbours. 46 | 47 | 3. The process of grouping the most similar regions (Step 2) is repeated until the whole image becomes a single region. 48 | 49 | The selective searches the proposed basic properties of images as similarity metrics between two regions ri and rj, like, 50 | 51 | 1. Colour 52 | 2. Texture 53 | 3. Size 54 | 4. Shape 55 | 56 | By tuning the threshold k in Felzenszwalb and Huttenlocher’s algorithm, changing the color space and picking different combinations of similarity metrics, we can produce a diverse set of Selective Search strategies. 57 | 58 | 59 | References: 60 | 61 | 1. http://cs.brown.edu/people/pfelzens/papers/seg-ijcv.pdf 62 | 2. http://www.huppelen.nl/publications/selectiveSearchDraft.pdf 63 | 3. https://www.pyimagesearch.com/2020/06/29/opencv-selective-search-for-object-detection/ 64 | 4. 
https://lilianweng.github.io/lil-log/2017/10/29/object-recognition-for-dummies-part-1.html 65 | -------------------------------------------------------------------------------- /Applications/Object Detection/README.md: -------------------------------------------------------------------------------- 1 | ### Object Detection 2 | 3 | It is one of the major tasks in the computer vision domain. Till now we have done image classification, that is we have answered the question, **"what is the image?"**. Then we started looking at the question **"Where is the object in the image?"**, through image localization concepts. Object detection problem attempts to answer both the questions **where** and **what** together, i.e, both localization and classification. Now, in classification, we can have only one object in the image. There can be multiple objects in the image in case of object detection algorithms. For every object present in the image, we need to find where the object is present and what the object is. 4 | 5 | ![Ex](https://miro.medium.com/max/1600/1*u3SAkFbIQHBC8hSifHUMeA.jpeg) 6 | 7 | ### Algorithms 8 | 9 | To accomplish this task, we have uncovered several algorithms over the time. Major algorithms are: 10 | 11 | 1. R-CNN: Regional-Convolutional Network 12 | 2. Histogram of Oriented Gradients(HOG) 13 | 3. R-FCN: Regional-Fully Convolutional Network 14 | 4. Fast R-CNN 15 | 5. Faster R-CNN 16 | 6. Mask R-CNN 17 | 7. Spatial Pyramid Pooling (SPP-net) 18 | 8. Single Shot Detector (SSD) 19 | 9. You Only Look Once (YOLO) 20 | 10. Retinanet 21 | 11. MobileNet (For embedded and mobile devices) 22 | 23 | HOG is one the traditional descripter based method. Others are majorly divided into two classes based on the methodology of the algorithms: 24 | 25 | **1. Classification based methods:** Methods like RCNN, Fast RCNN, Faster RCNN amd SPP-Net models the Object Detection problem as classification problem . So, in this case, we take windows of fixed side from input images, from locations that may contain objects, and feed the patches to the image classifiers. so, we know the location through the windows and the classifier replies the "what" question. So, these are two step detectors. On first stage, the algorithm selects the proposed regions where objects are present and on the second stage it deploys the classifier on the regions proposed. 26 | 27 | **2. Regression based methods:** Methods like YOLO and SSD formulate the object detection problem differently. These classifiers detects and classifies the objects on an image on a single shot. Instead of selecting regions or patches, these algorithms detects images and places bounding boxes all over the image. 28 | 29 | -------------------------------------------------------------------------------- /Applications/Object Detection/Retinanet/README.md: -------------------------------------------------------------------------------- 1 | #### RetinaNet 2 | 3 | RetinaNet was proposed by Tsung-Yi Lin at Facebook AI Research in the year 2018. RetinaNet was proposed as a one stage detector like SSD and YOLO. The RetinaNet introduced a Focal loss inorder to deal with the imbalance in the data samples while training object detection networks. 4 | 5 | There are two types of negatives in detection problems, "easy negatives" (the plain background classes, without any type of noise) which are detected very easily by the detectors, and the "hard negetives" (having noise in the background) which are not detected easily. 
We need to train the model harder on the hard negatives rather than the easy ones. Besides that, there is a huge class imbalance between the foreground and the background detections in object detection. Methods like SSD use "hard negative mining", but when a large number of bounding boxes are predicted, it is impossible to handle the imbalance using plain bootstrapping. 6 | 7 | To handle this, RetinaNet introduced the concept of **focal loss**. 8 | 9 | #### Focal loss 10 | 11 | RetinaNet is able to handle about 100k candidate boxes by resolving the class imbalance problem using focal loss. 12 | 13 | For the normal Cross-Entropy we use, 14 | 15 | **C.E = - ylog(p) - (1-y)log(1-p).** 16 | 17 | so for y=1 we get the first part and if y=0 we go for the 2nd part. But in this case, we focus on both the classes equally, and the loss contributions from the easy negatives are much bigger than those from the hard negatives, due to the imbalance between the foreground and background. The hard negatives are mostly obtained from places which are foreground but are not detectable objects. This imbalance leads to overshadowing of the hard-negative examples. 18 | 19 | One of the methods to deal with the problem was the alpha-balanced loss equation given by: 20 | 21 | **alpha weighted C.E = - alpha\*y\*log(p) - (1-alpha)\*(1-y)\*log(1-p).** 22 | 23 | Now the alpha parameter is set from the ratio between the frequency of the over-represented class and the under-represented class. So, the larger the value of alpha, the more weight the error carries for the under-represented class, and the less weight for the over-represented background class. In real scenarios, alpha is found using cross validation. 24 | 25 | The idea of focal loss went further, penalizing and training the model focussed more on the hard negatives. It replaces the loss function with: 26 | 27 | **Focal loss = - (1-p)^gamma\*y\*log(p) - p^gamma\*(1-y)\*log(1-p).** 28 | 29 | So, for an easy negative, the predicted probability p is close to zero, so log(1-p) tends to 0 and the modulating factor p^gamma is also close to zero, giving it very little weight. For a hard negative, the model tends to predict p close to 1, so log(1-p) becomes a large negative number, and the higher the predicted probability, the larger the weight due to the p^gamma term. How much we want to penalize is controlled by the gamma parameter, which takes values in \[0,5]. 30 | 31 | The variant used in RetinaNet is the alpha-variant of the Focal loss, which makes it highly customizable through the alpha and gamma control variables. It is given by: 32 | 33 | **alpha variant Focal loss = - alpha\*(1-p)^gamma\*y\*log(p) - (1-alpha)\*p^gamma\*(1-y)\*log(1-p).** 34 | 35 | This lets the loss function focus on the imbalance and the hard negatives at the same time. 36 | 37 | #### RetinaNet Training 38 | 39 | RetinaNet uses the alpha-balanced Focal loss as the classification loss, and the smooth L1 loss for the bounding box regression, just like SSD and Fast-RCNN. These two combined make up the multi-task loss function for RetinaNet. 40 | 41 | Total loss is given by: 42 | 43 | Loss = lambda\*localization loss + classification loss. 44 | 45 | where lambda is the weight control parameter. 46 | 47 | #### Matching 48 | 49 | In case of RetinaNet, the anchor boxes that have IOU >= 0.5 with the ground truth are taken as positive, the ones with IOU <= 0.4 as negative, and the boxes in between are ignored (a short illustrative sketch of the focal loss is given below).
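Below is a minimal NumPy sketch of the alpha-balanced focal loss written directly from the equations above — an illustration only, assuming binary labels, and not the authors' implementation:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Alpha-balanced focal loss for binary labels.

    p: predicted probability of the foreground class, y: 0/1 ground truth.
    alpha and gamma follow the commonly quoted defaults from the paper.
    """
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    pos = -alpha * ((1.0 - p) ** gamma) * y * np.log(p)                 # foreground term
    neg = -(1.0 - alpha) * (p ** gamma) * (1.0 - y) * np.log(1.0 - p)   # background term
    return pos + neg

# An easy negative (p close to 0) contributes almost nothing,
# while a hard negative (p close to 1) keeps a large weight.
print(focal_loss(np.array([0.05, 0.95]), np.array([0.0, 0.0])))
```

With these defaults, the easy negative in the example contributes a loss of roughly 1e-4, while the hard negative contributes close to 2, which is exactly the focusing behaviour described above.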
For the range IOU>=0.5 the classification loss is considered. 50 | 51 | #### Architecture of RetinaNet 52 | 53 | RetinaNet uses a combination of Feature Pyramid Network and ResNet as backbone network for prediction. 54 | 55 | ![Arch](https://blog.zenggyu.com/post/2018-12-05/fig_1.jpg) 56 | 57 | The above network gives the architectural details of the RetinaNet. FPN is built on top of a ResNet, which serves as the deep feature extractor and the bottom up pathway of the FPN network. The top down pathway is used to predict and detect the object at different resolutions, which help the model detect different size of objects. 58 | 59 | ![Arch_FPN](https://lilianweng.github.io/lil-log/assets/images/featurized-image-pyramid.png) 60 | 61 | The above architecture was proposed for FPN. The architecture of RetinaNet remains similar, but in case of ResNet there are 5 layers in both the bottom up and top down paths. So, we get C5 also, on which we apply 3x3 convolutions to obtain P5 and, then we apply a 3x3 conv with stride 2 on C5 again to obtain P6. We further apply 3x3 conv,with stride 2 and ReLU on P6 to get P7. P3 to P7 are used for predictions to give multi-scale detections, 62 | 63 | The upper layers are smaller in dimansion and robust to detect large objects and vice versa. All the feature maps are sent to the classification and regression box subnets for further detection. The classification subnet gives an output of W x H x KA and the regression subnet gives an output of W x H x 4A, where W and H are width and hieght of the feature maps on which the predictions are done, A is the number of anchor boxes and K is the number of classes we want to predict. 64 | 65 | The authors have considered 9 anchor boxes per level. The size ratios used are 2^0, 2^(1/3), 2^(2/3) and each size has three aspect ratios 1/2,1 and 2. 66 | 67 | References: 68 | 69 | 1. https://arxiv.org/pdf/1708.02002.pdf 70 | 2. https://arxiv.org/pdf/1612.03144.pdf 71 | 3. https://towardsdatascience.com/review-retinanet-focal-loss-object-detection-38fba6afabe4 72 | 4. https://blog.zenggyu.com/en/post/2018-12-05/retinanet-explained-and-demystified/ 73 | 5. https://medium.com/@14prakash/the-intuition-behind-retinanet-eb636755607d 74 | 75 | 76 | 77 | -------------------------------------------------------------------------------- /Applications/Object Detection/SPP_net/README.md: -------------------------------------------------------------------------------- 1 | ### SPP-Net (Spatial Pyramidal Pooling) 2 | 3 | SPP-Net was proposed by Kaiming He in the year 2014 from Microsoft. It mainly focused on two ideas. The first one is that, it was seen that, in order to obtain predictions, we need to fix the size of the input image, so the authors came up with a way to make the model independent of the input size. So, the model could predict on any size of the input image. The second one is that the authors wanted to deploy the idea of image or feature pyramids using pooling layers, which will allow creation of a multi-scale featture map, and thus help to obtain more robust features. 4 | 5 | The authors proposed at the transisition level between the convolutional layers and the fully connected layers, mutiple pooling layers must be used with completely varying scales. 6 | 7 | #### SPP Layer 8 | 9 | ![SPP module](https://miro.medium.com/max/1176/1*Af0rCJ67rVYdfIfhwnwi3A.png) 10 | 11 | The SPP module is represented in the above image. As we can see the feature maps are obtained after going through the backbone network. 
The feature maps are of arbitrary dimension of say MxN and has 256 channels. Now, a set of pooling layers are applied with different scales, to obtain different feature maps. As we can see the 1st feature map is pooled to have one value for each channel. So, for 256 channels, there are 256 such values, which are flattened to obain a 1d vector of 1 x 256 values. The 2nd feature map is pooled such that the resultant feature map can contain 4 values for a channel, so it is flattened to create a 4 x 256d vector. Similarly, for the third scale we obtain a 16 x 256d vector. 12 | 13 | All the vectors are concatenated to form a 1-d vector. Now, this vector is sent to the predicting layer, but it has all multiscale information. And as we create fix length 1d vectors, we donot need to worry about the size of the input image size. 14 | 15 | So, how is this achieved? It has been shown by the authors by global pooling and proper selection of the window of the pooling layers and stride we can create output features as we want. The authors have shown, if we want an output feature map of \[n x n] and we have an input of size \[a x a]. We must select a window size of cieling(a/n) and stride of size floor(a/n). 16 | 17 | #### Multiview Training and Testing 18 | 19 | The authors used a pre-trained ZFNet, AlexNet and Overfeat networks with SPP layer and found improvement in performance. To train for multi view or prediction on different size of images, authors trained the networks initially on 224 x 224 size data and changed to 180 x 180 size input images randomly after some epochs. The trainings shared same parameters. During testing 6 scales were used {224, 256, 300, 360, 448, 560}. 20 | 21 | The authors also used a 4-level SPPnet pyramid with {6×6, 3×3, 2×2, 1×1} pooled feature maps. These showed improved performance. 22 | 23 | #### SPPNet in Object Detection 24 | 25 | ![Arch](https://miro.medium.com/max/1374/1*EMhHR_g4UWEYpxsVWdpKdA.png) 26 | 27 | For object detection, SPPNet works very similar to RCNN networks. It uses a Selective search algorithm to obtain the region proposals. The only difference is in SPPNet the backbone layer is applied on the input image and feature maps are obtained. On these feature maps we use selective search algorithms, get the region proposals, which are sent to the spp layers to obtain a fixed size vector. For each vector, we have a SVM for the prediction and a bounding box regressor. 28 | 29 | References: 30 | 31 | 1. https://arxiv.org/pdf/1406.4729.pdf 32 | 2. https://medium.com/coinmonks/review-sppnet-1st-runner-up-object-detection-2nd-runner-up-image-classification-in-ilsvrc-906da3753679 33 | 34 | Deeper understanding: 35 | 36 | https://towardsdatascience.com/understanding-sppnet-for-object-detection-and-classification-682d6d2bdfb 37 | 38 | 39 | -------------------------------------------------------------------------------- /Applications/Object Detection/SSD/README.md: -------------------------------------------------------------------------------- 1 | ### Single Shot Detector 2 | 3 | It is one of the earliest attempts in order to build an one stage object detector, i.e, without the region proposal part as in the two stage detectors in the RCNN family. The concept was put forward by Wei Liu in the paper titled "SSD: Single Shot MultiBox Detector". The algorithm was fast and meant for real time object detection. 
4 | 5 | Instead of using a region proposal algorithm to propose regions and running classification algorithms on the proposed regions, single shot detectors send the image through a backbone network to generate a feature map, divide it into grids, and apply detection and classification on each of the grids simultaneously in a single stage. 6 | 7 | #### Architecture 8 | 9 | ![Arch](https://miro.medium.com/max/2464/1*hdSE1UCV7gA7jzfQ03EnWw.png) 10 | 11 | The architecture of the single shot detector is given in the above image. It has a base network, which can be any pretrained network, that functions as a backbone and extracts a feature map from the image. In the paper, the backbone used is the VGG-16 network. The VGG-16 network provides the feature map of size 38 x 38 x 1024 after the Conv_5_3 or the final layer of the network. The authors have added 6 convolutional layers after the backbone network in order to downsample the feature map and extract more robust features. 12 | 13 | SSD makes predictions from 6 levels of the feature maps. This forms an image-pyramid-type structure. The base feature obtained from the Conv_4_3 layer of the backbone network, of size 38 x 38, gives the first set of detections. Then the map obtained from the backbone is downsampled using 6 more convolutional layers. The first obtained feature map is of dimension 19x19, which gives the second set of detections. We further apply convolutions on it to obtain another feature map of dimension 10x10, which gives the third set, and then we get 5x5, 3x3 and 1x1 size feature maps correspondingly. 14 | 15 | ![Pyramid](https://miro.medium.com/max/700/1*sZUWR2XgCAJ6AXM5NXYjNg.png) 16 | 17 | The larger feature maps are responsible for the detection of the finer features and thus the smaller objects, while the downsampled feature maps, having more globally robust features, are responsible for the detection of the larger objects in the image. 18 | 19 | 20 | #### Localization: 21 | 22 | The SSD algorithm, like YOLO, also divides the feature maps into grids: the first feature map into 38x38, the second one into 19x19 and so on. The SSD algorithm uses anchors (called default boxes here) just as YOLO and Faster RCNN do. The anchors are responsible for detecting the position of the objects. The authors used a k-means clustering method to determine the number of anchor boxes to be used for a particular grid. The authors decided to use 4 anchor boxes for a few of the 6 prediction levels and 6 anchor boxes for the other levels. The number of anchor boxes for each level was chosen based on the results of the k-means clustering. 23 | 24 | ![anchor](https://miro.medium.com/max/2400/1*vNaiiFUVwCfzx1znKiFYYw.jpeg) 25 | 26 | So, for the 38x38 feature map we have 4 anchor boxes for each grid cell. For the 19x19 level we have 6 boxes per grid cell, again 6 per cell for the 10x10 level, 6 for the 5x5 level, and 4 for the 3x3 and 1x1 levels. So, in total the SSD outputs predictions for 8732 anchor boxes. The anchor boxes vary in scales and aspect ratios with respect to the size of the feature maps used at each level. The aspect ratios are fixed to some values which help the anchor boxes start with a size that can detect objects easily and help the network obtain stability. The 5 aspect ratios and their corresponding heights and widths are: 27 | 28 | ![ar](https://miro.medium.com/max/558/1*vI4qQt5MRNTtOA1ODyKbEA.png) 29 | 30 | 1:1 is used as the 6th aspect ratio.
So, we obtain 6 anchor boxes, and for the levels with 4 anchor boxes the 3 and 1/3 aspect ratios are dropped. And sk is the scaling factor for the anchor boxes, given by: 31 | 32 | ![scale](https://miro.medium.com/max/868/1*bX2dGa3mgbO5Fdwv8biQUg.png) 33 | 34 | where k denotes the level of the feature map, Smin is the minimum scale 0.2, Smax is the maximum scale 0.9, and the total number of levels is m=6. 35 | 36 | #### Prediction 37 | 38 | For each anchor box, we obtain 4 values for the bounding box (∆cx, ∆cy, w, h) and 21 class probability values, 20 for the specific classes and 1 for the background class. ∆cx, ∆cy, w and h represent the offsets from the center of the anchor box and its width and height. So, each of the anchor boxes predicts 25 values, and at the first level we get 38 x 38 x 4 x (21 + 4) predictions. The feature maps are of dimension N x N x M, where M is the number of channels, and 3x3 convolutions are used to obtain the results. 39 | 40 | SSD predictions are classified as positive matches or negative matches. SSD only uses positive matches in calculating the localization cost. If the corresponding anchor box has an IoU greater than 0.5 with the ground truth, the match is positive. Otherwise, it is negative. It also uses a non-max suppression method to obtain the predicted box. The box with the maximum overlap with the ground truth is chosen. 41 | 42 | #### Loss Function: 43 | 44 | Like other object detection networks, and much like Fast and Faster RCNN, the SSD uses a multi-task loss function. Its loss function consists of two terms. One term serves as the loss function for the localization portion and the second term serves for the class prediction confidence score. 45 | 46 | Localization loss is given by a smooth L1 loss, as in the case of Fast RCNN. It is given by: 47 | 48 | ![loc_loss](https://miro.medium.com/max/590/1*yOU4TCmYwLgS06kvcfSWxA.png) 49 | 50 | Lloc is the localization loss, which is the smooth L1 loss between the predicted box (l) and the ground-truth box (g) parameters. These parameters include the offsets for the center point (cx, cy), width (w) and height (h) of the bounding box. 51 | 52 | The confidence loss for the class prediction is given by: 53 | 54 | ![Conf](https://miro.medium.com/max/770/1*flNOlxfJj0Pyt1lNZmQaLA.png) 55 | 56 | Lconf is the confidence loss, which is the softmax loss over multiple class confidences (c). 57 | 58 | The combined loss is given by: 59 | 60 | ![Total](https://miro.medium.com/max/900/1*vfpARz-ozzpb0jHM8yJ0zA.png) 61 | 62 | 63 | The alpha value is a weight variable and is set to 1.
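Putting the default-box geometry from the Localization section into code, here is a small NumPy sketch that computes the per-level scales s_k with the formula shown above and the (width, height) of every default box, and confirms the 8732 total; the grid sizes and per-level box counts follow the text, while the variable names are just illustrative.

```python
import numpy as np

s_min, s_max, m = 0.2, 0.9, 6
grids     = [38, 19, 10, 5, 3, 1]                  # feature map sizes of the 6 levels
ratio_set = [[1, 2, 0.5],                           # levels with 4 boxes per cell
             [1, 2, 0.5, 3, 1 / 3],                 # levels with 6 boxes per cell
             [1, 2, 0.5, 3, 1 / 3],
             [1, 2, 0.5, 3, 1 / 3],
             [1, 2, 0.5],
             [1, 2, 0.5]]

# s_k grows linearly from 0.2 to 0.9 across the levels; an extra scale is appended
# for the additional 1:1 box of the last level.
scales = [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)] + [1.0]

total = 0
for k, (grid, ratios) in enumerate(zip(grids, ratio_set)):
    shapes = [(scales[k] * np.sqrt(ar), scales[k] / np.sqrt(ar)) for ar in ratios]
    shapes.append((np.sqrt(scales[k] * scales[k + 1]),) * 2)   # extra 1:1 box with a larger scale
    total += grid * grid * len(shapes)
print(total)   # 8732
```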
64 | 65 | References: 66 | 67 | 1. https://arxiv.org/pdf/1512.02325.pdf 68 | 2. https://towardsdatascience.com/review-ssd-single-shot-detector-object-detection-851a94607d11 69 | 3. https://jonathan-hui.medium.com/ssd-object-detection-single-shot-multibox-detector-for-real-time-processing-9bd8deac0e06 70 | 71 | -------------------------------------------------------------------------------- /Applications/Object Detection/YOLO/README.md: -------------------------------------------------------------------------------- 1 | ### YOLO (You Only Look Once) 2 | 3 | As the name suggests, the YOLO algorithm is a one-stage algorithm devised for fast and robust object detection. Unlike the R-CNN family, it does not work on selected areas or target regions proposed by region proposal networks and selective search methods. These algorithms detect objects all over the image in a single pass, which is faster and simpler but drags down performance. 4 | 5 | 6 | References: 7 | 8 | 1. https://jonathan-hui.medium.com/real-time-object-detection-with-yolo-yolov2-28b1b93e2088 9 | 2. https://amrokamal-47691.medium.com/yolo-yolov2-and-yolov3-all-you-want-to-know-7e3e92dc4899 10 | -------------------------------------------------------------------------------- /Applications/Object Detection/YOLO/YOLO V1/README.md: -------------------------------------------------------------------------------- 1 | ### YOLO Version 1 or YOLO 2 | 3 | The YOLO algorithm is known to be one of the first one-stage detectors used to detect multiple objects in an image in real time. It was proposed by Joseph Redmon et al. in 2016 in the paper titled "You Only Look Once: 4 | Unified, Real-Time Object Detection". 5 | 6 | #### Idea 7 | The basic idea behind the YOLO algorithm is to divide the original input image into S x S subgrids and predict the objects in each grid separately and simultaneously, thus speeding up the process. In the original paper, the value of S=7. So, the authors divided the entire image into 7x7=49 total grids. 8 | 9 | ![Divisions](https://miro.medium.com/max/770/1*9nikM2b0u-m67SJpQXftKA.png) 10 | 11 | For each grid cell, if the centre of an object falls in that grid cell, that grid cell is responsible for detecting the object. But one grid cell can detect only one object. So, over the 49 grids only 49 objects can be detected. 12 | 13 | ![Center](https://miro.medium.com/max/667/1*4Y1PaY3ZgxKt5w84_0pNxw.jpeg) 14 | 15 | For example, in the above figure, the cell marked in yellow is responsible for detecting the person because the center of the person falls in that particular cell or grid. Now, in order to localize or get an idea of the location of the detected object, each grid can predict a fixed number of bounding boxes (shown in blue). In the original paper the number of bounding boxes that can be used is B=2, as shown in the image. 16 | 17 | The issue with the algorithm is that it can only detect a fixed number of objects and one grid can only detect one object. This prevents the algorithm from detecting smaller and multiple elements. 18 | 19 | For every grid cell, YOLO predicts: 20 | 21 | 1. B boundary boxes using 4 values (bx, by, bw, bh) giving the center and the width and height of the bounding boxes respectively. The values are normalized, so they fall between 0 and 1. 22 | 2. A confidence score for each bounding box. It shows how likely it is that there is an object in the bounding box. It is also termed objectness. 23 | 3. C conditional class probabilities, one for each class that needs to be detected in the image. 24 | 25 | So, the output shape of the YOLOv1 algorithm is S x S x (B x 5 + C), where S is the number of grids, B is the number of bounding boxes and C is the number of classes to be classified. For the original network, it creates an output of dimension 7 x 7 x (2 x 5 + 20) = 7 x 7 x 30, which is reshaped as a (7,7,30) output. It creates and predicts the values for two boundary boxes. 26 | 27 | #### Architecture 28 | 29 | ![Arch](https://miro.medium.com/max/3840/1*9ER4GVUtQGVA2Y0skC9OQQ.png) 30 | 31 | The YOLO v1 architecture has 24 convolutional layers: 20 layers of a pretrained network followed by 4 more convolutional layers, further followed by 2 fully connected layers, one with 4096 nodes and the last one with 7x7x30 nodes, which gives the (7, 7, 30) output we want. The architecture is inspired by GoogLeNet's architecture and has 1x1 reduction layers to reduce dimensions along with 3x3 convolutions. The last layer predicts the outputs as a linear regression, as we do not use a softmax or sigmoid function. The CNN network outputs feature maps of dimension 7 x 7 x 1024, which are flattened and sent to the fully connected layers.
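To see how the (7, 7, 30) output tensor comes about, here is a minimal Keras sketch; the tiny backbone is made up purely for illustration and stands in for the real 24-convolutional-layer network, only the head shape follows the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

S, B, C = 7, 2, 20          # grid size, boxes per cell, classes

model = models.Sequential([
    layers.Input(shape=(448, 448, 3)),
    # Toy feature extractor (illustrative only), ending at a 7x7 spatial resolution.
    layers.Conv2D(16, 3, strides=2, padding="same", activation="relu"),
    layers.MaxPooling2D(4),
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
    layers.MaxPooling2D(4),
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dense(S * S * (B * 5 + C), activation="linear"),  # linear output, no softmax/sigmoid
    layers.Reshape((S, S, B * 5 + C)),                       # (7, 7, 30)
])
model.summary()   # the last layer reports an output shape of (None, 7, 7, 30)
```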
32 | 33 | #### Loss function 34 | 35 | Similar to the RCNN family, the YOLO algorithm also uses a multi-part, multi-task loss. It uses the sum of squared errors (SSE) for each part and scales them using weighting variables to maintain the importance of each term in the loss function. 36 | 37 | ![loss](https://miro.medium.com/max/2400/1*aW6htqx4Q7APLrSQg2eWDw.png) 38 | 39 | As we can see, the loss function has a total of 5 parts, which are derived from three main losses: 40 | 41 | 1. The localization loss: It is the loss obtained as the sum of squared differences between the predicted bounding box and the actual bounding box. 42 | 43 | Given by: 44 | 45 | ![loc_loss](https://miro.medium.com/max/700/1*BwhGMvffFfqtND9413oiwA.png) 46 | 47 | The *lambda coord* controls the weight of the localization error in the loss function. Now, as YOLO predicts bounding boxes for small as well as large objects, the error is taken as the squared difference between the square roots of the width and height instead of the actual values of the width and height of the bounding boxes. The first part of the error gives the SSE between the center points of the bounding boxes. The *lambda coord* is kept at a default value of 5. 48 | 49 | 2. Confidence loss: It focuses on the confidence of the model about the objectness of the bounding box, or the probability of the box really containing an object. We consider a weighted sum of errors for the boxes containing objects and the boxes not containing objects. 50 | 51 | For the boxes containing objects, the error is given by: 52 | 53 | ![contain](https://miro.medium.com/max/700/1*QT7mwEbyLJYIxTYtOWClFQ.png) 54 | 55 | If the box does not really contain an object, the error is given by: 56 | 57 | ![not_contain](https://miro.medium.com/max/700/1*Yc_OJIXOoV2WaGQ6PqhTXA.png) 58 | 59 | A weight parameter is used for the boxes that do not contain objects and is kept at a value of 0.5. Since most of the boxes do not contain any objects, there is a huge class imbalance, so we use a weighted loss function to give lower importance to the over-represented class. 60 | 61 | 3. Classification loss: If an object is detected, the classification loss at each cell is the squared error of the class conditional probabilities for each class. 62 | 63 | ![class](https://miro.medium.com/max/700/1*lF6SCAVj5jMwLxs39SCogw.png) 64 | 65 | #### Training and Evaluation 66 | 67 | For training, the confidence value c for each bounding box is taken as the IoU value of the actual bounding box and the predicted bounding box. So, we obtain a score of 0 for a bounding box which has no object, and a score of 1 for a box which completely overlaps with the actual box. 68 | 69 | The evaluation is done using a metric called the **class confidence score**, given by: 70 | 71 | class confidence score = box confidence score x conditional class probability. 72 | 73 | It measures both the classification and the localization of the object. 74 | 75 | Defn: 76 | 77 | ![defn](https://miro.medium.com/max/700/1*0IPktA65WxOBfP_ULQWcmw.png) 78 | 79 | The algorithm also uses non-max suppression (NMS) to obtain the best bounding box for an object if it is detected by multiple bounding boxes. It selects the bounding box with the highest confidence value, rejecting the others: if another bounding box has an IOU > 0.5 with the best selected box, it is removed.
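Here is a small NumPy sketch of the greedy non-max suppression step just described; the box format and the 0.5 threshold follow the text, everything else (function and variable names, the sample boxes) is illustrative.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it above the threshold, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 58, 62], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.75, 0.8])
print(non_max_suppression(boxes, scores))   # [0, 2]; the near-duplicate box is suppressed
```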
80 | 81 | Reference: 82 | 83 | 1. https://arxiv.org/pdf/1506.02640.pdf 84 | 2. https://towardsdatascience.com/yolov1-you-only-look-once-object-detection-e1f3ffec8a89 85 | -------------------------------------------------------------------------------- /Applications/Object Detection/YOLO/YOLO V2 and YOLO 9000/README.md: -------------------------------------------------------------------------------- 1 | ### YOLO V2 2 | 3 | The problem with YOLOv1 was that it could only detect 1 object per cell, so if there were multiple objects in one grid cell they could not all be detected, and it was very hard to detect the smaller objects. 4 | 5 | To overcome these problems, the authors Joseph Redmon and Ali Farhadi proposed YOLOv2. The speed and accuracy improved, along with the capacity to detect more objects from an image, in YOLO v2 compared to YOLO. YOLOv2 also improved on the recall and the localization errors. 6 | 7 | #### Modifications: 8 | 9 | 1. YOLOv2 introduced Batch Normalization, which increased the mAP by 2%. 10 | 11 | 2. YOLOv2 introduced a new neural network model with 19 convolutional layers and 5 maxpooling layers as the backbone CNN. It uses 3x3 filters for feature extraction, followed by 1x1 filters to reduce the number of output channels. It also uses a global average pooling layer. It is given as: 12 | 13 | ![model](https://miro.medium.com/max/464/1*8FiQUakp9i4MneU4VXk4Ww.png) 14 | 15 | The model gave an accuracy of 91.2% on the ImageNet classification dataset, which is better than the VGG and YOLO networks. 16 | 17 | 3. The YOLO v2 model was trained on 224 x 224 size data for classification and fine-tuned. Then the resolution of the input images was increased to 448 x 448, and the model was trained and retuned with much fewer epochs for detection. This makes the model robust, makes detector training easier, and moved up the mAP by 4%. 18 | 19 | 4. YOLO v2 removed one of the pooling layers to give better resolution. It was found that the performance is better if the feature maps are of odd spatial size. For that reason the input size was shifted from 448 x 448 to 416 x 416. The authors removed 1 pooling layer, so we get a higher resolution feature map of size 13 x 13 instead of 7 x 7. The feature map is down-sampled by a factor of 32. 20 | 21 | 5. The authors removed the last 2 fully connected layers that were present in the YOLOv1 architecture. This made the network capable of training on any size of image; the size only needs to be a multiple of 32, as the image is downsampled by a factor of 32. So, during training the authors used a variety of sizes from \[320, 352, ..., 608]. After every 10 batches, the images are randomly resized to one of the given sizes. This gives a sense of data augmentation while training, helps the model train on multiple scales and resolutions of data, and forces it to predict well on different scales and dimensions of input images. A small sketch of this random multi-scale resizing is shown below.
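A minimal TensorFlow sketch of the multi-scale training trick from point 5 above; only the size set (multiples of 32 from 320 to 608) and the every-10-batches schedule follow the text, the dummy batch and loop are illustrative.

```python
import random
import tensorflow as tf

SIZES = list(range(320, 609, 32))            # 320, 352, ..., 608 -- all multiples of 32

batch = tf.random.uniform((8, 416, 416, 3))  # a dummy batch standing in for real images
for step in range(30):
    if step % 10 == 0:                       # every 10 batches, pick a new input resolution
        size = random.choice(SIZES)
    resized = tf.image.resize(batch, (size, size))
    # ... the training step on `resized` would go here ...
print(resized.shape)                         # e.g. (8, 544, 544, 3)
```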
22 | 23 | The above are some of the modifications made in YOLO v2. Now, we look at two of the most major changes YOLO v2 brought. 24 | 25 | In YOLO v1, we could detect only one object per cell, and to detect the location and size of the object we used two bounding boxes per cell. It was observed that the size of the bounding boxes was the factor the model took the most time to learn. It was also observed that if we fix the bounding box shapes, we can detect several objects easily. For example, if we use a bounding box with an aspect ratio of 0.41 we can detect a human easily, and for every car the shape is nearly the same. So, the authors initialized 5 fixed bounding box shapes for each grid cell of the SxS cells the image was divided into, and named them **Anchor Boxes** (previously called _priors_). So, initially the training is more stable. 26 | 27 | ![anchor](https://miro.medium.com/max/504/1*8Q8r9ixjTiKLi1mrF36xCw.jpeg) 28 | 29 | Each of the anchor boxes can detect an object. So, we went from detection per cell to detection per anchor box. 30 | The bounding box coordinates were also changed. They are now expressed with respect to the anchor boxes. For each anchor box, the model still predicts 5 values: _to_, the box confidence score, and the 4 box offsets with respect to the anchor box. 31 | 32 | ![pred](https://miro.medium.com/max/700/1*38-Tdx-wQA7c3TX5hdnwpw.jpeg) 33 | 34 | The below image shows the exact locations. 35 | 36 | ![pred_anchor](https://miro.medium.com/max/700/1*gyOSRA_FDz4Pf5njoUb4KQ.jpeg) 37 | 38 | Consequently, the output dimension also changed. For each anchor box, we have 1 confidence score, 4 bounding box offsets, and the scores of the 20 classes we need to classify the objects into. So, the full prediction for an anchor box is of size 25. Each grid cell has 5 anchor boxes, so in total the dimension for a grid cell is 25x5 = 125. Now, as we saw previously, we obtain a feature map of 13x13 after the backbone network. So, we generate an output of size (13,13,125). As the fully connected layers are removed, the authors used a 1x1 convolution to convert the 13 x 13 x 1024 feature map into the 13 x 13 x 125 output. 39 | 40 | The number of anchor boxes is set to 5 using a clustering method. To identify the top-K boundary boxes that have the best coverage of the training data, we run K-means clustering on the training data to locate the centroids of the top-K clusters. 41 | 42 | ![Cluster](https://miro.medium.com/max/700/1*l5wvrPjLlFp6Whgy0MqbKQ.png) 43 | 44 | YOLOv2 also brought a major change in the structure of the network in order to obtain fine-grained features. The 13x13 feature map can be used to detect large objects, but no finer features can be detected, so no smaller objects are detected. To solve this, the authors picked up the 26x26x512 feature map from the previous layers in the backbone, reshaped it to (13,13,2048), concatenated it with the final downsampled feature map of (13,13,1024) to obtain a map of (13,13,3072), and then predicted on that map. 45 | 46 | ![fine_graib](https://miro.medium.com/max/700/1*RuW-SCIML8SHc5_PrIE9-g.jpeg) 47 | 48 | The training of YOLOv2 is the same as v1. The IOU of the actual boxes and the predicted bounding boxes gives the box confidence score.
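A small NumPy sketch of how the predicted offsets are decoded against a grid cell and an anchor prior (the same scheme is written out as equations in the YOLO V3 notes further below); the cell index, prior size and raw offsets here are made-up numbers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, cell_xy, prior_wh):
    """Turn raw predictions (tx, ty, tw, th) into a box relative to the cell and anchor prior."""
    tx, ty, tw, th = t
    cx, cy = cell_xy                 # top-left corner of the responsible grid cell
    pw, ph = prior_wh                # width/height of the anchor box (prior)
    bx = sigmoid(tx) + cx            # sigmoid keeps the center inside the cell (grid-cell units)
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)             # the prior is scaled, never forced negative
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# Anchor of width 0.2 and height 0.4, in cell (6, 6) of the 13x13 grid.
print(decode_box(t=(0.1, -0.3, 0.2, 0.0), cell_xy=(6, 6), prior_wh=(0.2, 0.4)))
```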
50 | 51 | ### YOLO 9000 52 | 53 | One particular problem was observed: the available detection data is much smaller, so the class categories available for detection are much fewer than in the corresponding classification problems. To expand the classes that YOLO can detect, YOLO proposes a method to mix images from both detection and classification datasets during training. It trains the end-to-end network with the object detection samples, while it backpropagates the classification loss from the classification samples to train the classifier path. So, for classification we train the classification model of YOLO v2. The authors finally made the network detect 9000 objects by using multiple datasets. 54 | 55 | The problem with this was how to merge the labels, since if we do so, the labels may not be mutually exclusive. For example, the ImageNet dataset has more than a hundred breeds of dogs like “german shepherd” and “Bedlington terrier”, whereas MS-COCO has only a "dog" class. To solve this the authors merged these datasets, creating a hierarchical model of visual concepts called a **WordTree**. 56 | 57 | ![Word Tree](https://miro.medium.com/max/700/1*nskRV4Wt9Dul-6Jv1x09Bw.png) 58 | 59 | Say, for the ImageNet 1000 classes, we create a WordTree1k. We create 369 nodes as their parent classes and add them as an intermediate layer in the tree. Initially YOLO used a single softmax layer to detect all the classes; now separate softmax layers are used for the child nodes of each parent. So, “german shepherd” and “Bedlington terrier” will be identified in the next layer using a separate softmax after the first layer identifies the parent class to be dog. It is completely based on a tree structure: we choose the node with the highest probability and travel down that branch. 60 | 61 | In YOLO 9000, the authors trained the network on 9000 class categories with 418 intermediate nodes. 62 | 63 | Reference: 64 | 65 | 1. https://arxiv.org/pdf/1612.08242.pdf 66 | 2. https://towardsdatascience.com/review-yolov2-yolo9000-you-only-look-once-object-detection-7883d2b02a65 67 | 3. https://medium.datadriveninvestor.com/review-on-yolov2-11e93c5ea3f1 68 | -------------------------------------------------------------------------------- /Applications/Object Detection/YOLO/YOLO V3/README.md: -------------------------------------------------------------------------------- 1 | ### YOLO V3 2 | 3 | YOLO V3 was published by Joseph Redmon and Ali Farhadi with some improvements and some tuning of old ideas from YOLO v2. 4 | 5 | It brought some changes to how YOLO v2 detected objects, to the architecture, and to the loss functions used. 6 | 7 | It was noticed that YOLOv2 considers all the labels mutually exclusive, i.e., it can predict only one label for a detected object. But if there is a child walking on the street, it can be classified as both "pedestrian" and "child", and these labels are not mutually exclusive. This multilabel classification is not possible in YOLO v2 as it uses a softmax function in the last layer. For softmax all probabilities sum up to 1 and it outputs the class with the maximum probability. So, in order to allow multilabel detection, YOLO v3 replaces the final softmax layer with independent logistic regressions for the individual labels to predict the likeliness of each specific label. So, instead of using a mean squared error as the loss function, the authors use a classification loss. As we are using an individual logistic regression for each label, each prediction is either 0 or 1, so binary cross-entropy is used as the loss function. This also reduces computation complexity. 8 | 9 | YOLOv3 also uses logistic regression for the confidence score.
There is one bounding box for each object. If the bounding box overlaps with the ground truth more than any other bounding boxes the objectness score is 1. If other predicted bounding box overlaps with the ground truths more than the threshold (0.5), no classification loss and localization loss is incurred. Other bounding boxes are ignored. 10 | 11 | The network predicts 4 coordinates for each bounding box, tx, ty, tw, th. If the cell is offset from the top left corner of the image by (cx,cy) and the bounding box prior has width and height Pw, Ph, then the predictions correspond to: 12 | 13 | ![bounding](https://miro.medium.com/max/235/1*L2mpfhg2tixnWJqMSyoEEQ.png) 14 | 15 | where, 16 | 17 | ![box](https://miro.medium.com/max/402/1*K5_cJ0wy_7uOxXvrm-4rQw.png) 18 | 19 | #### Change in modelling 20 | 21 | YOLO struggles to detect small objects as due to the downsampled feature maps, the finer details are missed. YOLO v3 improves the small object detections using short cut connections, which helps to allow obtain finer features. YOLO v3 predicts at three levels instead of YOLO V2's single level prediction. So, YOLO v3 adapts and functions like a feature pyramid network, and uses different sized feature maps for prediction. 22 | 23 | YOLO v3 predicts from the last feature maps. Then, it goes back 2 layers back and upsamples it by 2. YOLOv3 then takes a feature map with higher resolution and merge it with the upsampled feature map using element-wise addition. YOLOv3 apply convolutional filters on the merged map to make the second set of predictions. Agein it repeats the same above process to obtain the third set of predictions. 24 | 25 | #### DarkNet-53: 26 | 27 | The YOLO-v3 introduces DarkNet-53 replacing Darknet-19 as the feature extractor. It consists of 3x3 convolutions followed by 1x1 filters for dimensionality reductions. Skip connections like the residual networks has been introduced in Darknet-53 as an additional change. 28 | 29 | ![Model](https://miro.medium.com/max/334/1*biRYJyCSv-UTbTQTa4Afqg.png) 30 | 31 | 32 | The above image gives the model. 33 | 34 | Apart from all these changes, authors applied a K-means clustering and chose 9 anchor boxes with pre-decided dimensions based on the size and dimension of the input image. The 9 anchor boxes are of the size: (10×13),(16×30),(33×23),(30×61),(62×45),(59× 119),(116 × 90),(156 × 198),(373 × 326) respectively. Now, these 9 boxes are divided in groups of 3 based on their dimensions. Each group is assigned one of the 3 set of predictions that the model produces, as discussed above. The assignment is done based on the dimension of the feature maps at each level of prediction and the size of the anchor boxes. 35 | 36 | At each level of prediction, 3 anchor boxes are assigned for each grid cell, so, for each predicted output, the size of the output is N x N x \[3 x (4 + 1 +80)], where N is the number of grids the feature map is divided into, 3 is the number of anchor boxes, 4 are boundary box predictions, 1 is the objectness score, and 80 are the total number of classes to be predicted. 37 | 38 | References: 39 | 40 | 1. https://arxiv.org/pdf/1804.02767v1.pdf 41 | 2. https://medium.datadriveninvestor.com/review-on-yolov3-faae0dd59425 42 | 3. 
https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b 43 | 44 | -------------------------------------------------------------------------------- /Applications/Object Detection/YOLO/YOLO v4 and v5/README.md: -------------------------------------------------------------------------------- 1 | ### YOLO V4 2 | 3 | YOLO v4 was introduced by Alexey Bochkovskiy in the paper titled **YOLOv4: Optimal Speed and Accuracy of Object Detection**. The paper made major modifications to the YOLO v3 architecture. The changes were mostly made in the architecture of the backbone model. 4 | 5 | #### Architecture: 6 | 7 | ![Arch](https://miro.medium.com/max/3660/1*Z5GOPYFgh7_NTr7drt45mw.png) 8 | 9 | The image above shows the architecture of the YOLO v4 network. As a one-stage detector, it has 3 separate portions. The input is first passed through a Backbone network, which can be a pretrained network like VGG, ResNet, DenseNet or Darknet53 as used in YOLOv3. The paper mentions the backbone to be CSPDarknet53, which stands for Cross-Stage-Partial Darknet, meaning that the input feature maps are divided into two parts: one of them is sent through the module, while the other is directly added to the results, and the resultant is sent forward, as shown in the figure below for DenseNet. The main function of this part is to extract features from the input image. 10 | 11 | ![CSP](https://miro.medium.com/max/2400/0*6wqDSouRXeJIgAYj.jpeg) 12 | 13 | The next part is called the Neck. This is the part that connects the backbone with the head, which is the main prediction layer. YOLO v4 tries to extract predictions at different levels to get coverage of both large and smaller objects. The larger feature maps help to obtain the finer features and hence the smaller objects. These finer features get omitted as we move further into the deeper layers, due to the downsampling. At the deeper levels the larger objects are more robustly detected. 14 | 15 | The Neck portion mostly consists of layers like Path Aggregation Networks (PANet), Spatial Pyramidal Pooling Networks (SPP-Nets) or Spatial Attention Modules (SAM). These networks are used to draw multi-scale, multi-dimensional features to generate a more robust set of features. 16 | 17 | Then comes the Head or the Detector portion. It detects the bounding boxes and predicts the classes, just like YOLOv3. 18 | 19 | YOLOv4 also introduces two types of techniques to boost the performance of the models. 20 | 21 | The two methods are: 22 | 23 | 1. **Bag of Freebies**: It is a set of techniques that help during training without adding much inference time. Some popular techniques include data augmentation, random cropping, shadowing, dropout, etc. 24 | 25 | 2. **Bag of Specials**: Unlike BoF, they change the architecture of the network and thus might augment the inference cost a bit. SAM, PAN, and SPP all belong to this family. 26 | 27 | These techniques were used in the backbone, the neck and the head/detector regions of the network, as suggested by the authors. 28 | 29 | Reference: 30 | 31 | 1. https://arxiv.org/pdf/2004.10934.pdf 32 | 2. https://becominghuman.ai/explaining-yolov4-a-one-stage-detector-cdac0826cbd7 33 | 3. https://towardsdatascience.com/yolo-v4-optimal-speed-accuracy-for-object-detection-79896ed47b50 34 | 4. https://blog.roboflow.com/a-thorough-breakdown-of-yolov4/ 35 | 36 | ### YOLO v5 37 | 38 | YOLO-v5 was introduced soon after YOLO-v4. It was first introduced in a GitHub repo by Glenn Jocher, the founder and CEO of Ultralytics.
It's architecture is pretty similar to YOLO-v4. The details can be found in the references. 39 | 40 | Reference: 41 | 42 | 1. https://github.com/ultralytics/yolov5 43 | 2. https://blog.roboflow.com/yolov5-is-here/ 44 | 3. https://blog.roboflow.com/yolov4-versus-yolov5/ 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | -------------------------------------------------------------------------------- /Applications/Object Recognition/README.md: -------------------------------------------------------------------------------- 1 | ### Object Recognition 2 | 3 | This field in image processing deals with detecting what type of objects are present in an image, and find them in the image. Basically people end up getting cofused about Object Recognition, Object Detection, Object Classification and Object Localization. 4 | 5 | Object localization is the task of locating the object in an image using a bounding box. 6 | Object Classification is the task of finding or predicting what type of object it is. 7 | Object Detection is the combined task of Object classification and Object Localization, 8 | 9 | All these three tasks are termed as Object Recognition together. The Object Recognition algorithm mainly include: 10 | 11 | 1. RCNN family 12 | 2. YOLO family 13 | -------------------------------------------------------------------------------- /Applications/Semantic Segmentation/DeepLab Family/README.md: -------------------------------------------------------------------------------- 1 | ### Deeplab Family 2 | 3 | Google proposed a family of papers in the field of semantic segmemtation. The papers brought a set of new ideas in the field. The set of 4 papers: DeepLab V1, DeepLab V2, DeepLab V3, DeepLab V3+ consecutively outperformed the previous one and all the contemporary algorithms. 4 | 5 | Concpts Proposed: 6 | 1. Version 1: Atrous/Dilated convolutions, Conditional Random Fields (CRF) 7 | 2. Version 2: Atrous Spatial Pyramidal Pooling 8 | 3. Version 3: Parallel processing approach on Atrous Spatial Pyramidal Pooling 9 | 4. Version 3+: Normalization and decoder unit on Version 3, and atrous seperable convolutions 10 | 11 | ### Version 1: 12 | 13 | Deeplab V1 first proposed Atrous/ Dilated convolutions. The problem seen in FCN was, it needed a 32x upsampling, as we needed to go deep in the network to obtain finer features, which requires downsampling of the image. The downsampling is actually required because in order to obtain global information, we need to capture the larger context in the image, which either requires, excessive downsampling or larger kernels, which hugely increases the number of parameters. So, we usually go with the first one, which causes loss in local information due to the downsapling. But in case of segmentation, we need both global and local information. 14 | 15 | So, dilated or atrous convolutional was introduced. Atrous convolution or the hole convolution or dilated convolution which helps in getting an understanding of large context using the same number of parameters. Dilated convolution works by increasing the size of the filter by appending zeros(called holes) to fill the gap between parameters. The number of holes/zeroes filled in between the filter parameters is called by a term dilation rate. When the rate is equal to 1 it is nothing but the normal convolution. When rate is equal to 2 one zero is inserted between every other parameter making the filter look like a 5x5 convolution. 
Now it has the capacity to get the context of 5x5 convolution while having 3x3 convolution parameters. Similarly for rate 3 the receptive field goes to 7x7. 16 | 17 | ![Dilated](https://nanonets.com/blog/content/images/2020/08/main-qimg-d9025e88d7d792e26f4040b767b25819.png) 18 | 19 | The deeplab replced the pooling and the some convolutional layers with Dilated convolutions, with stride 1. This helped to obtain larger context, thus better global features, and also restricts the downsampling to 8x, which prevents huge loss of local features. The 8x upsampling was possible using bilinear upsampling. [WIKI](https://en.wikipedia.org/wiki/Bilinear_interpolation) 20 | 21 | Next, it was seen that the often the output by bilinear interpolation was coarse in nature, in such circumstances, CRF or conditional Random Fields were used to smoothen out the output. 22 | 23 | ![CRF](https://miro.medium.com/max/1284/1*r7dUU5Gb3LaWCr7tnKYm2w.png) 24 | 25 | The idea behind CRF is to minimize a randomness or an energy function, which is a summation of the log proabability of the prediction, and an weighted sum of results of a bilinear filter and a gaussian filter. The idea is to penalize the model for predicting different labels for pixels very close to each other (Focused by the gaussian kernel) and the pixels which are close to each other having similar intensity values (Focused on by the Bilinear Kernel). The first term in the loss function depicts the log probability, and the second term focuses on the discussed penalties. 26 | 27 | ![Res](https://miro.medium.com/max/729/1*0omVBDSf5sucAqsFBastiQ.png) 28 | 29 | ### Version 2: 30 | 31 | Deeplab V2 introduced Atrous Spatial Pyramidal Pooling layers, inspired from Spatial Pyramidal Pooling layers, which captured multi-scale information, from single input images, using 1x1, 2x2, and 4x4 convolutional layers and finally concatenated the feature maps as 1d vectors. 32 | 33 | ASPP takes the concept of fusing information from different scales and applies it to Atrous convolutions. The input is convolved with different dilation rates and the outputs of these are fused together. 34 | 35 | ![ASPP](https://nanonets.com/blog/content/images/size/w1000/2020/08/deeplab_aspp.jpg) 36 | 37 | The input is convolved with 3x3 filters of dilation rates 6, 12, 18 and 24 and the outputs are concatenated together since they are of same size. A 1x1 convolution output is also added to the fused output. The multi-scale information from ASPP helps in improving the results. 38 | 39 | CRF is used here also as a post processing step to improve results. 40 | 41 | ### Version 3: 42 | 43 | Version 3 introduced a parallel processing measure for atrous convolutions and also introduced batch normalization. 44 | 45 | It focused on the fact that using Atrous convolutions, we can actually stack layers, without decreasing the resolution of the image. 46 | 47 | ![Version3](https://miro.medium.com/max/2566/1*8Lg66z7e7ijuLmSkOzhYvA.png) 48 | 49 | It removed the CRF post processing and used the ASPP idea combined with a global Average Pooling just to have a better global feature based idea. 50 | 51 | ### Version 3+ 52 | 53 | The Deeplab V3+ combined the ideas of the encoder-decoder limb and ASPP net to create a new ASPP Encoder-Decoder based network. 
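Since ASPP is the building block that Versions 2, 3 and 3+ all rely on, here is a minimal Keras sketch of an ASPP-style block: parallel atrous (dilated) convolutions with different rates whose outputs are concatenated and fused with a 1x1 convolution. The dilation rates, filter counts and input shape are illustrative only, not the exact DeepLab configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def aspp_block(x, filters=256, rates=(6, 12, 18)):
    """Parallel atrous 3x3 convolutions plus a 1x1 branch, concatenated."""
    branches = [layers.Conv2D(filters, 1, padding="same", activation="relu")(x)]
    for r in rates:
        branches.append(
            layers.Conv2D(filters, 3, padding="same", dilation_rate=r, activation="relu")(x)
        )
    return layers.Concatenate()(branches)

inputs = tf.keras.Input(shape=(64, 64, 512))          # a feature map from the backbone
fused = layers.Conv2D(256, 1)(aspp_block(inputs))     # 1x1 conv fuses the concatenated branches
model = tf.keras.Model(inputs, fused)
print(model.output_shape)                             # (None, 64, 64, 256)
```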
54 | 55 | ![Comp](https://nanonets.com/blog/content/images/size/w1000/2020/08/1_Llh9dQ1ZMBqPMOJSf7WaBQ.png) 56 | 57 | Its main idea was to build a decoder limb on the output of the ASPP layer of Deeplab V3, and remove the biliear interpolation method for upsampling. 58 | 59 | ![ARCH_V3+](https://miro.medium.com/max/770/1*MFchBd4c8ZEgE3qtbnTznw.png) 60 | 61 | DeepLabV3+ proposes to add an intermediate decoder module on top of the DeepLabV3. After processing via DeepLabV3, the features are then upsampled by x4. They are then further processed along with the original features from the feature extraction frontend, before being upscaled again by x4. This lightens the load from the end of the network and provides a shortcut path from the feature extraction frontend to the near end of the network. 62 | 63 | References: 64 | 65 | 1. https://arxiv.org/pdf/1606.00915.pdf 66 | 2. https://arxiv.org/pdf/1412.7062.pdf 67 | 3. https://arxiv.org/pdf/1706.05587.pdf 68 | 4. https://arxiv.org/pdf/1802.02611.pdf 69 | 5. https://arxiv.org/pdf/1502.03240.pdf 70 | 6. https://sh-tsang.medium.com/review-deeplabv3-atrous-separable-convolution-semantic-segmentation-a625f6e83b90 71 | 7. https://towardsdatascience.com/review-deeplabv3-atrous-convolution-semantic-segmentation-6d818bfd1d74 72 | 8. https://towardsdatascience.com/review-deeplabv1-deeplabv2-atrous-convolution-semantic-segmentation-b51c5fbde92d 73 | 9. https://developers.arcgis.com/python/guide/how-deeplabv3-works/ 74 | 10. https://github.com/sadeepj/crfasrnn_keras 75 | 11. https://medium.com/@ihor.shylo/improving-performance-of-image-segmentation-with-conditional-random-fields-crf-8b93f7db396c 76 | -------------------------------------------------------------------------------- /Applications/Semantic Segmentation/DenseNet based FCN/README.md: -------------------------------------------------------------------------------- 1 | ### FC Dense-Nets 2 | 3 | The Montreal Institute for Learning Algorithms, Ecole Polytechnique de Montreal, Imagia Inc., and Computer Vision Center persued another modification to the U-net architecture and proposed DenseNets to replace the basic convolutional blocks of the Unet. 4 | 5 | The authors focused on the facts that the DenseNets have skip connections, which will help the model to learn better features and also the model works on a multi-scale space, i.e, it learns from multiple scaled images at the same time, which helps it to extract richer feature. 6 | 7 | ![Densenet](https://miro.medium.com/max/770/1*rmHdoPjGUjRek6ozH7altw.png) 8 | 9 | The dense blocks are useful as they carry low level features from previous layers directly alongside higher level features from more recent layers, allowing for highly efficient feature reuse. 10 | 11 | ![Arch](https://miro.medium.com/max/716/1*5Bqcgzl6JDXrScL1RXd6Ag.png) 12 | 13 | The above shows the Architecture of the proposed densenet based Unet architecture. 14 | 15 | ![Layers](https://miro.medium.com/max/1970/1*1Pj56mTHPNha8Pg58fWJEQ.png) 16 | 17 | The above give the details of the layers used in the architecture. 18 | 19 | The model architecture uses a skip connection over the DenseNet modules in the encoder limb, to have better gradient flow, but it is not present in the decoder limb, as it becomes "Memory Demanding". The concatenation of feature maps between the encoder and the decoder limb remains as it was in the original Unet architecture, for the model to learn the local information better. 20 | 21 | References: 22 | 23 | 1. https://arxiv.org/pdf/1611.09326.pdf 24 | 2. 
https://arxiv.org/pdf/1608.06993.pdf 25 | 3. https://towardsdatascience.com/review-fc-densenet-one-hundred-layer-tiramisu-semantic-segmentation-22ee3be434d5 26 | 4. https://towardsdatascience.com/review-densenet-image-classification-b6631a8ef803 27 | 28 | 29 | 30 | 31 | -------------------------------------------------------------------------------- /Applications/Semantic Segmentation/Fully Convolutional Network/README.md: -------------------------------------------------------------------------------- 1 | ### Fully Convolution Networks 2 | 3 | Fully convolutional Networks was proposed by Jonathan Long, Evan Shelhamer from UC Berkeley in the paper titled "Fully Convolutional Networks for Semantic Segmentation". 4 | 5 | The idea behind the Fully Convolutional Network evolved from the previously used image classification methods, which used a convolutional network, a corresponding Fully connected layer and finally a softmax layer for prediction. Now, in image classification we did a single prediction. 6 | 7 | The authors replaced the dense layers using 1x1 convolutional layer, with a softmax activation instead of ReLU activation. Now, we need dense predictions, in order to do that, we need to predict for each pixel. But we know, to extract finer features, we must go deep in tha convolutional net, and apply Maxpooling, which decreases the resolution of the image. So, the size of the final feature map is less than the original image. To make pixelwsie predictions, we need the feature map on which we will predict the output, of the same resolution as the input image, so we upsample the feature map to accomplish it. 8 | 9 | ![Network](https://miro.medium.com/max/559/1*NXNGhfSyzQcKzoOSt-Z0Ng.png) 10 | 11 | The paper replaced the upsampling layers with deconvolutional layers, in order to make the model learn in a better way, as upsampling usage has an aliasing effect. 12 | 13 | It is known that, we need to use deeper network to learn finer features, which decreased the feature map resolution, so the authors needed to upsample 32x times, in order to predict. This is why, it was called FCN 32. The output was found to be very coarse and rough. 14 | 15 | It was discovered that, the location information which is mostly spatial localized information was lost, which prevented the model from giving fine-grained predictions. So, the deep model actually, got better at global information, but lost local information, and both of them are necessary for Segmentation problem. 16 | 17 | The authors found that the feature maps, found in the shallower layers of the convolutional networks had better spatial information owing the higher resolution of the maps compared to the deeper layer ones, So, the authors devised a skip connection strategy. 18 | 19 | The authors combined feature maps from the shallower pool layers with the upsampled feature maps, just to provide the spatial local informations better and upsampled from those feature maps to have better information. 20 | 21 | ![FCN](https://miro.medium.com/max/770/1*lUnNaKAjL-Mq10v3tIBtJg.png) 22 | 23 | As we can see, instead of directly upsampling 32x times, we upsample it 2x times and then combine with feature map from the corresponding pooing layer. Then, that feature map is upsampled 16x, and the predictions are made on that map. This model was termed as FCN-16. Similarly, we obtain FCN-8. 24 | 25 | The later modellings were known to obtain much better segmentation results. 
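Here is a minimal Keras sketch of the FCN-16-style fusion described above: the deeper score map is upsampled 2x with a transposed convolution and added to the score map from the shallower pooling layer before the final upsampling. All shapes and channel counts are illustrative stand-ins, not the exact VGG dimensions.

```python
import tensorflow as tf
from tensorflow.keras import layers

num_classes = 21

pool4 = tf.keras.Input(shape=(28, 28, 512), name="pool4")   # shallower, higher-resolution map
pool5 = tf.keras.Input(shape=(14, 14, 512), name="pool5")   # deeper, coarser map

# Score both maps, upsample the deeper one by 2x, and add them (the skip connection).
score5 = layers.Conv2D(num_classes, 1)(pool5)
up5 = layers.Conv2DTranspose(num_classes, 4, strides=2, padding="same")(score5)
score4 = layers.Conv2D(num_classes, 1)(pool4)
fused = layers.Add()([up5, score4])

# Upsample the fused map by 16x to get back to input resolution for pixel-wise prediction.
logits = layers.Conv2DTranspose(num_classes, 32, strides=16, padding="same")(fused)
model = tf.keras.Model([pool4, pool5], logits)
print(model.output_shape)    # (None, 448, 448, 21)
```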
26 | 27 | ![FCN_res](https://miro.medium.com/max/630/1*tcYqvV0KHjK2ANBGe8GpjQ.png) 28 | 29 | References: 30 | 1. https://arxiv.org/pdf/1411.4038.pdf 31 | 2. https://towardsdatascience.com/review-fcn-semantic-segmentation-eb8c9b50d2d1 32 | 33 | -------------------------------------------------------------------------------- /Applications/Semantic Segmentation/MaskRCNN/README.md: -------------------------------------------------------------------------------- 1 | ### Mask RCNN 2 | 3 | Mask RCNN is a modified version of the objct detection network Faster RCNN. Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. 4 | 5 | Mask-RCNN just added a third branch to the object detection, and bounding box regression, i.e, the object mask prediction. For every output, we get the class label, bounding box offset and a binary mask to locate the object in the box. Mask RCNN uses the Fully Convolution Network (FCN) to predict the masks for the objects. 6 | 7 | ![Arch_mask_rcnn](https://miro.medium.com/max/700/1*dYb3w2iVxkN7Ifx-eA8ZRg.jpeg) 8 | 9 | Mostly the algorithm is used in Instance Segmentation. 10 | 11 | References: 12 | 13 | 1. https://arxiv.org/pdf/1703.06870.pdf 14 | 2. https://alittlepain833.medium.com/simple-understanding-of-mask-rcnn-134b5b330e95#:~:text=Mask%20RCNN%20is%20a%20deep,two%20stages%20of%20Mask%20RCNN. 15 | 3. https://medium.com/analytics-vidhya/review-mask-r-cnn-instance-segmentation-human-pose-estimation-61080a93bf4#:~:text=Mask%20R%2DCNN%20is%20one,based%20on%20Mask%20R%2DCNN. 16 | 4. https://github.com/matterport/Mask_RCNN 17 | -------------------------------------------------------------------------------- /Applications/Semantic Segmentation/PSPNet and Fast FCN/README.md: -------------------------------------------------------------------------------- 1 | ### Pyramid Scene Parsing Network (PSPNet) 2 | 3 | PSPNet focussed on the idea of using the multiple scaling to extract and capture finer features and information. 4 | 5 | ![Arch_PSP](https://miro.medium.com/max/2608/1*IxUlWP8RBtxNS1N6hyBAxA.png) 6 | 7 | As shown in the figure, the PSPNet primarily focused on extracting the information from the images using a Resnet or Dilated convolution based deep architecture. After obtaining the feature maps, as shown, it uses 4 different average pooling operation with 4 different window sizes and strides. This creates 4 different scaling of the same image. Now, a 1x1 convolution is used on all the 4 scaled images, in order to reduce dimension. 8 | 9 | Finally the Resultant from all the 4 different scaled images are concatenated with the extracted feature maps. Bilinear Transformation is applied on the concatenated result in order to upsample the feature map, and finally 1x1 convolution is applied to obtain the predictions. 10 | 11 | References: 12 | 13 | 1. https://arxiv.org/pdf/1612.01105.pdf 14 | 2. https://towardsdatascience.com/review-pspnet-winner-in-ilsvrc-2016-semantic-segmentation-scene-parsing-e089e5df177d 15 | 16 | 17 | ### Fast FCN 18 | 19 | Fast FCN proposed the replacement of the dilated convolution with **Joint Pyramid Upsampling(JPU).** 20 | 21 | ![Arch](https://miro.medium.com/max/700/1*qIjtbbsVYqRwpTs6-7_i8Q.png) 22 | 23 | The differences between Fast FCN's backbone and DilatedFCN's lies in the last two convolution stages. 
Typically, the input feature map is first processed by a regular convolution layer, followed by a series of dilated convolutions. Fast FCN conceptually processes the input feature map with a strided convolution and then employs several regular convolutions to generate the output. 24 | 25 | ![JPU](https://miro.medium.com/max/700/0*YrT7w4KaIPPUGsgt.png) 26 | 27 | Each feature map is passed through a regular convolutional block. Afterwards, the feature maps are upsampled and concatenated, and the result is passed through four convolutions with different dilation rates. Finally, the convolution results are concatenated again and passed through a final convolution layer. JPU extracts multi-scale context information from multi-level feature maps, which leads to better performance. 28 | 29 | References: 30 | 31 | 1. https://arxiv.org/pdf/1903.11816.pdf 32 | 2. https://flonelin.wordpress.com/2019/11/17/fastfcn/ 33 | 3. https://github.com/wuhuikai/FastFCN 34 | 35 | -------------------------------------------------------------------------------- /Applications/Semantic Segmentation/README.md: -------------------------------------------------------------------------------- 1 | ### Semantic Segmentation 2 | 3 | Semantic segmentation is said to be a dense prediction method, as in this method we make pixel-wise predictions, that is, we predict each pixel to be background or belonging to a foreground object. 4 | 5 | ![example](https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-17-at-7.42.16-PM.png) 6 | 7 | As shown above, we classify the pixels into three classes: Background, Person and Bicycle. One thing to notice is that we mark all the instances of the person and the bicycle the same, whereas in instance segmentation we colour each instance of an object differently. 8 | 9 | ![Segmentation](https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-17-at-9.02.15-PM.png) 10 | 11 | The segmentation labelling process is as shown above: we assign a label to each pixel. So, we basically create a mask of labels of the same size as the image and overlay it on the image to find the pixel-wise labels. 12 | 13 | Now, to do this we could simply use stacked convolutional layers without any dimensional change, but it would be too computationally expensive to preserve the full-resolution image all through. So, several models have been proposed to solve this problem. The most significant of them are: 14 | 15 | 1. Fully Convolutional Network 16 | 2. U-net and its variants 17 | 3. Mask-RCNN 18 | 4. Pyramid Scene Parsing Network (PSPNet) 19 | 5. Fully Convolutional DenseNets 20 | 6. Deeplab family 21 | 7. FastFCN (Fast Fully Convolutional Network) 22 | 23 | ### Popular Metrics 24 | 25 | Apart from the normal classification problem metrics, like Precision, Recall, F1 score and accuracy, we have some other metrics. They are: 26 | 27 | **Specificity:** Specificity is the measure of the accuracy of learning the background. 28 | 29 | Given by: True Negative / (True Negative + False Positive) 30 | 31 | **Sensitivity:** Sensitivity is the measure of the True Positives, or how well the target class has been learned. 32 | 33 | Given by: True Positive / (True Positive + False Negative) 34 | 35 | **Dice Coefficient** = 2 \* True Positive / (2 \* True Positive + False Positive + False Negative) 36 | 37 | **Jaccard Index** = True Positive / (True Positive + False Negative + False Positive)
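As a quick reference, here is a small NumPy sketch computing the Dice coefficient and Jaccard index from a predicted and a ground-truth binary mask, following the definitions above; the epsilon term is only there to avoid division by zero and the sample masks are made up.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """2*TP / (2*TP + FP + FN) for binary masks."""
    tp = np.sum(pred * target)
    return (2 * tp + eps) / (np.sum(pred) + np.sum(target) + eps)

def jaccard_index(pred, target, eps=1e-7):
    """TP / (TP + FP + FN), also known as IoU."""
    tp = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - tp
    return (tp + eps) / (union + eps)

pred   = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]])
target = np.array([[0, 1, 1], [0, 0, 0], [0, 0, 0]])
print(dice_coefficient(pred, target))   # 0.8
print(jaccard_index(pred, target))      # ~0.667
```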
In most segmentation problems, there is a huge imbalance between the foreground and the background classes. As the foreground is very small with respect to the background, the model learns the background better and predicts foreground as background, which increases the False Negatives. So, the above metrics are used to keep a tab on how well the model is learning both the foreground and the background classes. The Dice coefficient and the Jaccard Index are very common as they focus on the foreground only. 40 | 41 | ### Loss Functions 42 | 43 | Normally, for a segmentation problem, we can use "Binary Crossentropy" or "Categorical Crossentropy" depending on the number of foreground classes. If we have one foreground class, we need to classify between the background and the foreground object, so we can use binary crossentropy; in case of multiple foreground classes we need to use categorical crossentropy, just as in a classification problem. 44 | 45 | But, due to the imbalance problem, using the standard crossentropy loss functions proves to give low performance. So, we need to use other weighted loss functions that penalize more when the model mistakenly classifies a foreground pixel as background than when it wrongly classifies a background pixel. 46 | 47 | We have a few choices of custom loss functions: 48 | 49 | 1. Focal Loss 50 | 2. Weighted Crossentropy 51 | 3. Dice-Coefficient Loss 52 | 4. Jaccard Loss 53 | 5. Tversky Loss 54 | 6. Focal Tversky Loss 55 | 56 | Details can be found at: https://medium.com/swlh/handling-highly-imbalanced-datasets-in-convolutional-neural-networks-b71530d34ed 57 | 58 | Focal Tversky Loss is given by (1 - Tversky Index)^Gamma. Gamma controls how much we want to increase the magnitude of the error when a misclassification is made. 59 | 60 | More at 61 | 1. https://towardsdatascience.com/dealing-with-class-imbalanced-image-datasets-1cbd17de76b5 62 | 2. https://cnvrg.io/semantic-segmentation/ 63 | -------------------------------------------------------------------------------- /Applications/Semantic Segmentation/U-Net and Variants/README.md: -------------------------------------------------------------------------------- 1 | ### U-Net 2 | 3 | The idea of U-Net was developed from the Fully Convolutional Network's skip connections. The idea was mainly to make the upsampling limb stronger. The authors proposed a symmetric model structure. 4 | 5 | ![Unt_arch](https://www.jeremyjordan.me/content/images/2018/05/Screen-Shot-2018-05-20-at-1.46.43-PM.png) 6 | 7 | The above image shows the proposed architecture. There exist two symmetric limbs in this architecture. The encoder limb or contracting path extracts the finer features and captures important global information, while the expanding path or decoder limb upsamples the feature map and enables precise localization. 8 | 9 | The encoder limb consists of blocks of two 3x3 convolutions followed by a maxpool layer. There are 4 such blocks. The decoder limb consists of blocks of two 3x3 convolutions and an up-convolution to upsample the feature map by a 2x scale. There are 4 such blocks. Just before the feature maps upsampled from the previous level of the decoder limb are passed through the convolutional layers of the next decoder block, they are concatenated with the feature maps from the same level of the contracting path before downsampling. This helps to carry the localization information from the contraction path to the expansion path.
Finally, a 1x1 convolution is used to produce the prediction.The encoder and the decoder blocks are connected using a bottleneck block having 3 convolutional layers 10 | 11 | Initially, the model used zero padding for the convolutional layers so, the input image size did not match with the output size. To deal with it, the authors proposed an overlap tile strategy involving a mirroring technique and some upsampling and downsampling. Later, it was modified to have padding for the convolutional layers and the input and output resolutions became equal. 12 | 13 | ![over](https://miro.medium.com/max/1114/1*GNB3UkI-hErQwvL-jDLU7A.png) 14 | 15 | The authors also, introduced the idea of data augmentation in order to train the models, as the segmentation data creation takes lots of time and the datasets are not that vast. 16 | 17 | ### Variants: Resnet + Unet 18 | 19 | Soon after the success of the Unet architecture, "The Importance of Skip Connections in Biomedical Image Segmentation" proposed to change the basic convolutional blocks with residual blocks. The paper proposed three types of blocks: A basic block, A simple block, and A bottleneck block. The three blocks make up the redesigned U-Net architecture. The paper says that the short skip connections among the blocks themselves allow for faster convergence when training and allow for deeper models to be trained. 20 | 21 | ![New](https://vitalab.github.io/article/images/resunet/sc01.png) 22 | 23 | References: 24 | 1. https://arxiv.org/pdf/1505.04597.pdf 25 | 2. https://arxiv.org/pdf/1608.04117.pdf 26 | 3. https://towardsdatascience.com/understanding-semantic-segmentation-with-unet-6be4f42d4b47 27 | 4. https://towardsdatascience.com/review-u-net-biomedical-image-segmentation-d02bf06ca760 28 | -------------------------------------------------------------------------------- /Computer vision in Python/OpenCV/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Computer vision in Python/Other Libraries/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Computer vision in Python/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Computer vision in Python/Tensorflow and Keras/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Convolutional Neural Netwoks/Components and Working/README.md: -------------------------------------------------------------------------------- 1 | There are some basic components which are very essential to build a convolutional network. We are going to talk about them. 2 | 3 | ![Schematic](https://www.researchgate.net/publication/336805909/figure/fig1/AS:817888827023360@1572011300751/Schematic-diagram-of-a-basic-convolutional-neural-network-CNN-architecture-26.ppm) 4 | 5 | The above image gives a schematic representation of the parts and functioning of Convolutional Neural Networks. A CNN model consists of mainly two portions: 6 | 7 | 1. A Convolution part to extract the features. 8 | 2. A Fully Connected part to learn the features and deliver the final predictions. 
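To make these two portions concrete, here is a minimal sketch in tf.keras (not taken from this repository's notebooks; the 28x28 grayscale input and the 10 output classes are arbitrary example choices) of a convolutional feature-extraction part followed by a fully connected classification part:

```python
# Minimal illustrative CNN: a convolutional part that extracts feature maps,
# followed by a fully connected part that learns from the flattened features.
# Input shape and class count are example assumptions (e.g. MNIST-like data).
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    # --- Convolution part: feature extraction ---
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # --- Fully connected part: learning and prediction ---
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Calling model.summary() shows how the convolution and pooling stack shrinks the spatial dimensions before the Flatten layer hands a 1-D vector to the dense classifier.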
9 | 10 | The Fully connected layer units are different from the Convoluional layer units as they are dense artificial neurons to learn from 1-dimensional flattened features. 11 | 12 | For several problems like segmentation problems, we remove the Fully connected classification layers from the CNN architecture and replace it using 1x1 convolutional layer to coupled with a sigmoid or a softmax activation to provide the output on a pixel wise basis. 13 | 14 | ### 1. Convolutional layers: 15 | As we have seen in the filter portion of the repository, filters are used to extract the features out of images. We have also seen, that different filters obtain different response from an image, thus highlighting different extracted features. Now, for different functions we need to extract different set of features, so we need different set of filters. Its practically impossible to select and tune the filter wieghts manually, so CNNs automates the tuning of the filter weights based on problem in hand and the dataset. 16 | 17 | So, in case of CNNs we initialize filters or kernels with random weights of size as decided by the designer, the filters extract features, which is used for predictions and the error in prediction is calculated, which backpropagates, The backpropagation is used to update the filter weights accordingly. 18 | 19 | ![Diag](https://miro.medium.com/max/685/0*awD7_-Oxmz2O_0bD) 20 | 21 | The diagram above showa the schematic flow of gradients in the layers. 22 | 23 | Till now we have used the terms "kernel" and "filter" for the same denotion, but at this stage the meanings are a bit different. Kernels are the weighted matrix like strutures that are convoluted with images. A group of kernel is called a filter. For convolutional layers,for each layer, a number of filter is passed to be convoluted with the image. The number and the dimension of the kernels building the filter at a certain layer is input from the user. Each of the kernels are different and they extract different feature maps (nD features) from an image. The dimension of all the feature maps are same as the dimension of the kernels are same. So, we stack the different feature maps obtained vertically, to create the actual output. 24 | 25 | So, for a particular image, if the dimension of the obtained feature map is m x m and we got n kernels for that layer. The output becomes m x m x n. So, the spatial dimension of an image is m x m, and the third dimension n, we call it **channels.** The image is now said to have n channels. 26 | 27 | Now, for layer 2, where the 3d feature map of size m x m x n comes in, it already has n channels, which actually is n 2d feature maps stacked together. In such cases, 1 kernel convolutes with all the channels of the input 3d feature map, which are 2d each. So the new 2d feature maps obtained as a result of the convolution for all the channels of the input map is of same dimension, and the new 2d feature maps obtained for that particular kernel is added up pixelwise to obtain that one final single 2d feature map. Similarly for every kernel in the filter in layer 2, we get a 2d feature map. So, say for p kernels,we get p 2d feature maps, which we stack up to create a 3d output feature map. 28 | 29 | Now, the size of the feature maps, depend on the size and movement of kernels over the images. The mentioned factors are dependent on some user dependent paramaters, like: padding, stride, and size of the kernel. 30 | 31 | The stride defines how the kernel moves. 
With a stride of 1x1, the kernel moves one pixel at a time along the x and y axes. The pixels on the edges of an image are often not well covered by the kernels; to handle this we add a padding of zero-valued pixels around the image. If we set padding="same", the spatial size of the output remains the same as that of the input after the convolution. 32 | 33 | ![convolution](https://miro.medium.com/max/526/0*Ttm3UqGvuWrPoqd2) 34 | 35 | ### 2. Pooling Layers: 36 | 37 | They are essential parts of a convolutional neural network. The feature maps generated by convolutional layers are sometimes very large and expensive to process further. Pooling layers are used to reduce the dimension of, or downsample, a convolved feature map. A pooling layer summarises the features present in a region of the feature map generated by a convolution layer. This makes the model more robust to variations in the position of the features in the input image. 38 | 39 | There are 2 main pooling layers used in CNNs: 40 | 41 | **1. Max-pooling:** Max pooling selects the maximum element from the region of the feature map covered by the filter. Thus, the output of a max-pooling layer is a feature map containing the most prominent features of the previous feature map. 42 | 43 | **2. Average-pooling:** Average pooling computes the average of the elements present in the region of the feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in that patch. 44 | 45 | Here too, the window size and stride are chosen by the designer. 46 | 47 | ![pooling](https://miro.medium.com/max/700/0*CKZtMRLvJtP1HTs7) 48 | 49 | ### 3. Batch Normalization Layers: 50 | 51 | Batch normalization layers are used to speed up training so that the network converges in far fewer epochs. In deep neural networks, learning is done in mini-batches, and the distribution of activations across mini-batches varies and is generally not normal. This shifting distribution (referred to as internal covariate shift) can cause bad weight updates: the weights may start to oscillate and fail to reach the optimum. Batch normalization is therefore applied to the activations of the previous layer to normalize their distribution to mean 0 and standard deviation 1, reducing the chance and effect of such bad updates. 52 | 53 | ### 4. Dropout layers: 54 | 55 | These are simple layers that randomly switch off some of the neurons, preventing their activations from propagating during the forward pass. This is a regularization method used to prevent overfitting. 56 | 57 | ### 5. Flatten layers: 58 | 59 | These layers are used to flatten the 2D or 3D feature maps into 1D vectors that can be learned by the dense layers. 60 | 61 | ### 6. Dense layers: 62 | 63 | These layers are artificial neural network nodes, which learn from the obtained features. 64 | 65 | ### 7. Activation Layers: 66 | 67 | Activation layers are the units of non-linearity in convolutional networks; stacked between the linear layers, they allow the network to learn complex, non-linear functions. The most used activation function in CNNs is ReLU. 68 | 69 | ### 8. Back propagation in CNN: 70 | 71 | 1. https://medium.com/@pavisj/convolutions-and-backpropagations-46026a8f5d2c 72 | 2. https://towardsdatascience.com/backpropagation-in-a-convolutional-layer-24c8d64d8509 73 | 74 | References: 75 | 76 | 1.
https://medium.com/swlh/computer-vision-and-role-of-convolutional-neural-networks-explanations-and-working-8791b18b8c85 77 | 2. https://cs231n.github.io/convolutional-networks/ 78 | 3. https://medium.com/technologymadeeasy/the-best-explanation-of-convolutional-neural-networks-on-the-internet-fbb8b1ad5df8 79 | 4. https://towardsdatascience.com/batch-normalisation-in-deep-neural-network-ce65dd9e8dbf 80 | -------------------------------------------------------------------------------- /Convolutional Neural Netwoks/README.md: -------------------------------------------------------------------------------- 1 | ## Convlutional Neural Networks 2 | 3 | There have been many approaches to solve Computer Vision problems. The most popular approach is the application of deep learning architecture that is the use of Convolutional Neural Networks. 4 | 5 | ### Basic Idea 6 | 7 | Most of the fundamental operations in this field are dependent on the answer to a simple question “HOW SIMILAR ARE THE IMAGES?”. Say, for example, object detection or classification, an image set is given to a computer using which the computer has to label or detect a new image. This task seems pretty easy for humans because we can see. The computer answers this question by comparing how much the pattern of one image matches the other image pattern. By pattern, we mean the pattern in the array structure of the two images. 8 | 9 | ![CNN1](https://miro.medium.com/max/700/0*alvTaRr1j-IOnKGG) 10 | 11 | Here we can see that the X is rotated and as a result its array substructure changes. It becomes hard for the computer to get a 100% match, so it just compares and checks small patterns to decide if the two images match. 12 | 13 | ![CNN2](https://miro.medium.com/max/700/0*UJ-jwmPEKbVIFUZQ) 14 | 15 | We can see the patterns in the boxes match each other. So the computer can also judge these similarities and based on these it decides the answer to the basic question. These patterns are extracted by the computer as features of an image. Later, the computer tries to find these features in a new image it receives. If the same features are found in the new images it is classified as the target class else not, or if a part of the new image carries similar features as experienced in the training set, it is detected as an object. This is the basic idea behind convlutional Neural Networks. 16 | 17 | ### Convolution 18 | 19 | Convolution is a mathematical operation on two functions (f and g) that produces a third function expressing how the shape of one is modified by others. A second function g(t) is passed over a function f(t). The product of the two functions at a point t is found and the values are integrated to devise the third function. It is given by: 20 | 21 | ![formula](https://miro.medium.com/max/700/0*T_LEg6WuJ1t5AuXR) 22 | 23 | Reference: https://en.wikipedia.org/wiki/Convolution#:~:text=In%20mathematics%20(in%20particular%2C%20functional,the%20process%20of%20computing%20it. 24 | 25 | 26 | ![Structure](https://miro.medium.com/max/633/0*eRW5zbWkQM29Z6g7) 27 | 28 | The above image describes the basic structure of a convolutional neural networks. It constitutes of an input layer and an output layer and a number of hidden layers, which is dependent on the required complexity of the network according to the designer, in order to find and learn features from image data. 
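As a concrete illustration of the convolution operation described above, the short NumPy sketch below (illustrative code, not part of the original notebooks) slides a 3x3 kernel over a small 2-D array and prints the size of the resulting feature map:

```python
# "Valid" 2-D convolution with NumPy: a 3x3 kernel slides over an HxW image
# and produces a feature map of size (H - k + 1) x (W - k + 1).
import numpy as np

def convolve2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    k = kernel.shape[0]
    h, w = image.shape
    # Flip the kernel for a true convolution; CNN libraries usually skip the
    # flip and compute cross-correlation, which behaves the same once the
    # kernel weights are learned.
    kernel = np.flipud(np.fliplr(kernel))
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.random.rand(6, 6)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
print(convolve2d_valid(image, sobel_x).shape)  # (4, 4)
```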
29 | -------------------------------------------------------------------------------- /Convolutional Neural Netwoks/Types of Convolutions/README.md: -------------------------------------------------------------------------------- 1 | ### Types of Convolutions 2 | 3 | As we have already seen, convolution layers are the chief and most important functional layers in the Convolutional Neural Networks. There are a number of different types of convolutional layers, with varying funtionalities. 4 | 5 | Based on dimensionality of applications, the convolutional layers, are divided into 2 types: 6 | 7 | **1. 2-D Convolutional layers:** These filters work on 2D input images and extract the required feature map. The size of the feature maps to be extracted is given by: 8 | 9 | ![Normal](https://miro.medium.com/max/435/1*1okwhewf5KCtIPaFib4XaA.gif) 10 | 11 | Feature Map Size= ((Image Size - Filter size + 2* Padding) / Stride) + 1 12 | 13 | The functionalities are performed on 2D units of the Images, termed as Pixels. 14 | 15 | We can preserve the size of output feature image same as the input image, using the padding parameter, padding ="same", and keeping the stride=1. 16 | 17 | We can also reduce the size of the ouput feature image compared to the input image, using the padding="valid" and stride=2. This is called **Strided Convolutions**, and is used in place of Pooling layers, to reduce the number of parameters and the size of the feature maps. 18 | 19 | **2. 3-D Convolutional layers:** These filters work on 3D input volumes and extract the required feature map. The size of the feature maps to be extracted is given by: 20 | 21 | The functionalities are performed on 3D units of the Images, termed as Voxels. 22 | 23 | Now, convolutions have been modified to obtain several versions from the normal versions, as discussed previously. 24 | 25 | ### Dilated Convolutions (Atrous Convolutions) 26 | 27 | ![Dilated](https://miro.medium.com/max/435/1*SVkgHoFoiMZkjy54zM_SUw.gif) 28 | 29 | Dilated convolutions were created, to obtain a larger size receptive field and at the same time, restricting the number of parameters. For example, in the above diagram, the kernel filter has 9 parameters, as of 3x3 kernel, but it's receptive field is same as a 5x5 kernel. It helps to obtain a broader field of vision. The above one is a 3x3 kernel, with a **dilation rate** of 2. The dilation rate is the parameter that controls how big is the kernel. For dilation rate n , n-1 zero rows and columns are inserted alternatively, after a wieght in a kernel. 30 | 31 | In dilated convolutions, also, similar to normal convolutionsm we can preserve the size of the output feature maps using padding= dilation rate. In several cases, dilated convolutions have been used to replace pooling layers, by using them with stride=2. 32 | 33 | ### Transposed Convolutions 34 | 35 | In cases like segmentation, we need to upsample from the final output feature maps. In such cases, we have used upsampling techniques like bilinear interpolation methods. But in order to upsample from so small feature maps,upsampling results in aliasing effects, and also causes loss in localized information. To deal with such problems, a learned convolution has been proposed. The Transposed convolutions or Deconvolutions, return the upsample image from a feature map. 36 | 37 | To achieve this transposed convolutions use a designed padding pattern and a normal convolutional kernel as shown here. 
38 | 39 | ![Transpose](https://miro.medium.com/max/435/1*Lpn4nag_KRMfGkx1k6bV-g.gif) 40 | 41 | -------------------------------------------------------------------------------- /Fundamentals/Data Augmentation/README.md: -------------------------------------------------------------------------------- 1 | ### Data Augmentation 2 | 3 | Often in Computer vision problems, we need huge datasets to train deep learning models. But in some cases, mostly in medical imaging fields, the datasets are very hard to obtain owing to the amount of manual labour required to obtain the datasets. But on the other hand, we can not train our networks again and again based on the same data, as it would highly overfit the model. So, we need **data augmentation** to deal with the above problems. 4 | 5 | Data Augmentation is a method that let's us transform the already provided images in our dataset and create new image sets. The processes makes several changes randomly, to an image to create a new image that the model has not seen before. 6 | 7 | The common transformations used in case of Data Augmentation are: 8 | 9 | 1. **Mirroring:** We can just invert the whole image based on one given axis to obtain the new image. The inversion can be horizontal inversion or may be it can be vertical inversion. 10 | 2. **Random Cropping:** It crops portions of the image randmly, to create the new image. This is not a very good method, as it can omit important portion of the image which can cause loss in information. 11 | 3. **Rotation:** This transfomation rotates the image by a random amount within a certain range specified by the designer. 12 | 4. **Shearing:** This type of transformation provide shear to the image by a random amount within a certain limit. 13 | 5. **Colour Shifting:** We add some values to the values of the RGB channels to change the hue. It changes the colour distribution of the image. 14 | 15 | We use a combination of these operations to create a new image for training of the models. 16 | 17 | Resources: 18 | 1. https://www.analyticsvidhya.com/blog/2020/08/image-augmentation-on-the-fly-using-keras-imagedatagenerator/ 19 | 2. https://www.pyimagesearch.com/2019/07/08/keras-imagedatagenerator-and-data-augmentation/ 20 | 21 | Library and documentation: 22 | 23 | https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator 24 | -------------------------------------------------------------------------------- /Fundamentals/Digital Image/README.md: -------------------------------------------------------------------------------- 1 | ### Digital Image 2 | 3 | Before we start learning about Image Processing, we need to learn about the fundamentals of digital imaging, their structures and operations. 4 | 5 | **Digital Image is the presentation of an image in the form it can be stores in a computer.** 6 | 7 | Normally, Image is 2 dimensional spatial signal, which is continous in nature, something as being shown by the below diagram. 8 | 9 | ![Image_2D](https://media.springernature.com/lw785/springer-static/image/chp%3A10.1007%2F978-981-10-2537-2_3/MediaObjects/428094_1_En_3_Fig7_HTML.gif) 10 | 11 | It is impossible to store such a continous 2D signal so, we adapt a few techniques. 12 | 13 | ### 1. 
Spatial Discretization 14 | 15 | ![Discretization](https://www.researchgate.net/profile/Sean-Wood-2/publication/44126806/figure/fig2/AS:452113244397568@1484803605822/Conceptual-depiction-of-discretization-in-time-and-amplitude-a-source-signal.png) 16 | 17 | As shown in the above figure, (a), we hava a continous signal, which we can't store. So, we sample the continous signal at some points as shown in (b), to obtain some discrete velues, such that for a particular (x) we have the value of the function f(x) represented by the curve. (c) gives the signal which can be reconstructed from the sampled signals. But for the reconstruction to be possible the sampling frequency must be >= 2 the bandwidth of the original signal. This is called **Nyquist Sampling Theory**. The method is called discretization. 18 | 19 | We saw the above example on a 1D data, but for storing an image we perform this operation on a 2D spatial data, and obtain discrete values all over the image plane (x,y), which represents the value of the function f(x,y) that represents that particular image signal. The values depict the amplitude of the signal at that point (x,y). The points of sampling all over the plane of the image are known as **Pixel Points** 20 | 21 | This whole process is called spatial discretization. 22 | 23 | ### 2. Controlling Resolution: 24 | 25 | Resolution is the measure of the number of pixel values, more the number of pixel points, more is the resolution. We have discussed in the spatial discretization portion, we sample the continous signal all along the 2D space, Now, if look very closely at (c) figure, we will see, that the signal values between the two sampling points are constants, and so looks a bit blockish. 26 | 27 | For 1D signal, we just needed to drop some 1D lines on the signals, to discretize it. For 2D image signals we distribute the image into small grids, at the centre of the grid is the sampling point, and the whole grid represent the pixels. 28 | 29 | ![Resolution](https://www.scientiamobile.com/wp-content/uploads/2018/12/Pixel-Density.jpg) 30 | 31 | As it can be seen in the above diagram, the same image has been discretized in two different manner. In the 1st image pixel density (unit: Pixel per inch) is much lower than the 2nd one. The amplitude recorded at the value of sampling spans over the full representing 2D pixel grid. As a result, the second image captures much more distinct and large number of pixel values than the first image. The first image provides much more of a blocky effect due to the reason we saw in the 1D case, and has a lower resolution and hence a lower clarity. 32 | 33 | ### 3. Quantization of Pixel Values: 34 | 35 | As we have already seen, image to be a constant 2D signal, it ranges over all Natural Numbers including decimal values. If we have to store decimal values, we need a large number of bits to save the amplitude of even one of the sampling points. For example, if the value of the signal is 78.93 at one point, it will take up a high number of bits, So, we use the concept of quantization. 36 | 37 | ![Quantization](https://www.tutorialspoint.com/digital_communication/images/quantization.jpg) 38 | 39 | The above image describes quantization of image signal, as we can see any signal value in the range of (a,b), is replaced by value a or value b depending on quantization policy. Now, in this image the axes represents number from 0-16 in binary. 
In image cases we create divisions from 0 to 255, total of 256 levels of quantizations, So, if we get a pixel value of 78.93 it is mapped as 78, which can be represented as a 8 bit number. So, for representing any pixel value we need 8 bits. The error generated due to quantization is called **quantization error**. More the number of levels, less is the quantization error, more is the memory required. It has been observed that, the quantization error effect is not significant if we use 256 levels of quantization. So, it has been adapted as standard,and the highest value of any pixel is 255. 40 | 41 | ### 4. Channels of an Image 42 | 43 | As we have seen in the Quantization portion, the pixel value varies from 0-255, and value represents the intensity of the pixel. If we consider a colour say white, 0-255 each value gives a different intensity of white, 0 being black and 255 being white, so we get a complete shade palette. Again, if we see another colour say red, each value from 0-255 gives a different shade of red. As we know, a every colour is a combination of some primary colours in case of images also, this theory holds. 44 | 45 | ![Channel](https://miro.medium.com/max/311/1*uMda2KGCQFb6sZm8Cli7tA.png) 46 | 47 | If you see the upper image, the first colour image is created by the super position of the other 3 images in primary colours Red, Blue and Green, pixel by pixel. So, the image is said to have 3 channels: Red, Blue and Green (RGB). These 3 channels are also present in diigital cameras. Grayscale images have only 1 channel, while RGB images of 3 channels. So, for saving an RGB image, we need 24 bits for each pixel, 8 bit per pixel for a channel each. 48 | 49 | 50 | 51 | 52 | -------------------------------------------------------------------------------- /Fundamentals/Edge Detectors/README.md: -------------------------------------------------------------------------------- 1 | ### Edge Detection 2 | 3 | Edge detection is one of the most significant tasks. The technique is used in image processing for tasks like Pattern Recognition and Feature extraction. 4 | 5 | Now, the idea behind how to detect edges is owing to the fact that the edges have much higher intensity value for the pixels lying on it. So, if we move across an edge, the intensity increases as we approach the edge and then it decreases as we move away from it. So, we use this particular feature foe edge detection. 6 | 7 | The best way of finding or detecting a continous change in pixel values as we move across the image is to focus on the gradient of the pixel values, as we move along the x or the y axis of the image plane. Some of the detectoes use first order derivatives for the detecting the change in gradient while others use second order derivatives for the reason. 8 | 9 | Some of the most used filters for Edge Detection are: 10 | 11 | 1. Prewitt Edge Detector 12 | 2. Sobel Edge Detector 13 | 3. Laplacian Edge Detector 14 | 4. Robert Edge Detector 15 | 5. LoG or Marr Hildreth Edge Detector 16 | 6. Canny Edge Detector. 17 | 18 | We have talked about Sobel and Laplacian in the Filters portion. 19 | 20 | **1. Prewitt Edge Detector:** The prewitt edge detectors are a couple of masks, one for detecting horizontal edges, other for vertical edges, obtained using the first order derivative along the Y and X axes repectively. 21 | 22 | ![Prewitt](https://miro.medium.com/max/541/1*q4Q2T52GyKi0tiF8fOUS1Q.png) 23 | 24 | **2. 
LoG or Marr Hildreth Edge Detector:** Laplacian operator works on double derivative of the flow of intensity of pixels as we move along the axes of the image plane. So, laplacian operators are very sensitive to sharp changes in intensity of the images. Owing to the above facts, laplacian operators are very sensitive to noise as well. So, we need to eliminate the noise using a noise masking operator. For this case, we choose a Gaussian Kernel. So, first we use the gaussian kernel to remove the noise, then we use the laplacian kernel to detect the edges. Using Function of function we obtain the LoG or Laplacian of Guassian filter, which we convolve with the image. 25 | It is given by: 26 | ![form1](https://i0.wp.com/theailearner.com/wp-content/uploads/2019/05/log1.png?resize=211%2C49&ssl=1) 27 | ![form2](https://i2.wp.com/theailearner.com/wp-content/uploads/2019/05/Lapeq.png?w=610&ssl=1) 28 | 29 | **3. Canny Edge Detector:** The canny edge detector is the most used edge detector of current times. The Canny edge detector works in four different steps: 30 | 1. Noise Removal: Like LoG, Canny Edge detector also uses a guassian filter mask to remove the noise from the image. 31 | 2. Gradient Computation: It calculates the gradient or the rate of change of intensity of pixels moving along the axes of the image. 32 | 3. Extract edges using Non-max supressions: Canny edge detector slides a kernel of size 3x3 over the image plane after calculating the gradients and tries to locate the local maxima, which are the edges detected 33 | 4. Hysteresis thresholding: Canny Edge detectors use a two threshold system, with an upper and a lower threshold, which just define a range of intensity. Its often seen that, noises are still not removed. So, canny detectors allows to set an intensity range within which the actual edge intensities should lie. 34 | 35 | Refernces: 36 | 1. cse.usf.edu/~r1k/MachineVisionBook/MachineVision.files/MachineVision_Chapter5.pdf 37 | 2. https://medium.com/analytics-vidhya/a-beginners-guide-to-computer-vision-part-2-edge-detection-4f10777d5483 38 | 3. https://theailearner.com/2019/05/25/laplacian-of-gaussian-log/ 39 | 4. https://ai.plainenglish.io/a-dive-into-canny-edge-detection-using-opencv-python-76326c2fa19c 40 | 41 | -------------------------------------------------------------------------------- /Fundamentals/Filters/README.md: -------------------------------------------------------------------------------- 1 | ### Filters: 2 | 3 | Filters and Kernals are basically image modifiers which act on a Pixel of Interest based on the Neighbouring Pixels. Please refer to Operations portion to have a clear idea about Neighbourhood. 4 | 5 | ![filters](https://www.researchgate.net/profile/Nilesh-Bahadure/publication/286442762/figure/fig10/AS:670370642268178@1536840224859/The-Mechanics-of-Spatial-Filtering.ppm) 6 | 7 | As shown in the figure above, the mask is called the kernel or filter. The filter shown is a 3x3 filter. Similarly, we can have 2x2, 5x5 and 7x7 filters, depending upon requirements of the neighbourhood. The (0,0) in the mask is coincides with the pixel of interest (x,y) of the image, f(x,y). Similarly the whole mask or filter just coincides with the pixels of the image, and gives a response according to the filter type. Now, the w(a,b) defines the weight of the filter at that particular index of the filter. The weight of the filter indices are dependent on the type of filter they are and the type of tasks they perform. 
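As a quick illustration of how such a weighted mask is applied in practice, the hedged sketch below uses OpenCV's cv2.filter2D ("image.jpg" is a placeholder path, and the two kernels are just common examples, not ones prescribed by this text) to compute exactly the kind of weighted-sum response described next:

```python
# Applying hand-built 3x3 kernels to an image with cv2.filter2D.
# "image.jpg" is a placeholder path; replace it with an actual file.
import cv2
import numpy as np

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)  # returns None if the path is wrong

mean_blur = np.ones((3, 3), np.float32) / 9.0         # averaging (smoothing) kernel
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=np.float32)  # simple sharpening kernel

blurred = cv2.filter2D(img, -1, mean_blur)    # -1 keeps the input depth
sharpened = cv2.filter2D(img, -1, sharpen)

cv2.imwrite("blurred.jpg", blurred)
cv2.imwrite("sharpened.jpg", sharpened)
```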
8 | 9 | Each filter has a response, which becomes the updated value of the pixel of interest at the center. The response is given by: 10 | 11 | **Response= f(x-1,y-1).w(-1,-1) + f(x-1,y).w(-1,0) + f(x-1,y+1).w(-1,1) + f(x,y-1).w(0,-1) + f(x,y).w(0,0) + f(x,y+1).w(0,1) + f(x+1,y-1).w(1,-1) + f(x+1,y).w(1,0) + f(x+1,y+1).w(1,1)** 12 | 13 | The filter slides over the entire image, and how the masks slides is controlled by a measure called **Stride**. 14 | 15 | Filters are normally used for image smoothing, sharpenining and edge enhancements. Filters can be present in both spatial and frequency domains. We are going to talk about the spatial domain filtering. 16 | 17 | ### 1. Smoothening or Blurring Filters: 18 | 19 | They smoothen out or blur the image upto some extent to reduce Noise. There are Linear and Non-Linear Smoothening operators. 20 | 21 | ##### Smoothening Linear Filters: 22 | 23 | **Mean Blur Kernel:** The response of this filter is just the mean value of the pixels in the neighbourhood of the pixel of interest, covered by the kernel. 24 | 25 | ![Mean](https://3.bp.blogspot.com/-z1EIsBG_8hA/U6KlbsJoBMI/AAAAAAAABdQ/jgThabZzWDM/s1600/avg.PNG) 26 | 27 | **Weighted Average Kernel:** The response of this kernel is similar to the Mean Blur kernel except the fact that this kernel takes the wieghted average of the neighbours. 28 | 29 | ![Weighted](https://miro.medium.com/max/432/1*rnHldrVN7DAqGX6fe_2IGQ.png) 30 | 31 | **Guassian Average Kernel:** Gaussian kernel is also another weighted mean blur kernel, whose weights follow the bell shaped curve Gaussian distribution. 32 | 33 | ![Gaussian](https://miro.medium.com/max/526/1*CHbVU4ykR7yWS3nd2ayrwg.png) 34 | 35 | The kernel distributions and weights can be varied using the mean ad the standard deviation according to requirements as shown in the implementations. 36 | 37 | ##### Smoothening Non Linear Filters: 38 | 39 | The non-linear filters mostly consists of three types of filters, the MIN, MAX and the MEDIAN filters, which are used for smootheining. Accordingly, as their name suggests, the filters return the Maximum, Minimum and Median values of the pixel values covered by the filters over the image. 40 | 41 | 42 | ### 2. Sharpening Filters: 43 | 44 | Such filters are used to increase the internsity of the edges and detect them. 45 | 46 | **Laplacian Filter**: THese filters are created usese the double derivative on the increasing or decreasing flow of the pixel values of the image in both vertical and horizontal directions. There are two types of laplacian operators: 47 | 48 | 1. Without Considering Diagonals: Using 4-Neighbours. 49 | 2. Considering Diagonals: Using 8-Neighbours. 50 | 51 | ![Laplacian](https://www.researchgate.net/profile/Siti-Yasiran/publication/261459927/figure/fig1/AS:650032764170264@1531991296526/Two-commonly-used-discrete-approximations-to-the-Laplacian-filter.png) 52 | 53 | **Sobel Filters** They are used for outlining edges, with the help of gradient of the movement of pixel values in an image to enhance edges. There are two sobel operators: For Vertical and for Horizontal Edge detections. 
54 | 55 | ![Sobel](https://miro.medium.com/max/2400/1*djX8CkhUJZYu9ndKyz5ApA.jpeg) 56 | 57 | 58 | 59 | -------------------------------------------------------------------------------- /Fundamentals/Noisy image handling/README.md: -------------------------------------------------------------------------------- 1 | ### Noise in image processing: 2 | 3 | Before startiing to learn about Noise Filtering in image processing, we must take a look at the noises that exist in images and what noise actually is. 4 | 5 | Suppose I(x,y) denotes the intensity of the pixel (x,y) in an image.The Intensity is made of two components: value of the image signal F(x,y) at the pixel concerned and the value of noise at the pixel N(x,y). 6 | 7 | ##### I(x,y) = F(x,y) + N(x,y) 8 | 9 | Noise is a random variation of brightness or colour information in images. It obsecures the desired value of the image signal function and thus the information. 10 | 11 | There are two types of noise models: 12 | 13 | 1. Additive noise 14 | 2. Multiplicative noise. 15 | 16 | Other than this there are basic 4 types of noise which are very common in images. 17 | 18 | **1. Gaussian Noise:** The probability Density Function of this type of noise follows the Normal distribution or guassian distribution with mean *nu* and standard deviaton *sigma* 19 | 20 | **2. Salt and Pepper Noise:** This type of noise has two values for each pixel: 0(black->pepper) and 255(white->salt) for 8-bit images. They cause sharp and sudden disturbance in the image signal. 21 | 22 | **3. Poisson's Noise:** The probability Density Function of this type of noise follows the Poisson distribution 23 | 24 | **4. Speckle Noise:** It exists naturally and can be created by randomly multiplying random pixel values with different pixels of an image. 25 | 26 | ### Image Denoising: 27 | 28 | There are several ways to remove noise from images, some of them are spatial domain methods, and some of them are in the frequency domain, which need to be deployed after conducting a Fast Fourier Treansform, to convert the spatial domain into frequency domain. Some Noises best suite some specific denoising methods. 29 | 30 | Some basic techniques of Image Denoising are: 31 | 32 | 1. Mean filter, Weighted Averaging filters 33 | 2. Median Filter 34 | 3. Gaussian Filter 35 | 4. Weiner Filter. 36 | 37 | We discussed first three while discussing Filters. Now we move to Weiner Filter. 38 | 39 | ##### Weiner Filter. 40 | 41 | This filter tries to minimiza the mean squared error (MSE) between the noisy image and the original image signal. 42 | 43 | The equation of Weiner filter obtained while reducing MSE error is - 44 | 45 | ![Weiner](https://iq.opengenus.org/content/images/2019/07/Untitled-Diagram--14-.png) 46 | 47 | Where, 48 | 49 | ||H(x, y)||^2 = H(x, y)H*(x, y); 50 | H(x, y) = degradation function; 51 | H*(x, y) = complex conjugate of degradation function; 52 | Sn(x, y) = ||N(x, y)||^2 i.e. Power spectrum of Noise; 53 | Sf(x, y) = ||F(x, y)||^2 i.e. Power spectrum of undegraded(original) image; 54 | 55 | References: 56 | 1. https://medium.com/image-vision/noise-in-digital-image-processing-55357c9fab71#:~:text=There%20are%20three%20types%20of,value)%20all%20over%20the%20image. 57 | 2. https://medium.com/image-vision/noise-filtering-in-digital-image-processing-d12b5266847c 58 | 3. https://pub.towardsai.net/image-de-noising-using-deep-learning-1a8334c81f06 59 | 4. https://towardsdatascience.com/introduction-to-image-denoising-3e269f176483 60 | 5. 
https://iq.opengenus.org/image-denoising-and-image-processing-techniques/ 61 | 6. https://www.ijcaonline.org/volume9/number4/pxc3871846.pdf 62 | 63 | 64 | 65 | 66 | 67 | -------------------------------------------------------------------------------- /Fundamentals/Operations/Global Operations/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Fundamentals/Operations/Local Operations/README.md: -------------------------------------------------------------------------------- 1 | ### Local Operations: 2 | 3 | Local operations in image processing are dealt with using filters. Please refer to the Filters section for more details. 4 | -------------------------------------------------------------------------------- /Fundamentals/Operations/Pointwise Operations/README.md: -------------------------------------------------------------------------------- 1 | ## Pointwise operations 2 | 3 | For the application of these operators, we will denote the amplitude value of a pixel as X. The operations are pointwise, so the transformations are applied pixel by pixel. A transformation can be denoted as 4 | 5 | ##### Y = T(X) 6 | 7 | where X is the original pixel value and Y is the new pixel value after transformation for the same pixel. 8 | 9 | ### 1. Inversion 10 | 11 | This operator converts the image into its negative or inverse image, as shown in the image_transformation_1.ipynb file. The transformation is: 12 | 13 | **Y = (L - 1) - X**, where L is the number of intensity levels (for an 8-bit image L = 256, so Y = 255 - X) 14 | 15 | It is a completely linear transformation. 16 | 17 | ### 2. Image Thresholding 18 | 19 | The operator makes all the pixel values above a particular threshold equal to the maximum value and all the pixel values below that threshold equal to the minimum value. Its working is shown in the image_transformation_2.ipynb file. 20 | 21 | ### 3. Contrast Stretching: 22 | 23 | It is an image enhancement technique. There are several types of contrast stretching operations: 24 | 25 | ##### MinMax Contrast Stretching: 26 | 27 | It is achieved by the equation: 28 | 29 | **P_new(i,j) = ( (P_old(i,j) - min(P_old)) / (max(P_old) - min(P_old)) ) * 255** 30 | 31 | where P_old is the given image, and min(P_old) and max(P_old) are its lowest and highest pixel values. 32 | 33 | ##### Percentile Stretching 34 | 35 | ### 4. Grayscale slicing: 36 | 37 | The operator makes all the pixel values above a particular threshold equal to the maximum value, while all other pixel values remain intact. Its working is shown in the image_transformation_3.ipynb file. 38 | 39 | ### 5. Logarithmic Stretching: 40 | 41 | This enhancement technique is used a lot with FFTs. It is given by: 42 | 43 | **Y = C.log(1 + X)**, where C is a constant 44 | 45 | ### 6. Power-law transform: 46 | 47 | This enhancement technique is given by: 48 | 49 | **Y = C.(X + *lambda*)^*gamma*** 50 | 51 | Here gamma is the exponent, whose value gives rise to a number of variants (gamma correction), and C is a constant. 52 | 53 | ### 7. Histogram Equalization: 54 | 55 | It is a simple technique to improve the contrast of an image and enhance it. It is often seen that pixel values get confined to a particular range; for example, in bright images the pixel values are mostly confined to the higher-value region. This results in a low-contrast image.
56 | 57 | ![Equalization](https://repository-images.githubusercontent.com/196091044/195dce80-a2b1-11e9-8028-f8688944258f) 58 | 59 | Histogram equilization spreads the pixel values over the range as shown in the above diagram, increasing the image contrast, by creating an equal probable distribution for all the pixel values to occur. The process can be achieved using opencv, easily as shown in the fourth module pynb file. 60 | 61 | 62 | 63 | 64 | -------------------------------------------------------------------------------- /Fundamentals/Operations/README.md: -------------------------------------------------------------------------------- 1 | ### 1. Neighbourhood of an Image Pixel. 2 | 3 | In order to have a fundamental knowledge about operations of image processing we must have the knowledge of neighbourhood of a pixel. 4 | 5 | ![Neighbourhood](https://www.researchgate.net/profile/Ana-Siravenha/publication/221317837/figure/fig1/AS:305627109838848@1449878589041/N-aN-pixels-neighborhooda-4-neighborhood-b-8-neighborhood-9.png) 6 | 7 | The above images show the two types of neighbourhood there exists. The centre pixel is the **Pixel of Interest (x,y)** in both cases. The first image (a) of the above figures shows **N-Four Neighbours**, and the second image (b) of the figures show **N-Eight Neighbours** respectvely. It also shows how they are referred with respect to the pixel of interests as (x+n,y+n), n being 0 or 1 based on the adjacency of the neighbour from the POI 8 | 9 | 10 | ### 2. Operations On an Image 11 | 12 | There are several image transformations which take in an image as a 2D function of (x,y), say input image as, F1(x,y) and produce output image signal as function F2(x,y). Overall these transform operations can be divided into three basic types: 13 | 14 | **1. Pointwise Operations:** The output value at a specific pixel is dependent only on the input value at that same pixel. 15 | 16 | **2. Local Operations:** The output value at a specific pixel is dependent on the input values in the neighborhood of that same pixel. 17 | 18 | **3. Global Operations:** The output value at a specific coordinate is dependent on all the values in the input image.. 19 | -------------------------------------------------------------------------------- /Generative Networks/Generative Adversarial Networks/README.md: -------------------------------------------------------------------------------- 1 | ### Generative Adversarial Networks 2 | 3 | Explanations of GAN can be found here: https://towardsdatascience.com/introduction-to-generative-networks-e33c18a660dd 4 | 5 | References: 6 | 7 | 1. https://arxiv.org/pdf/1406.2661.pdf 8 | 2. https://towardsdatascience.com/understanding-generative-adversarial-networks-gans-cd6e4651a29 9 | 10 | Step by step guide by Google developers: 11 | 12 | https://developers.google.com/machine-learning/gan 13 | -------------------------------------------------------------------------------- /Generative Networks/README.md: -------------------------------------------------------------------------------- 1 | ### Generative Network: 2 | 3 | Codes: 4 | 5 | 1. https://github.com/abr-98/GAN-Models 6 | 2. https://github.com/eriklindernoren/Keras-GAN 7 | 8 | Papers: 9 | 10 | 1. GAN: https://arxiv.org/pdf/1406.2661.pdf 11 | 2. DCGAN: https://arxiv.org/pdf/1511.06434.pdf 12 | 3. WGAN: https://arxiv.org/pdf/1701.07875.pdf 13 | 4. WGAN-GP:https://arxiv.org/pdf/1704.00028.pdf 14 | 5. DCGAN_tut: https://arxiv.org/pdf/1701.00160.pdf 15 | 6. 
CGAN: https://arxiv.org/pdf/1411.1784.pdf 16 | 7. SGAN: https://arxiv.org/pdf/1606.01583.pdf 17 | 8. SGAN_wasserstein: https://ieeexplore.ieee.org/document/8914286 18 | 19 | Resources: 20 | 21 | 1. https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html 22 | 2. https://jonathan-hui.medium.com/gan-gan-series-2d279f906e7b 23 | 3. https://machinelearningmastery.com/practical-guide-to-gan-failure-modes/ 24 | 4. https://jonathan-hui.medium.com/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b 25 | 5. https://towardsdatascience.com/dcgans-deep-convolutional-generative-adversarial-networks-c7f392c2c8f8 26 | 6. https://machinelearningmastery.com/how-to-develop-a-conditional-generative-adversarial-network-from-scratch/ 27 | 7. https://medium.datadriveninvestor.com/an-introduction-to-conditional-gans-cgans-727d1f5bb011 28 | 8. https://ai.plainenglish.io/review-cgan-conditional-gan-gan-78dd42eee41 29 | 9. https://towardsdatascience.com/semi-supervised-learning-with-gans-9f3cb128c5e 30 | 10. https://medium.com/@jos.vandewolfshaar/semi-supervised-learning-with-gans-23255865d0a4 31 | 32 | 33 | 34 | -------------------------------------------------------------------------------- /Image Feature Extractors/Feature Pyramid Network/README.md: -------------------------------------------------------------------------------- 1 | ### Feature Extraction in Deep Learning 2 | 3 | We have discussed a lot about traditional feature extractors, similarly, there have been attempts to develop feature extractors, using deep learning as well. One of the extracting models very popularly used is the **Feature Pyramid Network**. The concept of Pyramid networks is widely used in most classification and segmentation problems. 4 | 5 | The Pyramid Networks are based on the idea that if we apply convolutions, in order to extract features, from the same image at different spatial resolutions, each of the resolutions, will provide different feature maps that will focus on different views and fields of the image. Thus, we can obtain a much better and precise feature representation of the image. 6 | 7 | Initially, the idea was implemented through the creation of the image pyramid. 8 | 9 | ![Pyramid](https://miro.medium.com/max/600/1*UAee9W6LRTIYT0ygCFAdmw.png) 10 | 11 | As shown in the above figure, the actual image, is upscaled to double the size and downscaled to half the size, in order to create the pyramid. Now, here the catch is downscaling causes a loss in information, so, prior to that a guassian filtering is use, to spread the information. 12 | 13 | The operations of upscaling and downscaling can be performed using opencv in python. But using convolutional layers on so many layers seperately is a very time consuming and costly task, to solve this issue FPN or feature Pyramid Networks were introduced. 14 | 15 | ![Feature_P_1](https://miro.medium.com/max/1380/1*D_EAjMnlR9v4LqHhEYZJLg.png) 16 | 17 | FPN uses the multi-scale feature maps from different layers. It has a two pathway system, a bottom-up pathway and a corresponding top-down pathway. The bottom-up path way, uses a deep convolutional network backbone like, VGG-19 or Resnet. The bottom up pathway provides the rich low resolution feature maps from the image. Every stage in the bottom-up layer processes the image, and reduces the resolution by half before sending it too the next layer, with application of a pooling layer. 
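Before moving on to the top-down pathway described next, here is a hedged sketch of how the multi-scale outputs of such a bottom-up backbone can be collected in tf.keras (the layer names come from the tf.keras ResNet50 implementation and should be verified against model.summary(); they are an assumption, not something specified in this text):

```python
# Collecting multi-scale feature maps (C2-C5) from a ResNet50 bottom-up
# backbone, as an FPN would before building its top-down pathway.
import tensorflow as tf

backbone = tf.keras.applications.ResNet50(include_top=False, weights=None,
                                           input_shape=(224, 224, 3))
stage_outputs = ["conv2_block3_out", "conv3_block4_out",
                 "conv4_block6_out", "conv5_block3_out"]   # C2, C3, C4, C5
c2, c3, c4, c5 = (backbone.get_layer(name).output for name in stage_outputs)
pyramid_extractor = tf.keras.Model(inputs=backbone.input, outputs=[c2, c3, c4, c5])

for fmap in pyramid_extractor(tf.zeros((1, 224, 224, 3))):
    print(fmap.shape)   # spatial size halves at each stage: 56, 28, 14, 7
```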
18 | 19 | The bottom up layer help to extract the best semantic features, while the top down layer, upsamples the feature maps at every stage to get the localized features, that can be found from more high resolution images. Every stage in the top-down limb upsamples the feature map by 2, using nearest distance weighted interpolation method. The upsampling layers in known to create an aliasing effect, so, an extra input from the corresponding layer from the bottom up limb is added after upsampling using a lateral connection. The input from the bottom up limb is passed through a 1x1 convolution, in order to resize the number of channels as, to add we need the two inputs to have same dimension and number of channels. The addition helps to focus on smaller and finer features while processing. 20 | 21 | We extract features from all the levels of the top down limb in order to obtain multi scaled features. So, we obtain 4 seperate feature maps, one from each layer, which are flattened and passed into the fully connected layers for classification tasks. 22 | 23 | References: 24 | 1. https://arxiv.org/pdf/1612.03144.pdf 25 | 2. https://medium.com/analytics-vidhya/a-beginners-guide-to-computer-vision-part-4-pyramid-3640edeffb00 26 | 3. https://docs.opencv.org/3.4/d4/d1f/tutorial_pyramids.html 27 | 4. https://towardsdatascience.com/review-fpn-feature-pyramid-network-object-detection-262fc7482610 28 | 5. https://jonathan-hui.medium.com/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c 29 | 30 | An example shown in the file. 31 | 32 | Label: 0 -> cat, 1 -> dog 33 | 34 | The file structure:\ 35 | 36 | Train\ 37 | |- Cat\ 38 | |- Dog\ 39 | Validate\ 40 | |- Cat\ 41 | |- Dog\ 42 | Test\ 43 | |- Cat\ 44 | |- Dog 45 | 46 | The designed network is a feature pyramid network. 47 | 48 | 49 | 50 | -------------------------------------------------------------------------------- /Image Feature Extractors/README.md: -------------------------------------------------------------------------------- 1 | ### Feature Extraction 2 | 3 | Feature extraction the idea of extraction of some important useful features to define the characteristic of a data. The extracted features are fed to the machine learning algorithms to make a classification or prediction. Selecting a good set of features help to increase the predictive power of the model. 4 | 5 | In image data, the measurable features are distinct colour or specific shape like, corner, line, edge or an image segment. Good set of features extracted from images can help machine learning algorithms to accomplish image comparison and object detection tasks. [WIKI](https://en.wikipedia.org/wiki/Feature_(computer_vision)#Extraction) 6 | 7 | Machine Learning models like SVM, draw important conclusions from these features which help them to make decision,so, to use these hand-crafted feature importancenis necessary. 8 | 9 | ![ML](https://miro.medium.com/max/700/0*SVNmDa8IGVle0ue9.png) 10 | 11 | Deep learning algorithms on the other hand, are built-in feature extractors + decision makers. They do not need hand crafted feature extraction filters. They automatically extract features and learn their importance bt applying weights to the connection, using backpropagation in Neural Networks. 12 | 13 | ![DL](https://miro.medium.com/max/700/0*LYa9apx25QSsRgj5.png) 14 | 15 | In order to use state of the art ML algorithms to make predictions on image data, we have several traditional feature extractors. 
These extractors are present in the OpenCV libraries of python. Apart from Traditional extractors, we can also use the pretrained models as feature extractors using transfer learning, by removing the top softmax layer. The pre-trained models outputs the feature maps in the shape 7 x 7 x N where N is the number of channels, for a given image, in most cases. We need to use a flatten to obtain the training data for the predictive ML model for an image. 16 | 17 | Reference Example: 18 | 1. https://www.pyimagesearch.com/2019/05/27/keras-feature-extraction-on-large-datasets-with-deep-learning/ 19 | 2. https://liu.diva-portal.org/smash/get/diva2:1151145/FULLTEXT01.pdf 20 | 21 | In case of traditional feature extractors, we need to apply the available filters on the image to extract the feature maps from the image. We need to flatten the feature maps to create dataset in order to train the traditional ML models. The filters used in traditional extractors are designed with fixed weight kernels to highlight specific features like, edges, corners and lines. Some of the most used traditional extraction methods are: 22 | 23 | **1. Histogram of Oriented Gradients (HOG):** It is one of the most important feature descriptors, which is very useful for training algorithms like SVM. It calculates the pixel gradients, i.e, the rate of change in pixel values along the x and y axis, and the direction of the gradient. The magnitude of gradient helps to detect the edges and corner and the direction gives information about the shape. The HOG descriptor is used in patches in multiple scales cropped from the original image in the aspect ratio of 1:2. 24 | 25 | Reference Links: 26 | 1. https://stackoverflow.com/questions/6090399/get-hog-image-features-from-opencv-python 27 | 2. https://learnopencv.com/histogram-of-oriented-gradients/ 28 | 3. https://www.analyticsvidhya.com/blog/2019/09/feature-engineering-images-introduction-hog-feature-descriptor/ 29 | 30 | **2.Haar Cascades:** Object Detection using Haar feature-based cascade classifiers is an effective object detection method proposed by Paul Viola and Michael Jones in their paper, “Rapid Object Detection using a Boosted Cascade of Simple Features” in 2001. Haar cascades are presented as a cascade of classifiers. The final classifier is a weighted sum of weak classifiers. These weak classifiers can’t alone classify an image, but together form a strong classifier. We know that, in the case of convolutions, the kernels are devised during training but here they have a set of Haar features or filters that are fixed which are applied on the training images to extract features from them. These filters or kernels of different sizes are passed over the images and extract a huge number of features from images that are used to classify the target objects. OpenCV has a trainer and a detector that can use Haar Cascades. 31 | 32 | ![Filters](https://miro.medium.com/max/489/1*AeAYGJkO_tNhjhI0ajFwTQ.png) 33 | 34 | There are several filters to detect different features. They are majorly used for face detection. 35 | 36 | References: 37 | 38 | 1.https://github.com/opencv/opencv/tree/master/data/haarcascades \ 39 | 2.https://docs.opencv.org/3.4/db/d28/tutorial_cascade_classifier.html \ 40 | 3.https://www.pyimagesearch.com/2021/04/12/opencv-haar-cascades/ 41 | 42 | **3.SIFT: Scale Invariant Feature Transform:** In this method, we locate features called "keypoints" on the image, which are invariant to rotation, noise, scaling. 
Using the gradient magnitude and direction, at the keypoints we create unique footprints for the objects in the image. These footprints are used in case of object identification, matching and detection. In order to find the "keypoints", we use a guassian blur to remove noise, a scale space inorder to obtain the scale invariant points. Next, we use a Differnece of Gaussian (DOG) method to extract the points of interest. We then proceed to an orientation assignment operation, in order to check if the points are rotation invariant. Lastly, using the combination of values, we develop an unique fingerprint for every keypoint called the feature descriptor. 43 | 44 | References: 45 | 1. https://www.analyticsvidhya.com/blog/2019/10/detailed-guide-powerful-sift-technique-image-matching-python/ 46 | 2. https://docs.opencv.org/3.4/da/df5/tutorial_py_sift_intro.html 47 | 3. https://link.springer.com/article/10.1023/B:VISI.0000029664.99615.94 48 | 49 | 50 | **4. SURF: Speeded-Up Robust Feature:** It is a faster version of SIFT. 51 | 52 | References: 53 | 1.https://docs.opencv.org/3.4/df/dd2/tutorial_py_surf_intro.html \ 54 | 2.https://medium.com/data-breach/introduction-to-surf-speeded-up-robust-features-c7396d6e7c4e \ 55 | 3.https://link.springer.com/chapter/10.1007/11744023_32 56 | 57 | Other most used Feature exractors are: \ 58 | BRIEF: https://docs.opencv.org/3.4/dc/d7d/tutorial_py_brief.html \ 59 | ORB: https://docs.opencv.org/3.4/d1/d89/tutorial_py_orb.html \ 60 | FAST: https://docs.opencv.org/3.4/df/d0c/tutorial_py_fast.html 61 | 62 | More Resources: 63 | 64 | 1. https://arxiv.org/pdf/1710.02726.pdf 65 | 2. https://ieeexplore.ieee.org/document/8346440 66 | 3. https://towardsdatascience.com/image-feature-extraction-traditional-and-deep-learning-techniques-ccc059195d04 67 | 4. https://medium.com/machine-learning-world/feature-extraction-and-similar-image-search-with-opencv-for-newbies-3c59796bf774 68 | 5. https://medium.com/data-breach/introduction-to-surf-speeded-up-robust-features-c7396d6e7c4e 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Abhijit Roy 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Guide-to-Computer-Vision 2 | A step by step guide to Computer vision 3 | 4 | ![Vision](https://github.com/abr-98/Guide-to-Computer-Vision/blob/main/Vision.png) 5 | 6 | 7 | 8 | Computer Vision has emerged as a majorly developing field of research over the years. The introduction of Convolutional Neural Networks along with the support of sophisticated softwares and high computational powers contributed to make Computer Vision one of the most important fields in a huge number of application fields that include self-driving automobiles, facial data based applications, security systems, and biomedical imaging. 9 | 10 | The repository aims to provide a full step by step guide to the field of computer vision. 11 | 12 | #### Feel free to fork this repository and pull requests. 13 | #### Please comment on code, if present 14 | #### Each portion of computer vision has a dedicated folder: Each folder has a README. 15 | 16 | 17 | Please Visit: https://medium.com/@myac.abhijit/a-beginners-guide-to-computer-vision-3a8aea9304cd 18 | -------------------------------------------------------------------------------- /Real life Implementations/3D model generation/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Real life Implementations/Image based search/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Real life Implementations/Intelligent Automobiles/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Real life Implementations/Medical Imaging/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Real life Implementations/Optical Character Recognition/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Real life Implementations/Recognition and Biometrics/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Real life Implementations/Survelliance/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Special Networks/README.md: -------------------------------------------------------------------------------- 1 | There are several networks apart from the main stream network models, which have been developed to acknowledge specific challenges in the field of computer vision and attain specific tasks. We talk about such models in this section. 2 | -------------------------------------------------------------------------------- /Special Networks/Siamese Network/README.md: -------------------------------------------------------------------------------- 1 | ### Siamese Network 2 | 3 | The idea of siamese network was first proposed by Lev V. 
Utkin in the paper titled "An explanation method for Siamese neural networks", although the architecture itself dates back to Bromley et al.'s signature-verification network from the early 1990s. The main focus of the Siamese network is to deal with classification tasks where there is a shortage of data or a severe class imbalance, as in face or signature verification mechanisms, where it is impractical to obtain a large number of samples per class. 4 | 5 | The SNN basically works on a shared-encoder idea. It passes two images through two identical networks, i.e., networks with the same architecture and weights, so that two corresponding feature maps are obtained. Then the similarity between the obtained feature maps is checked: if the difference is small, the objects are similar, and vice versa. 6 | 7 | #### Architecture 8 | 9 | ![Arch](https://miro.medium.com/max/770/1*I7a9aVN2poHUtiHSq2q44Q.png) 10 | 11 | The above image gives the architecture. So, we have two models, but they are identical in weights and structure; it can be said that they are instances of the same model. On being passed two images, the network returns 1 if they are similar and 0 otherwise. Siamese networks convert the classification problem into a similarity problem and learn from semantic similarity, as the features are obtained from the same semantic embedding space. 12 | 13 | #### Losses 14 | 15 | The model is trained to minimize the distance between samples of the same class while increasing the inter-class distance. The most used losses are listed below; a minimal TensorFlow sketch of both is given at the end of this README. 16 | 17 | 1. **Contrastive loss:** In contrastive loss, pairs of images are taken. For same-class pairs the distance between the embeddings should be small; for different-class pairs it should be large. This loss is used to learn embeddings in which two similar points have a low Euclidean distance and two dissimilar points have a large Euclidean distance. It is given by: 18 | 19 | 20 | **L = Y * D^2 + (1-Y) * max(margin - D, 0)^2** 21 | 22 | D is the distance between the image features, Y is 1 for a similar pair and 0 for a dissimilar pair, and ‘margin’ is a parameter that helps us push different classes apart. 23 | 24 | 2. **Triplet loss:** The model takes three inputs: anchor, positive, and negative. The anchor is a reference input, the positive input belongs to the same class as the anchor, and the negative input belongs to a random class other than the anchor class. The idea behind the triplet loss function is that we minimize the distance between the anchor and the positive sample and simultaneously maximize the distance between the anchor and the negative sample. It is given by: 25 | 26 | **Loss L = max(d(a,p) - d(a,n) + margin, 0)** 27 | 28 | where d is the distance metric, a is the anchor, p is the positive example and n is the negative example. The loss becomes zero only when d(a,p) + margin <= d(a,n), i.e., when the positive is closer to the anchor than the negative by at least the margin. 29 | 30 | References: 31 | 32 | 1. https://arxiv.org/pdf/1911.07702.pdf 33 | 2. https://towardsdatascience.com/one-shot-learning-with-siamese-networks-using-keras-17f34e75bb3d 34 | 35 | Application: 36 | 37 | https://ai.plainenglish.io/an-approach-towards-building-a-gui-based-and-voice-controlled-security-lock-mechanism-based-on-f06b47d0426e 38 |
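As referenced above, here is a minimal TensorFlow sketch of the two losses; the margin values are illustrative defaults, not taken from any specific paper.

~~~
import tensorflow as tf

def contrastive_loss(y_true, distance, margin=1.0):
    # y_true = 1 for a similar pair, 0 for a dissimilar pair;
    # distance = Euclidean distance between the two embeddings.
    y_true = tf.cast(y_true, distance.dtype)
    similar_term = y_true * tf.square(distance)
    dissimilar_term = (1.0 - y_true) * tf.square(tf.maximum(margin - distance, 0.0))
    return tf.reduce_mean(similar_term + dissimilar_term)

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Squared Euclidean distances anchor-positive and anchor-negative.
    d_ap = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    d_an = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # Zero loss once the positive is closer than the negative by at least the margin.
    return tf.reduce_mean(tf.maximum(d_ap - d_an + margin, 0.0))
~~~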
-------------------------------------------------------------------------------- /Transfer Learning/DenseNet/README.md: -------------------------------------------------------------------------------- 1 | ### Densenet 2 | 3 | DenseNet was first proposed in 2017 by a joint team from Tsinghua University, Cornell University and Facebook AI Research. 4 | 5 | The idea of DenseNet can be seen as a modification of the pre-activated ResNet with a much stronger gradient flow; DenseNet reworks the ResNet module completely. In the basic version, the authors ensured that each layer receives additional inputs from all the previous layers and passes its own feature maps on to all subsequent layers. The feature maps are concatenated in DenseNets instead of being added as in ResNets. 6 | 7 | ![Densenet](https://miro.medium.com/max/770/1*rmHdoPjGUjRek6ozH7altw.png) 8 | 9 | The above image represents the structure of DenseNets. Each layer receives the "collective knowledge" of all previous layers. This arrangement lets us use a controlled number of channels, given by the growth rate k: k is the number of channels added by each layer, and the concatenated feature maps move forward. 10 | 11 | ![Concat](https://miro.medium.com/max/660/1*9ysRPSExk0KvXR0AhNnlAA.gif) 12 | 13 | The module was then modified in the format of the pre-activated ResNet, having batch normalization, ReLU and a 3x3 convolution layer; to reduce the complexity and size, a 1x1 convolution is additionally applied in each block of a layer. 14 | 15 | ![final](https://miro.medium.com/max/770/1*dniz8zK2ClBY96ol7YGnJw.png) 16 | 17 | #### Architecture 18 | 19 | ![Arch](https://miro.medium.com/max/770/1*BJM5Ht9D5HcP5CFpu8bn7g.png) 20 | 21 | Three dense blocks are used in the final architecture, separated by transition blocks containing a 1x1 convolution and an average pooling layer. At the end of the last dense block, a global average pooling is performed and then a softmax classifier is attached. 22 | 23 | Reference: 24 | 25 | 1.https://towardsdatascience.com/review-densenet-image-classification-b6631a8ef803 \ 26 | 2.https://arxiv.org/pdf/1608.06993.pdf 27 | 28 | More at: 29 | https://towardsdatascience.com/densenet-2810936aeebb 30 | -------------------------------------------------------------------------------- /Transfer Learning/EfficientNet/README.md: -------------------------------------------------------------------------------- 1 | ### Efficient Net 2 | 3 | EfficientNet was proposed in 2019 by Google. The model was trained on the ImageNet dataset. The paper focused majorly on two elements: 4 | 1. To find a mobile-sized baseline architecture: EfficientNet-B0 5 | 2. To put forward an effective compound scaling method to achieve more robust model efficiency and performance. 6 | 7 | It primarily focuses on reducing the FLOPS (floating point operations) and the parameters used in the models. First, we need to look at the concept of scaling. 8 | 9 | Previously, we have scaled models up and down simply by stacking convolutional modules depending on the availability of resources. The authors proposed a different concept of model scaling. They introduced three types of scaling: 10 | 11 | 1. Depth-wise scaling: Scaling in the depth dimension is done by simply adding or removing convolutional layers, so it concerns the number of layers in the network. The deeper the network, the richer and more complex the features it can capture. That is the common intuition, but experiments show that it is not always true. 12 | 13 | 2. Width-wise scaling: Scaling in width is done by adding more filters to the convolution layers, so width is represented by the number of filters. The wider the network, the more fine-grained and localized the features it can capture. Research has shown that after a certain point the accuracy saturates. 14 | 15 | 3. Resolution scaling: This scaling deals with the dimensions of the input image.
The larger the input dimensions, the higher the resolution, and the better the fine-grained features are said to be represented. 16 | 17 | ![scale](https://miro.medium.com/max/700/1*xQCVt1tFWe7XNWVEmC6hGQ.png) 18 | 19 | The paper also showed that increasing only one of the three dimensions gives quickly saturating gains in accuracy. Intuitively, if we increase the width but not the depth, the network will not be able to make full use of the additional features. So, we must use a compound scaling of all three dimensions together to increase the accuracy. 20 | 21 | To achieve this, the authors framed the problem as an effective scaling technique using a compound coefficient ɸ to uniformly scale the network, as: 22 | 23 | depth: d = α^ɸ 24 | width: w = β^ɸ 25 | resolution: r = γ^ɸ 26 | 27 | such that α x β^2 x γ^2 = 2 (approx), 28 | with all of α, β and γ greater than 1. 29 | 30 | ɸ is a user-specified coefficient that controls how many resources are available, whereas α, β, and γ specify how to assign these resources to network depth, width, and resolution respectively. A small numeric illustration is given after the references below. 31 | 32 | #### Architecture 33 | 34 | Now, to compare the different scaling strategies, the authors needed a good base model that could be scaled to give the required results. 35 | 36 | The authors used a neural architecture search (similar to MnasNet) to search the model space and find the best-fitting baseline. 37 | 38 | ![Arch](https://1.bp.blogspot.com/-DjZT_TLYZok/XO3BYqpxCJI/AAAAAAAAEKM/BvV53klXaTUuQHCkOXZZGywRMdU9v9T_wCLcBGAs/s640/image2.png) 39 | 40 | The above architecture was proposed as the baseline, EfficientNet-B0. The MBConv blocks are inverted residual blocks, as used in MobileNetV2, with squeeze-and-excitation blocks. Fixing ɸ = 1, a small grid search gave α = 1.2, β = 1.1 and γ = 1.15 for the baseline; keeping these constants fixed and increasing ɸ then gives the scaled-up versions B1-B7. 41 | 42 | ![Module](https://amaarora.github.io/images/mbconv.png) 43 | 44 | The above image represents the MBConv module. 45 | 46 | References: \ 47 | 1. https://arxiv.org/pdf/1905.11946.pdf \ 48 | 2. https://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html \ 49 | 3. https://amaarora.github.io/2020/08/13/efficientnet.html \ 50 | 4. https://medium.com/@nainaakash012/efficientnet-rethinking-model-scaling-for-convolutional-neural-networks-92941c5bfb95 \ 51 | 5. https://arxiv.org/abs/1801.04381 52 |
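To make the compound scaling concrete, here is a small, purely illustrative computation using the coefficients reported for the B0 baseline; the choice of ɸ values is arbitrary.

~~~
alpha, beta, gamma = 1.2, 1.1, 1.15   # coefficients found for the baseline

for phi in range(1, 4):
    depth_mult = alpha ** phi          # multiply the number of layers by this
    width_mult = beta ** phi           # multiply the number of channels by this
    res_mult = gamma ** phi            # multiply the input resolution by this
    flops_growth = (alpha * beta**2 * gamma**2) ** phi   # stays close to 2**phi
    print(f"phi={phi}: depth x{depth_mult:.2f}, width x{width_mult:.2f}, "
          f"resolution x{res_mult:.2f}, FLOPS x{flops_growth:.2f}")
~~~

For ɸ = 1 this gives roughly 1.2x depth, 1.1x width, 1.15x resolution and about 1.92x FLOPS, which is the "approximately 2" constraint mentioned above.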
53 | 54 | 55 | 56 | -------------------------------------------------------------------------------- /Transfer Learning/Inception/README.md: -------------------------------------------------------------------------------- 1 | ### Inception Network 2 | 3 | The Inception network was put forward by Google for the ImageNet Visual Recognition Challenge in 2014. It was first introduced in the paper "Going deeper with convolutions". 4 | 5 | The Inception network was seen to perform better than most other networks on the benchmark dataset. Apart from the network itself, the Inception family in its different versions put forward several concepts, like the Inception module, the use of 1x1 convolutions for dimensionality reduction, and factorized convolutions. 6 | 7 | The driving idea behind the basic Inception module is that kernels of different sizes provide different features from the same image data. For instance, a 3x3 convolution provides a feature map different from that of a 5x5 convolution. If we simply stack these different kernels on top of each other, we create a deep network, with a risk of information loss and a vanishing gradient problem. So, the authors decided to create a wider network instead of a deeper one, so that convolutional layers with different kernel sizes work in parallel on the same data, and all the obtained results are stacked up and concatenated. Several feature maps highlighting different features are therefore present in the final stack. 8 | 9 | ![Naive](https://miro.medium.com/max/770/1*DKjGRDd_lJeUfVlY50ojOA.png) 10 | 11 | The above image shows the naive Inception module, but it causes a channel-explosion problem: a number of convolution layers are present here, with kernel sizes up to 5x5, which results in a huge number of parameters and a huge time complexity. The problem was solved by adopting a dimensionality reduction strategy using 1x1 convolutions; a minimal Keras sketch of this reduced module is given after the references below. 12 | 13 | ![Mod](https://miro.medium.com/max/770/1*U_McJnp7Fnif-lw9iIC5Bw.png) 14 | 15 | #### Inception V1 16 | 17 | ![v1](https://miro.medium.com/max/770/1*uW81y16b-ptBDV8SIT1beQ.png) 18 | 19 | The above image represents version 1 of the Inception networks. It used 9 Inception modules in a network 22 layers deep. One of the most important challenges faced by Inception v1 was the vanishing gradient problem caused by the depth of the network, so two **auxiliary classifiers** were added to the network (shown in boxes), which injected fresh gradient into the middle of the network and did not let it vanish. 20 | 21 | #### Inception V2 22 | 23 | Inception V2 found that a combination of two 3x3 convolutional layers instead of a single 5x5 convolution improves the speed and performance of the network; a 5x5 kernel was found to be about 2.78 times more expensive than a 3x3 kernel. It put forward the idea of **factorization** of the convolutional layers: a 5x5 kernel is factorized into two 3x3 convolutions. It was also observed that factorizing an nxn kernel into asymmetric filters like 1xn and nx1 improves performance while making the operations cheaper. In v2, the Inception module is modified as: 24 | 25 | ![v2](https://miro.medium.com/max/633/1*DVXTxBwe_KUvpEs3ZXXFbg.png) 26 | 27 | #### Inception V3 28 | 29 | Inception V3 brought some changes compared to Inception V2: it introduced the RMSProp optimizer, a factorized 7x7 convolution and batch normalization. 30 | 31 | There have been other models, like Inception-v4 and Inception-ResNet, which are currently used as state-of-the-art architectures. 32 | 33 | References: 34 | 35 | 1.https://arxiv.org/pdf/1409.4842.pdf 36 | 2.https://arxiv.org/pdf/1512.00567v3.pdf 37 | 3.https://arxiv.org/pdf/1602.07261.pdf 38 | 4.https://www.geeksforgeeks.org/ml-inception-network-v1/ 39 | 5.https://www.geeksforgeeks.org/inception-v2-and-v3-inception-network-versions/ 40 | 6.https://medium.com/analytics-vidhya/inception-network-and-its-derivatives-e31b14388bf9 41 | 7.https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202 42 | 43 | 44 | 45 |
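As referenced above, here is a minimal Keras sketch of the dimension-reduced Inception module; the filter counts follow the first module of GoogLeNet and are otherwise just an example.

~~~
from tensorflow.keras import layers, Input, Model

def inception_module(x, f1x1, f3x3_reduce, f3x3, f5x5_reduce, f5x5, pool_proj):
    # 1x1 branch
    b1 = layers.Conv2D(f1x1, 1, padding="same", activation="relu")(x)
    # 1x1 reduction followed by a 3x3 convolution
    b2 = layers.Conv2D(f3x3_reduce, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3x3, 3, padding="same", activation="relu")(b2)
    # 1x1 reduction followed by a 5x5 convolution
    b3 = layers.Conv2D(f5x5_reduce, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5x5, 5, padding="same", activation="relu")(b3)
    # 3x3 max-pooling followed by a 1x1 projection
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(pool_proj, 1, padding="same", activation="relu")(b4)
    # Concatenate all branches along the channel axis
    return layers.Concatenate()([b1, b2, b3, b4])

inp = Input((224, 224, 3))
out = inception_module(inp, 64, 96, 128, 16, 32, 32)
Model(inp, out).summary()
~~~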
-------------------------------------------------------------------------------- /Transfer Learning/Mobile_net/README.md: -------------------------------------------------------------------------------- 1 | ### MobileNet 2 | 3 | MobileNet has been proposed by Google in two versions, V1 and V2, in the years 2017 and 2018 respectively. MobileNets are designed to be light-weight and very cost-efficient for on-device use, such as on mobile devices, and for cloud APIs. 4 | 5 | ### MobileNet-V1 6 | 7 | MobileNet-V1 uses depthwise separable convolutions. It has far fewer parameters and a reduced number of mathematical operations. MobileNet-V1 also introduces a width multiplier α for thinner models and a resolution multiplier ρ for reduced representations. These two parameters are provided by the user to control the size of the model and the resources it consumes. 8 | 9 | The idea behind **depthwise separable convolution** is that it separates the convolution into a spatial, per-channel (depthwise) operation and a cross-channel (pointwise) operation. In the original separable convolution process, we have a depthwise convolution followed by a pointwise convolution. 10 | 11 | ![depsep](https://miro.medium.com/max/770/1*VvBTMkVRus6bWOqrK1SlLQ.png) 12 | 13 | As the above image shows, the depthwise convolution is the spatial convolution, so one filter is convolved with one channel at a time; for z channels in the input, we will have z nxn spatial convolutions. It is followed by the pointwise convolution, which changes the channel dimension: the resulting k x k x z feature map is convolved with y filters of shape 1x1xz to produce a k x k x y output, so this step handles the cross-channel mixing. This separation boosts performance, decreases the number of parameters and makes the model lighter. 14 | 15 | As we have already seen, if we use depthwise separable convolutions, we perform depthwise convolutions followed by pointwise convolutions. For the depthwise convolutions, we require in total **n x n x z x k x k** mathematical operations, where z is the number of input channels, n the kernel size, and k the feature map size. For the pointwise convolutions, we use **1 x 1 x y x k x k x z** operations, where y is the number of output channels. So, the total number of operations is: 16 | 17 | **total = n x n x z x k x k + 1 x 1 x y x k x k x z** 18 | 19 | ![Unit-1](https://miro.medium.com/max/408/1*XFvQWA4VESeQIcxzXMHCAg.png) 20 | 21 | The first unit shows the normal convolutional unit and the second unit shows the modified depthwise separable convolution. A quick numeric comparison of the two costs is given at the end of this README, just before the references. 22 | 23 | Now, the width multiplier α is introduced to control the number of channels or channel depth. The modified number of operations becomes: 24 | 25 | **total = n x n x αz x k x k + 1 x 1 x αy x k x k x αz** 26 | 27 | α always lies in the interval (0,1], with typical values of 1, 0.75, 0.5 and 0.25. The baseline has α = 1. 28 | 29 | Next, the paper introduces the resolution multiplier ρ to control the input image resolution of the network; it effectively scales the feature map sizes. The modified number of operations then becomes: 30 | 31 | **total = n x n x αz x ρk x ρk + 1 x 1 x αy x ρk x ρk x αz** 32 | 33 | ρ lies between 0 and 1, with 1 as the baseline. 34 | 35 | Thus the computation and the number of parameters shrink or grow with the values of ρ and α. 36 | 37 | ### MobileNet-V2: 38 | 39 | MobileNet-V2 modified MobileNet-V1's basic convolutional block as shown: 40 | 41 | ![update](https://miro.medium.com/max/770/1*bqE59FvgpvoAQUMQ0WEoUA.png) 42 | 43 | V2 has two types of modules: one with stride 1, which keeps the spatial dimensions unaltered, and one with stride 2, which reduces the size of the feature maps. The stride-1 module uses a residual skip-connection structure, while the stride-2 block is relatively simple. 44 | 45 | The V2 module brought one major change: the last layer of the block is a 1x1 convolution layer with a linear activation. The paper claimed that if ReLU is used there again, the deep network only has the power of a linear classifier on the non-zero-volume part of the output domain. 46 | 47 | 48 | The V2 model gave much higher performance compared to other light-weight models.
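Before the references, here is a quick numeric check of the operation counts above; the layer sizes are illustrative and not taken from the paper.

~~~
# n = kernel size, k = feature map size, z = input channels, y = output channels
n, k, z, y = 3, 112, 32, 64

standard  = n * n * z * y * k * k          # ordinary convolution
depthwise = n * n * z * k * k              # one n x n filter per input channel
pointwise = 1 * 1 * y * k * k * z          # 1x1 cross-channel combination
separable = depthwise + pointwise

print(f"standard:  {standard:,}")
print(f"separable: {separable:,}")
print(f"reduction: {standard / separable:.1f}x  (theory: 1 / (1/y + 1/n**2))")
~~~

For these values the separable form needs roughly 8x fewer operations, which is where MobileNet's efficiency comes from.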
49 | 50 | References: 51 | 52 | 1. https://arxiv.org/pdf/1704.04861.pdf 53 | 2. https://arxiv.org/pdf/1801.04381.pdf 54 | 3. https://ai.googleblog.com/2018/04/mobilenetv2-next-generation-of-on.html 55 | 4. https://towardsdatascience.com/review-mobilenetv1-depthwise-separable-convolution-light-weight-model-a382df364b69 56 | 5. https://towardsdatascience.com/review-mobilenetv2-light-weight-model-image-classification-8febb490e61c 57 | 6. https://analyticsindiamag.com/why-googles-mobilenetv2-is-a-revolutionary-next-gen-on-device-computer-vision-network/ 58 | 59 | -------------------------------------------------------------------------------- /Transfer Learning/README.md: -------------------------------------------------------------------------------- 1 | ### Transfer Learning 2 | 3 | 1. In most cases, image processing and detection algorithms need to be very well trained on complex datasets in order to accomplish certain tasks. 4 | 2. Sometimes datasets are very small and we do not have enough data to train large, deep models. 5 | 6 | In the above circumstances, instead of randomly initializing a network and training it from scratch, we use already pretrained networks for the task. 7 | 8 | There are several challenges, for example the ImageNet and MS-COCO challenges, where deep network models are proposed and their performances are compared on standard benchmark datasets. The best-performing models are kept as standard, state-of-the-art pretrained models for further use. 9 | 10 | The above concept is called transfer learning. Transfer learning allows us to deal with these scenarios by leveraging the already existing labeled data of some related task or domain. We can use the standard models in a number of ways: 11 | 12 | 1. We can use only the structure of the networks, initialize them with random weights, and train them. 13 | 2. We can modify the layer structure and add some of our own layers at the top. 14 | 3. We can use both the weights and the structure of the networks. 15 | 4. We may or may not modify the weights of the pretrained networks; we can train only the weights of our added layers, leaving the rest of the network unchanged during training. 16 | 17 | The above summarizes transfer learning. Next we go into the details. 18 | 19 | 20 | ### The terminologies and variants 21 | 22 | The formal definition states that, in the case of transfer learning, we have *Xs*, a source feature space, *P(Xs)*, its probability distribution, *D(Xs, P(Xs))*, a source domain, and *Ts*, a source task with labels *Ys*. Similarly, we have *Xt*, a target feature space with distribution *P(Xt)*, *D(Xt, P(Xt))*, a target domain, and so on. 23 | 24 | So, our goal is to use the knowledge gained while solving the source task in the source domain to accomplish a target task in the target domain. Depending on the circumstances, several conditions may arise: 25 | 26 | 1. Xt != Xs: The feature spaces are different 27 | 2. P(Xt) != P(Xs): The feature spaces are the same, but the distributions are different 28 | 3. Ys != Yt: The labels of source and target are different 29 | 4. P(Ys|Xs) != P(Yt|Xt): The conditional probabilities are different. 30 | 31 | Based on the above, several transfer learning strategies have been adopted to solve the problems in this field. 32 | 33 | ### Transfer learning in Deep learning: 34 | 35 | In deep learning, we use **inductive transfer learning**. In deep transfer learning, we have two different strategies: 36 | 37 | 1.
Off-the-shelf Pre-trained Models as Feature Extractors: Here the pre-trained models are used as feature extractors, simply by removing the top classification layers; the layered architecture is leveraged to obtain features from the images. We do not tune the filters in this case (a minimal sketch of this strategy is given at the end of this README). 38 | 39 | 2. Fine-tuning Off-the-shelf Pre-trained Models: In this case, we replace the final layers with new layers based on the task at hand. We also train the last few layers of the network in order to obtain finer, task-adapted features. This is because the initial layers extract the more generic features, while the deeper layers are more task-specific. 40 | 41 | The image below shows the models available for transfer learning in computer vision. 42 | ![types](https://www.educative.io/api/edpresso/shot/4574405643468800/image/4521173214822400) 43 | 44 | References: 45 | 46 | 1. https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a 47 | 2. https://ruder.io/transfer-learning/ 48 | 3. https://www.cse.ust.hk/~qyang/Docs/2009/tkde_transfer_learning.pdf 49 | 4. https://www.analyticsvidhya.com/blog/2017/06/transfer-learning-the-art-of-fine-tuning-a-pre-trained-model/ 50 | 51 | Examples: solved on the cats-vs-dogs dataset. 52 | 53 | **One important point to remember about training models using transfer learning:** 54 | 55 | When the models were initially trained on the ImageNet dataset, the authors applied some preprocessing to the data that rescaled the images and shifted the pixel values by a certain amount to obtain the best results. So, when we load the "imagenet" weights, the filter weights are tuned to work only after the image has been passed through the same preprocessing. 56 | 57 | So, passing the raw image data through custom preprocessing often will not yield good results, and as the models are bulky, it would take a huge amount of time and data to retrain the model parameters for the new custom preprocessing. (It is not always required, though, depending on the problem and the data.) To deal with this, TensorFlow provides a preprocess_input function for every transfer learning model, as shown in v3 of the examples above. 58 | 59 | For example, we can find the preprocessing unit of ResNet at https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet/preprocess_input. 60 | 61 | We can use the preprocessing function separately, manually, as 62 | 63 | ~~~from keras.applications.resnet50 import ResNet50 64 | from keras.preprocessing import image 65 | from keras.applications.resnet50 import preprocess_input 66 | import numpy as np 67 | 68 | model = ResNet50(weights='imagenet') 69 | 70 | img_path = 'elephant.jpg' 71 | img = image.load_img(img_path, target_size=(224, 224)) 72 | x = image.img_to_array(img) 73 | x = preprocess_input(x) 74 | ~~~ 75 | 76 | Or we can deploy it using the ImageDataGenerator: 77 | 78 | ~~~from tensorflow.keras.preprocessing.image import ImageDataGenerator 79 | from keras.applications.resnet_v2 import preprocess_input 80 | train_set=ImageDataGenerator( 81 | rotation_range=20, 82 | zoom_range=0.1, 83 | shear_range=0.2, 84 | preprocessing_function=preprocess_input 85 | ) 86 | val_set=ImageDataGenerator(preprocessing_function=preprocess_input) 87 | 88 | real_label=val_generator.classes ## assuming val_generator = val_set.flow_from_directory(...); this gets the real labels of the data passed.
89 | ~~~ 90 | 91 | 92 | *On a GPU, it is better to keep the batch_size larger.* 93 | 94 | ### Preprocess input vs Re-scaling 95 | 96 | There is often some confusion about whether to rescale the images: 97 | 98 | ~~~from tensorflow.keras.preprocessing.image import ImageDataGenerator 99 | from keras.applications.resnet_v2 import preprocess_input 100 | train_set=ImageDataGenerator( 101 | rescale=1/255, 102 | rotation_range=20, 103 | zoom_range=0.1, 104 | shear_range=0.2 105 | ) 106 | val_set=ImageDataGenerator(rescale=1/255) 107 | 108 | real_label=val_generator.classes ## assuming val_generator = val_set.flow_from_directory(...); this gets the real labels of the data passed. 109 | ~~~ 110 | 111 | Or we can use preprocess_input as the preprocessing function in the ImageDataGenerator, as shown above. 112 | 113 | **When we are training a model from scratch, or if we are going to retrain the transfer learning base model completely, we should use scaling (1/255); but if we are using the pretrained weights of the transfer learning models, it is firmly advised to use the corresponding preprocess_input function as the preprocessing function.** 114 | 115 | The scaling (1/255) works like a min-max normalization: it maps the pixel values, which range from 0-255, to 0-1. This lets the model work on smaller values and increases the stability of training. If we are building a model from scratch, or training the whole base model from scratch, the weights can adapt to whatever input scale we choose, so we can use plain rescaling and do not need any particular preprocessing. 116 | 117 | If we are using pretrained models and training only the newly added fully connected layers, we need exactly the same image preprocessing as was used during the original model training. So, the preprocess_input function of that particular base model is necessary. 118 | 119 | *As most transfer learning models are bulky, training them fully may need a lot of data and epochs.* 120 | 121 | ### Training on Gray-scale image 122 | 123 | As the "imagenet" dataset is an RGB, 3-channel dataset, the transfer learning models are trained on 3-channel inputs. So, in order to train on gray-scale, 1-channel images, we first load and save the weights using a dummy model. Then we initialize another model without loading the default imagenet weights. Finally, we load the saved weights from the dummy model into it, skipping any mismatched layers (such as the first convolution, whose input depth differs). 124 | 125 | ~~~ 126 | weight_model = ResNet50V2(weights='imagenet', include_top=False) ##Initializing the ResNet50V2 model 127 | weight_model.save_weights(fldr+'weights.h5') ##Saving the weights 128 | 129 | input_tensor=Input(shape=shape) ## Setting the input size, e.g. shape=(224, 224, 1) for gray-scale 130 | 131 | base_model = ResNet50V2(weights=None, include_top=False, input_tensor=input_tensor) ##Initializing the new base model 132 | base_model.load_weights(fldr+'weights.h5',skip_mismatch=True, by_name=True) ##Loading the saved weights (transfer learning) 133 | ~~~ 134 | 135 | 136 |
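To round off this README, here is a minimal Keras sketch (assuming TensorFlow 2.x, as in the notebooks in this folder) of the first strategy described above, where the pretrained base is frozen and only the newly added head is trained; the head sizes are illustrative.

~~~
from tensorflow.keras.applications.resnet_v2 import ResNet50V2
from tensorflow.keras import layers, models

base_model = ResNet50V2(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base_model.trainable = False                      # freeze all pretrained layers

model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),              # lighter than Flatten for 7x7x2048 maps
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),        # binary head, e.g. cats vs dogs
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
~~~

Remember that the data fed to this model should go through the ResNet-V2 preprocess_input function discussed above.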
-------------------------------------------------------------------------------- /Transfer Learning/Resnet/README.md: -------------------------------------------------------------------------------- 1 | ### Resnet 2 | 3 | ResNet, or the Residual Neural Network, was first proposed by Microsoft Research for the 2015 ImageNet competition. The paper titled "Deep Residual Learning for Image Recognition" put the idea forward. The driving concept of ResNet is the introduction of "skip connections". 4 | 5 | The ImageNet competitions have shown that increasingly deep neural network architectures can be used to boost performance. But it was also seen that very deep networks often perform worse than shallower ones, because deep networks face the problems of "vanishing" and "exploding" gradients. As the gradients are multiplied many times during backpropagation, and the error derivatives are very small numbers, the gradient almost vanishes by the time it reaches the early layers, and those layers fail to learn because there is no significant weight update. 6 | 7 | To solve the problem, the authors introduced skip-connection modules called residual modules. 8 | 9 | ![res_module](https://media.geeksforgeeks.org/wp-content/uploads/20200424011510/Residual-Block.PNG) 10 | 11 | The above image shows a representation of the ResNet module. The skip connections are designed to skip a few convolutional layers and add the input directly to their output. The idea, as presented in the paper, is that instead of making the layers learn the underlying mapping directly, we let the network fit the residual mapping: instead of the original mapping H(x), the network fits F(x) := H(x) - x, which gives H(x) := F(x) + x. 12 | 13 | In the forward pass we simply add the output of the earlier layer to the output of the later layers; during backpropagation this identity path lets the gradient flow back unchanged, which prevents it from fading. The skip connections provide a free way for the gradient to move. A minimal Keras sketch of such a block is given after the references. 14 | 15 | #### Architecture: Resnet V1 16 | 17 | ![Arch](https://miro.medium.com/max/3112/1*yy6Bbnp38MhcfDQbzOGf4A.png) 18 | 19 | The above image shows the architecture of the proposed ResNet-34 network, having 34 convolutional layers. The similar 50-layer version, ResNet-50 V1, provides even better performance. 20 | 21 | #### Resnet-V2 22 | 23 | ResNet-V2 brought a minor modification to the ResNet-V1 module that produced a major increase in the performance of the model. ResNet-V1 was originally a post-activation module, i.e., the ReLU and batch normalization were applied on the main stem after the addition, which affected the skip connections. Contemporary research showed that the cleaner the skip connections, the better the performance. So, ResNet-V2 used a pre-activation approach instead of post-activation, moving the batch normalization and ReLU to the start of the residual branch so that the skip connection stays a clean identity path. 24 | 25 | ![V2](https://miro.medium.com/max/770/1*M5NIelQC33eN6KjwZRccoQ.png) 26 | 27 | References: 28 | 29 | 1. https://arxiv.org/pdf/1512.03385.pdf 30 | 2. https://arxiv.org/pdf/1603.05027.pdf 31 | 3. https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035 32 | 4. https://towardsdatascience.com/resnet-with-identity-mapping-over-1000-layers-reached-image-classification-bb50a42af03e#:~:text=Using%20pre%2Dactivation%20unit%20can,deeper%20from%20110%20to%201001.&text=Pre%2Dactivation%20unit%20is%20on,but%20with%20lower%20test%20error.
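A minimal Keras sketch of the pre-activation residual block described above; it assumes the input already has `filters` channels, otherwise a 1x1 convolution would be needed on the shortcut.

~~~
from tensorflow.keras import layers

def preact_residual_block(x, filters):
    # Pre-activation (ResNet-V2 style): BN and ReLU come before each convolution,
    # keeping the skip connection itself a clean identity path.
    shortcut = x
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    # F(x) + x: the gradient can always flow through the identity branch.
    return layers.Add()([shortcut, y])
~~~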
33 | 34 | 35 | 36 | 37 | -------------------------------------------------------------------------------- /Transfer Learning/Transfer_Learning_v2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Transfer_Learning_v2.ipynb", 7 | "provenance": [], 8 | "machine_shape": "hm" 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | }, 14 | "language_info": { 15 | "name": "python" 16 | }, 17 | "accelerator": "GPU" 18 | }, 19 | "cells": [ 20 | { 21 | "cell_type": "code", 22 | "metadata": { 23 | "colab": { 24 | "base_uri": "https://localhost:8080/" 25 | }, 26 | "id": "BmxVpRQ867aV", 27 | "outputId": "bd792410-c5df-41e4-8af1-c5c8f945eeaa" 28 | }, 29 | "source": [ 30 | "from google.colab import drive\n", 31 | "drive.mount('/content/drive',force_remount=True)" 32 | ], 33 | "execution_count": 1, 34 | "outputs": [ 35 | { 36 | "output_type": "stream", 37 | "text": [ 38 | "Mounted at /content/drive\n" 39 | ], 40 | "name": "stdout" 41 | } 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "metadata": { 47 | "id": "-w427Sve74RW" 48 | }, 49 | "source": [ 50 | "FLDR='drive/MyDrive/dogs-vs-cats/'\n" 51 | ], 52 | "execution_count": 2, 53 | "outputs": [] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "metadata": { 58 | "id": "f0DhMUEV8OH2" 59 | }, 60 | "source": [ 61 | "import os" 62 | ], 63 | "execution_count": 3, 64 | "outputs": [] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "metadata": { 69 | "id": "w8boRJ5M8P8Z" 70 | }, 71 | "source": [ 72 | "files_zip_full=os.listdir(FLDR)" 73 | ], 74 | "execution_count": 4, 75 | "outputs": [] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "metadata": { 80 | "colab": { 81 | "base_uri": "https://localhost:8080/" 82 | }, 83 | "id": "8jazKmVj8RcI", 84 | "outputId": "8d703746-fe07-4f7b-bb89-ad98e6431257" 85 | }, 86 | "source": [ 87 | "files_zip_full" 88 | ], 89 | "execution_count": 5, 90 | "outputs": [ 91 | { 92 | "output_type": "execute_result", 93 | "data": { 94 | "text/plain": [ 95 | "['sampleSubmission.csv', 'test1.zip', 'train.zip', 'weights.h5']" 96 | ] 97 | }, 98 | "metadata": { 99 | "tags": [] 100 | }, 101 | "execution_count": 5 102 | } 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "metadata": { 108 | "id": "wIeZvHgq9ZLF" 109 | }, 110 | "source": [ 111 | "train_data=files_zip_full[2]" 112 | ], 113 | "execution_count": 6, 114 | "outputs": [] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "metadata": { 119 | "colab": { 120 | "base_uri": "https://localhost:8080/", 121 | "height": 35 122 | }, 123 | "id": "7ijFXJ9w9cmL", 124 | "outputId": "3b3931ca-17aa-46c0-9c39-82d9f54fce52" 125 | }, 126 | "source": [ 127 | "train_data" 128 | ], 129 | "execution_count": 7, 130 | "outputs": [ 131 | { 132 | "output_type": "execute_result", 133 | "data": { 134 | "application/vnd.google.colaboratory.intrinsic+json": { 135 | "type": "string" 136 | }, 137 | "text/plain": [ 138 | "'train.zip'" 139 | ] 140 | }, 141 | "metadata": { 142 | "tags": [] 143 | }, 144 | "execution_count": 7 145 | } 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "metadata": { 151 | "id": "84zjs4w-9eFP" 152 | }, 153 | "source": [ 154 | "import zipfile\n", 155 | "import cv2" 156 | ], 157 | "execution_count": 8, 158 | "outputs": [] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "metadata": { 163 | "id": "jfmlX44b9gKs" 164 | }, 165 | "source": [ 166 | "archive=zipfile.ZipFile(FLDR+train_data)\n", 167 | 
"archive.extractall(FLDR+'extracted')\n", 168 | " " 169 | ], 170 | "execution_count": 10, 171 | "outputs": [] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "metadata": { 176 | "id": "RAyHVYFr-l-s" 177 | }, 178 | "source": [ 179 | "files_zip_full=os.listdir(FLDR)" 180 | ], 181 | "execution_count": 11, 182 | "outputs": [] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "metadata": { 187 | "colab": { 188 | "base_uri": "https://localhost:8080/" 189 | }, 190 | "id": "SA2ZxrXpCYD0", 191 | "outputId": "6d734ce0-6df2-484a-9855-87c34d04f54d" 192 | }, 193 | "source": [ 194 | "files_zip_full" 195 | ], 196 | "execution_count": 12, 197 | "outputs": [ 198 | { 199 | "output_type": "execute_result", 200 | "data": { 201 | "text/plain": [ 202 | "['sampleSubmission.csv', 'test1.zip', 'train.zip', 'weights.h5', 'extracted']" 203 | ] 204 | }, 205 | "metadata": { 206 | "tags": [] 207 | }, 208 | "execution_count": 12 209 | } 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "metadata": { 215 | "id": "LlZZC9fZCZOA" 216 | }, 217 | "source": [ 218 | "test_data=files_zip_full[1]" 219 | ], 220 | "execution_count": 13, 221 | "outputs": [] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "metadata": { 226 | "id": "EBCbbLJLCeZe" 227 | }, 228 | "source": [ 229 | "archive=zipfile.ZipFile(FLDR+test_data)\n", 230 | "archive.extractall(FLDR+'extracted_2')" 231 | ], 232 | "execution_count": 14, 233 | "outputs": [] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "metadata": { 238 | "id": "zxDaGApQCoHh" 239 | }, 240 | "source": [ 241 | "files_zip_full=os.listdir(FLDR)" 242 | ], 243 | "execution_count": 15, 244 | "outputs": [] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "metadata": { 249 | "colab": { 250 | "base_uri": "https://localhost:8080/" 251 | }, 252 | "id": "wYHKOQV9DD7j", 253 | "outputId": "9ea2c5f5-17c1-47f0-ea8b-8789e23c7f7f" 254 | }, 255 | "source": [ 256 | "files_zip_full" 257 | ], 258 | "execution_count": 16, 259 | "outputs": [ 260 | { 261 | "output_type": "execute_result", 262 | "data": { 263 | "text/plain": [ 264 | "['sampleSubmission.csv',\n", 265 | " 'test1.zip',\n", 266 | " 'train.zip',\n", 267 | " 'weights.h5',\n", 268 | " 'extracted',\n", 269 | " 'extracted_2']" 270 | ] 271 | }, 272 | "metadata": { 273 | "tags": [] 274 | }, 275 | "execution_count": 16 276 | } 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "metadata": { 282 | "id": "ddhqVEEpDE0g" 283 | }, 284 | "source": [ 285 | "training_data=files_zip_full[4]+'/train/'" 286 | ], 287 | "execution_count": 19, 288 | "outputs": [] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "metadata": { 293 | "colab": { 294 | "base_uri": "https://localhost:8080/", 295 | "height": 35 296 | }, 297 | "id": "kJNcDR3oDa3E", 298 | "outputId": "87e3df25-3300-4638-f3ee-6dad74b3accd" 299 | }, 300 | "source": [ 301 | "training_data" 302 | ], 303 | "execution_count": 20, 304 | "outputs": [ 305 | { 306 | "output_type": "execute_result", 307 | "data": { 308 | "application/vnd.google.colaboratory.intrinsic+json": { 309 | "type": "string" 310 | }, 311 | "text/plain": [ 312 | "'extracted/train/'" 313 | ] 314 | }, 315 | "metadata": { 316 | "tags": [] 317 | }, 318 | "execution_count": 20 319 | } 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "metadata": { 325 | "id": "YNbvnq27DeBo" 326 | }, 327 | "source": [ 328 | "Labels=[]\n", 329 | "Images=[]\n", 330 | "\n", 331 | "for f_n in os.listdir(FLDR+training_data):\n", 332 | " print(f_n)\n", 333 | " if 'cat' in f_n or 'Cat' in f_n:\n", 334 | " 
img=cv2.imread(FLDR+training_data+f_n,cv2.IMREAD_UNCHANGED)\n", 335 | " img=cv2.resize(img,(224,224))\n", 336 | " Images.append(img)\n", 337 | " Labels.append(0)\n", 338 | " if 'dog' in f_n or 'Dog' in f_n:\n", 339 | " img=cv2.imread(FLDR+training_data+f_n,cv2.IMREAD_UNCHANGED)\n", 340 | " img=cv2.resize(img,(224,224))\n", 341 | " Images.append(img)\n", 342 | " Labels.append(1)" 343 | ], 344 | "execution_count": null, 345 | "outputs": [] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "metadata": { 350 | "id": "MEIRVBygH6N-" 351 | }, 352 | "source": [ 353 | "import numpy as np" 354 | ], 355 | "execution_count": 35, 356 | "outputs": [] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "metadata": { 361 | "id": "jbqkghjdE11_" 362 | }, 363 | "source": [ 364 | "from sklearn.model_selection import train_test_split\n", 365 | "X_train,X_val,Y_train,Y_val= train_test_split(np.array(Images),np.array(Labels),test_size=0.15)" 366 | ], 367 | "execution_count": 44, 368 | "outputs": [] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "metadata": { 373 | "id": "lfOl_TUdG0Sn" 374 | }, 375 | "source": [ 376 | "from tensorflow.keras.models import Model,Sequential\n", 377 | "from tensorflow.keras.layers import Dropout, Flatten, Dense, Input\n", 378 | "from tensorflow.keras.applications.resnet_v2 import ResNet50V2\n", 379 | "from keras import optimizers\n", 380 | "from keras.callbacks import ModelCheckpoint" 381 | ], 382 | "execution_count": 55, 383 | "outputs": [] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "metadata": { 388 | "id": "NWd5gBorLyrs" 389 | }, 390 | "source": [ 391 | "def model(shape=(224,224,3)):\n", 392 | " input_tensor=Input(shape=shape) \n", 393 | "\n", 394 | " base_model = ResNet50V2(weights=\"imagenet\", include_top=False, input_tensor=input_tensor) \n", 395 | "\n", 396 | " model = Sequential([\n", 397 | " base_model,\n", 398 | " Flatten(), \n", 399 | " Dense(512, activation='relu'),\n", 400 | " Dropout(0.3), \n", 401 | " Dense(1, activation='sigmoid') \n", 402 | " ])\n", 403 | "\n", 404 | " for layer in model.layers:\n", 405 | " layer.trainable = True\n", 406 | " model.compile(optimizer=optimizers.Adam(lr=0.01), loss='binary_crossentropy',metrics=['accuracy'])\n", 407 | " return model\n" 408 | ], 409 | "execution_count": 58, 410 | "outputs": [] 411 | }, 412 | { 413 | "cell_type": "code", 414 | "metadata": { 415 | "id": "qnnxG0FjHKwO" 416 | }, 417 | "source": [ 418 | "Model=model()" 419 | ], 420 | "execution_count": 59, 421 | "outputs": [] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "metadata": { 426 | "colab": { 427 | "base_uri": "https://localhost:8080/" 428 | }, 429 | "id": "mNJaRrncHXPF", 430 | "outputId": "2e4be0ab-9539-4dcf-e54f-b7835c027db8" 431 | }, 432 | "source": [ 433 | "Model.summary()" 434 | ], 435 | "execution_count": 60, 436 | "outputs": [ 437 | { 438 | "output_type": "stream", 439 | "text": [ 440 | "Model: \"sequential_1\"\n", 441 | "_________________________________________________________________\n", 442 | "Layer (type) Output Shape Param # \n", 443 | "=================================================================\n", 444 | "resnet50v2 (Functional) (None, 7, 7, 2048) 23564800 \n", 445 | "_________________________________________________________________\n", 446 | "flatten_4 (Flatten) (None, 100352) 0 \n", 447 | "_________________________________________________________________\n", 448 | "dense_8 (Dense) (None, 512) 51380736 \n", 449 | "_________________________________________________________________\n", 450 | "dropout_4 (Dropout) (None, 512) 0 \n", 451 | 
"_________________________________________________________________\n", 452 | "dense_9 (Dense) (None, 1) 513 \n", 453 | "=================================================================\n", 454 | "Total params: 74,946,049\n", 455 | "Trainable params: 74,900,609\n", 456 | "Non-trainable params: 45,440\n", 457 | "_________________________________________________________________\n" 458 | ], 459 | "name": "stdout" 460 | } 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "metadata": { 466 | "id": "TQ-yjFaWHZZE" 467 | }, 468 | "source": [ 469 | "checkpoint = ModelCheckpoint(FLDR+\"Model.h5\", monitor='val_accuracy', save_best_only=True, mode='max') #creating checkpoint to save the best validation accuracy\n", 470 | "callbacks_list = [checkpoint]\n", 471 | "history = Model.fit(X_train,Y_train,validation_data =(X_val,Y_val),callbacks=callbacks_list,verbose=1,epochs=30,batch_size=32)" 472 | ], 473 | "execution_count": null, 474 | "outputs": [] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "metadata": { 479 | "id": "zrlI1DtGHqGB" 480 | }, 481 | "source": [ 482 | "" 483 | ], 484 | "execution_count": null, 485 | "outputs": [] 486 | } 487 | ] 488 | } -------------------------------------------------------------------------------- /Transfer Learning/VGG/README.md: -------------------------------------------------------------------------------- 1 | ### VGG Network 2 | 3 | VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford. It was one of the famous model submitted to ILSVRC-2014. The architecture was put forward in a paper titled "Very Deep Convolutional Networks for Large-Scale Image Recognition”. The model is trained on the Imagenet dataset. 4 | 5 | The paper proposes, several variants of the main stem architecture, namely, VGG-11, VGG-16, VGG-19, and compares thier relative performances. The VGG architecture modified the architecture by replacing large-sized kernel in AlexNet, with multiple 3x3 sized kernels in a repetative manner creating blocks, everytime increasing the number of filters. 6 | 7 | #### Architecture 8 | 9 | ![VGG16](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/04/VGG-2-850x208.png) 10 | 11 | The above image describes the architecture of the VGG16 network. It consista of total 13 convolutional layers, divided into 5 blocks, first having 2 convolution layers and the rest 3 each. Maxpooling layer has been attached to the last convolution layer of each block. There are 64 filters for each convolutional layer in the first block, 128 filters for the 2nd block, 256 filters for each layer in the 3rd block, and 512 filters for each layer for both the 4th and 5th block of the architecture. Maxpooling is used to reduce the dimension, of the image. We start with a image of dimension (224,224,3), which is reduced to (7x7) after the 5the convolutional block. 12 | 13 | The 7x7 feature map is then flattened to obtain a vector which is passed through 3 fully connected layers. The last output layer has 1000 nodes, for the 1000 class classification. 14 | 15 | The other variants proposed are obtained by altering the network slightly. 16 | 17 | ![Variants](https://neurohive.io/wp-content/uploads/2018/11/Capture-564x570.jpg) 18 | 19 | In the above comparisons, A is VGG 11, B is VGG 13, D is VGG-16 and E is VGG-19. VGG-19 is seen to have a slightly better performance than the state of the art VGG-16 model. 20 | 21 | References: 22 | 1. https://arxiv.org/pdf/1409.1556.pdf 23 | 2. 
https://neurohive.io/en/popular-networks/vgg16/ 24 | 3. geeksforgeeks.org/vgg-16-cnn-model/ 25 | 26 | -------------------------------------------------------------------------------- /Transfer Learning/Xception/README.md: -------------------------------------------------------------------------------- 1 | ### Xception 2 | 3 | Xception stands for "extreme Inception". It was proposed by Google researchers in 2017 and trained on the ImageNet dataset. 4 | 5 | The Xception network takes the Inception idea one step further. As we have already seen, the Inception module uses 1x1 convolutional layers for dimensionality reduction; the Xception network instead implements depthwise separable convolutional layers in place of the Inception modules. 6 | 7 | The idea behind **depthwise separable convolution** is that it separates the convolution into a spatial, per-channel (depthwise) operation and a cross-channel (pointwise) operation. In the original separable convolution process, we have a depthwise convolution followed by a pointwise convolution. 8 | 9 | ![depsep](https://miro.medium.com/max/770/1*VvBTMkVRus6bWOqrK1SlLQ.png) 10 | 11 | As the above image shows, the depthwise convolution is the spatial convolution, so one filter is convolved with one channel at a time; for z channels in the input, we will have z nxn spatial convolutions. It is followed by the pointwise convolution, which changes the channel dimension: the resulting feature map is convolved with 1x1 filters to produce the desired number of output channels, handling the cross-channel mixing. This separation boosts performance, decreases the number of parameters and makes the model lighter. 12 | 13 | Xception slightly modifies the architecture of the original depthwise separable convolution. The original applies the depthwise convolution first and the pointwise convolution second; Xception reverses this order, which has been shown to make little difference to the result. Another modification is that, unlike Inception, the Xception module has no intermediate non-linearity (ReLU activation) between the two operations. 14 | 15 | #### Architecture 16 | 17 | ![Arch](https://miro.medium.com/max/770/1*hOcAEj9QzqgBXcwUzmEvSg.png) 18 | 19 | The overall architecture of the Xception network has 3 different stages: the entry flow, the middle flow and the exit flow. It applies separable convolutions together with residual skip connections. The architecture is similar to the Inception networks, with the Inception modules replaced. 20 | 21 | References: \ 22 | 23 | 1. https://arxiv.org/pdf/1610.02357.pdf \ 24 | 2. https://towardsdatascience.com/an-intuitive-guide-to-deep-network-architectures-65fdc477db41 25 | 3. https://towardsdatascience.com/review-xception-with-depthwise-separable-convolution-better-than-inception-v3-image-dc967dd42568 26 | 4. https://maelfabien.github.io/deeplearning/xception/#the-limits-of-convolutions 27 | 28 | 29 | 30 | -------------------------------------------------------------------------------- /Unsupervised Learning/README.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Vision.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/abr-98/Guide-to-Computer-Vision/be08627be27db5f2f98e600bc043b2c83de3750f/Vision.png --------------------------------------------------------------------------------