# hand-gesture-recognition-deep-learning
Project to recognize hand gestures using state-of-the-art neural networks.

## Team

Aman Srivastava [https://www.linkedin.com/in/amansrivastava1/]

Nurul Q Khan [https://www.linkedin.com/in/nurulquamar/]

Prakash Srinivasan [https://www.linkedin.com/in/prakash-srinivasan-6641812/]

Tim Kumar [https://www.linkedin.com/in/tim-kumar-b1519252/]

## Problem Statement
Imagine you are working as a data scientist at a home-electronics company which manufactures state-of-the-art smart televisions. You want to develop a cool feature in the smart TV that can recognise five different gestures performed by the user, which will help users control the TV without using a remote.

The gestures are continuously monitored by the webcam mounted on the TV. Each gesture corresponds to a specific command:

- Thumbs up: increase the volume
- Thumbs down: decrease the volume
- Left swipe: 'jump' backwards 10 seconds
- Right swipe: 'jump' forward 10 seconds
- Stop: pause the movie

Each video is a sequence of 30 frames (images).

## Understanding the Dataset
The training data consists of a few hundred videos categorised into one of the five classes. Each video (typically 2-3 seconds long) is divided into a sequence of 30 frames (images). These videos were recorded by various people performing one of the five gestures in front of a webcam - similar to what the smart TV will use.

The data is in a [zip](https://drive.google.com/uc?id=1ehyrYBQ5rbQQe6yL4XbLWe3FMvuVUGiL) file. The zip file contains a 'train' and a 'val' folder, with two CSV files, one for each folder.
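Each training example is therefore a short clip of 30 frame images that must be sampled, cropped, and normalised before it reaches the network. A minimal preprocessing sketch (the frame count kept, crop size, and webcam resolution below are illustrative assumptions, not values documented by the dataset):

```python
import numpy as np

def sample_frame_indices(total_frames=30, num_samples=15):
    """Evenly spaced frame indices, so a 30-frame gesture keeps its temporal shape."""
    return np.linspace(0, total_frames - 1, num_samples).round().astype(int)

def preprocess_video(frames, size=100):
    """Centre-crop each frame to size x size and scale pixels to [0, 1].

    `frames` is a (num_frames, H, W, 3) uint8 array; the 100x100 crop is an
    illustrative choice, not the project's documented input resolution.
    """
    _, h, w, _ = frames.shape
    top, left = (h - size) // 2, (w - size) // 2
    cropped = frames[:, top:top + size, left:left + size, :]
    return cropped.astype(np.float32) / 255.0

# Synthetic stand-in for one 30-frame video (e.g. 120x160 webcam frames):
video = np.random.randint(0, 256, size=(30, 120, 160, 3), dtype=np.uint8)
clip = preprocess_video(video[sample_frame_indices()])
print(clip.shape)  # (15, 100, 100, 3)
```

A batch generator would apply this per video and stack the results into a `(batch, frames, height, width, 3)` tensor for Conv3D or CNN-RNN models.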
## Model Overview

| Model Name | Model Type | Number of Parameters | Augmented Data | Model Size (MB) | Highest Validation Accuracy | Corresponding Training Accuracy | Observations |
|---|---|---|---|---|---|---|---|
| conv_3d1_model | Conv3D | 1,117,061 | No | NA | 78% | 99% | Model is over-fitting. Augment the data using cropping. |
| conv_3d2_model | Conv3D | 3,638,981 | Yes | 43.8 | 85% | 91% | Model is not over-fitting. Next we will try to reduce the parameter count. Moreover, since we see minor oscillations in the loss, let's try lowering the learning rate to 0.0002. |
| conv_3d3_model | Conv3D | 1,762,613 | Yes | 21.2 | 85% | 83% | Model has stable results, and we were able to cut the parameter count in half. Let's try adding more layers at the same level of abstraction. |
| conv_3d4_model | Conv3D | 2,556,533 | Yes | 30.8 | 76% | 89% | With more layers added, the model is over-fitting. Let's try adding dropout at the convolution layers. |
| conv_3d5_model | Conv3D | 2,556,533 | Yes | 30.8 | 70% | 89% | Adding dropout further reduced validation accuracy: the model is not able to learn generalisable features and over-fits even more. |
| conv_3d6_model | Conv3D | 696,645 | Yes | 8.46 | 77% | 92% | Reduced the number of network parameters by lowering the image resolution, filter sizes and dense-layer neurons. Comparably good validation accuracy. |
| conv_3d7_model | Conv3D | 504,709 | Yes | 6.15 | 77% | 85% | |
| conv_3d8_model | Conv3D | 230,949 | Yes | 2.87 | 78% | 86% | |
| rnn_cnn1_model | CNN-LSTM | 1,657,445 | Yes | 20 | 75% | 92% | Model is over-fitting. Let's try reducing the number of layers in the next iteration. |

## Models with More Data Augmentation

| Model Name | Model Type | Number of Parameters | Augmented Data | Model Size (MB) | Highest Validation Accuracy | Corresponding Training Accuracy |
|---|---|---|---|---|---|---|
| conv_3d10_model | Conv3D | 3,638,981 | Yes | 43.8 | 86% | 86% |
| conv_3d11_model | Conv3D | 1,762,613 | Yes | 21.2 | 78% | 79% |
| conv_3d12_model | Conv3D | 2,556,533 | Yes | 30.8 | 81% | 84% |
| conv_3d13_model | Conv3D | 2,556,533 | Yes | 30.8 | 31% | 78% |
| conv_3d14_model | Conv3D | 696,645 | Yes | 8.46 | 77% | 87% |
| conv_3d15_model | Conv3D | 504,709 | Yes | 6.15 | 75% | 82% |
| conv_3d16_model | Conv3D | 230,949 | Yes | 2.87 | 76% | 77% |
| rnn_cnn2_model | CNN-LSTM | 1,346,021 | Yes | 31 | 78% | 96% |

## Transfer Learning Models (CNN + RNN)
MobileNet was chosen as the backbone because it has fewer parameters than the Inception and ResNet models.

| Model Name | Number of Parameters | Augmented Data | Model Size (MB) | Highest Validation Accuracy | Corresponding Training Accuracy | Observations |
|---|---|---|---|---|---|---|
| rnn_cnn_tl_model | 3,840,453 | Yes | 20.4 | 56% | 85% | In this experiment the MobileNet layer weights are not trained. Validation accuracy is very poor, so let's train the MobileNet weights as well. |
| rnn_cnn_tl2_model | 3,692,869 | Yes | 42.3 | 97% | 99% | We get much better accuracy when the MobileNet weights are trained as well. |
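The two transfer-learning rows differ only in whether the MobileNet backbone is trainable. A minimal Keras sketch of that CNN + RNN layout (the frame count, image size, and LSTM width are illustrative assumptions, and `weights=None` keeps the sketch offline where the project would load the pretrained ImageNet weights):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, H, W, NUM_CLASSES = 15, 64, 64, 5  # illustrative sizes, not the project's exact ones

# Per-frame CNN backbone; the project would presumably use weights="imagenet".
backbone = tf.keras.applications.MobileNet(
    include_top=False, weights=None, input_shape=(H, W, 3), pooling="avg")
backbone.trainable = True  # set False to reproduce the frozen rnn_cnn_tl_model variant

inputs = tf.keras.Input(shape=(NUM_FRAMES, H, W, 3))
x = layers.TimeDistributed(backbone)(inputs)       # CNN features for each frame
x = layers.LSTM(64)(x)                             # aggregate features over time
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Freezing the backbone (`backbone.trainable = False`) trains only the LSTM and dense head, which matched the poor 56% validation accuracy above; unfreezing it corresponds to the much stronger rnn_cnn_tl2_model run.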
| 66 | 67 | ## Note: If notebook doesnt load then view it here: https://nbviewer.jupyter.org/github/amanrocks11/hand-gesture-recognition-deep-learning/blob/master/Gesture_Recognition_Final.ipynb 68 | --------------------------------------------------------------------------------