├── HOW TO RUN.txt
├── README.md
├── image-caption-generator.ipynb
├── output_1.png
├── output_2.png
└── output_3.png

/HOW TO RUN.txt:
--------------------------------------------------------------------------------
IMPORTANT LINKS:

Dataset link: https://www.kaggle.com/shadabhussain/flickr8k

VGG-16 pre-trained weights: https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels.h5


HOW TO EXECUTE THE CODE:

1. Create a new directory and download the image-caption-generator.ipynb file into it.

2. From the dataset link (https://www.kaggle.com/shadabhussain/flickr8k), download the
   complete dataset and save it in your directory.

3. Open the image-caption-generator.ipynb file in Jupyter Notebook.

4. In the 2nd code cell, set the "dir_Flickr_jpg" variable to the path of your "Flicker8k_Dataset" folder,
   and the "dir_Flickr_text" variable to the path of the "Flickr8k.token.txt" file.

5. From the link (https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels.h5),
   download the pre-trained VGG-16 weights for image classification and save them in your directory.

6. In the 12th code cell, set the path to the pre-trained weights (.h5 file) downloaded in the previous step.

7. Run all the cells.

8. After all the cells have finished executing (this takes 2-3 hours), the last cell will open a GUI.

9. Click the "Choose Image" button in the GUI and upload an image from the dataset (or any other image),
   then click the "Classify Image" button to get the model's predicted caption.
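As an illustration, steps 4 and 6 above amount to setting three paths in the notebook. The exact values depend on where you saved the files, so the paths below are placeholders, not the notebook's actual values:

```python
import os

# Placeholder paths -- replace with the locations where you saved the files
dir_Flickr_jpg = os.path.join("Flickr8k", "Flicker8k_Dataset")
dir_Flickr_text = os.path.join("Flickr8k", "Flickr8k.token.txt")
vgg16_weights = "vgg16_weights_tf_dim_ordering_tf_kernels.h5"
```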

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Image-Caption-Generator-with-GUI
Caption generation is a challenging artificial-intelligence problem in which a textual description must be generated for a given photograph. This project is inspired by Andrej Karpathy's well-known blog post "The Unreasonable Effectiveness of Recurrent Neural Networks".

### Dataset

I have used the Flickr8k dataset for this project.
It contains 8,091 images with 5 captions per image (5 * 8091 = 40,455 captions).

**Dataset Source: KAGGLE LINK https://www.kaggle.com/shadabhussain/flickr8k**

### Model Evaluation Metric

BLEU is an evaluation metric that compares a generated sentence with a reference sentence. It compares n-grams (a 1-gram is a single word); a perfect match scores 1.0 and a perfect mismatch scores 0.0.

### Data Preparation
We create a new dataframe called dfword to visualize the distribution of the words. It contains each word and its frequency across all tokens, in decreasing order.

![image](https://user-images.githubusercontent.com/41522782/125451863-97951044-f7f3-432f-9592-c2bed0650bf9.png)

Then we find the 50 most and 50 least frequently appearing words. Here I found that stopwords (like "a" and "the") and punctuation occur most often, so we have to remove them from the dataset in order to clean it. I have implemented 3 functions to clean the captions:
1. remove_punctuation(text)
2. remove_single_character(text)
3. remove_numeric(text)

Then we add start and end tokens to every caption ('startseq ' & ' endseq').

### Image Preparation for VGG16 Model
We will be using the pre-trained VGG16 network.
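The three cleaning functions and the start/end tokens described above could look like the following. The function names come from the notebook; the bodies here are a minimal sketch, my own assumption of one reasonable implementation:

```python
import string

def remove_punctuation(text):
    # Strip all punctuation characters from the caption
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_single_character(text):
    # Drop single-character tokens such as "a" or a stray "s"
    return " ".join(w for w in text.split() if len(w) > 1)

def remove_numeric(text):
    # Keep only purely alphabetic tokens, discarding numbers
    return " ".join(w for w in text.split() if w.isalpha())

def clean_caption(text):
    # Apply all three cleaners, then wrap with the start/end tokens
    text = remove_numeric(remove_single_character(remove_punctuation(text.lower())))
    return "startseq " + text + " endseq"
```

For example, `clean_caption("A dog runs, 2 fast!")` yields `"startseq dog runs fast endseq"`.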

**Model Source: GITHUB LINK https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels.h5**

The VGG16 model takes an input image of size (224, 224, 3), processes it with deep CNN layers, and outputs a tensor of shape (?, 1000), i.e. scores for 1000 classes.
We have removed the last layer from the network, since we only want the features of shape (?, 4096).

After this we reshape our images to (224, 224, 3) and feed them to the model to extract the features for each image.

**Tokenizer**
We now convert our captions into tokens, e.g. (startseq = 1, endseq = 2, etc.), and store them in an array.

### Model Training

**Splitting the dataset**
We split the dataset (all 3 parts, i.e. captions_data, images_data and filenames_data) in a ratio of 0.6 : 0.2 : 0.2 (train : valid : test).
For captions_data we have to pad the sequences, since not all token arrays have the same length: we take the maximum length and pad the shorter token arrays to it.

**Model**
- The input to the model is an image-feature vector of shape 4096.
- First we obtain a 256-unit output from the image features using a Dense layer, and a 256-unit output from the captions using Embedding and LSTM layers.
- The model uses the categorical cross-entropy loss function and the Adam optimizer.
- Then we train the model on our dataset.

### Graphical User Interface
We have used the tkinter library and PIL (Python Imaging Library) to build a GUI with a "Choose Image" button to upload an image and a "Classify Image" button to get the generated caption.
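The merge architecture described under **Model** can be sketched in Keras as below. The 4096-d image features and the 256-unit layer sizes come from the description above; `vocab_size` and `max_length` are placeholder values that in practice come from the tokenizer and the longest padded caption:

```python
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 5000   # placeholder: actual value comes from the tokenizer
max_length = 30     # placeholder: actual value is the longest padded caption

# Image branch: 4096-d VGG16 features -> 256 units
img_in = Input(shape=(4096,))
img_feat = Dense(256, activation="relu")(img_in)

# Caption branch: token sequence -> Embedding -> LSTM -> 256 units
cap_in = Input(shape=(max_length,))
cap_emb = Embedding(vocab_size, 256, mask_zero=True)(cap_in)
cap_feat = LSTM(256)(cap_emb)

# Merge both 256-unit branches and predict the next word over the vocabulary
merged = add([img_feat, cap_feat])
out = Dense(vocab_size, activation="softmax")(Dense(256, activation="relu")(merged))

model = Model(inputs=[img_in, cap_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

At inference time the caption is generated one word at a time: start from "startseq", repeatedly feed the image features plus the tokens so far, and stop at "endseq" or at `max_length`.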
54 | -------------------------------------------------------------------------------- /output_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/akshatchaturvedi28/Image-Caption-Generator-with-GUI/e2923e0d4abce31ece5a5075b5e5556a59140d58/output_1.png -------------------------------------------------------------------------------- /output_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/akshatchaturvedi28/Image-Caption-Generator-with-GUI/e2923e0d4abce31ece5a5075b5e5556a59140d58/output_2.png -------------------------------------------------------------------------------- /output_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/akshatchaturvedi28/Image-Caption-Generator-with-GUI/e2923e0d4abce31ece5a5075b5e5556a59140d58/output_3.png --------------------------------------------------------------------------------