├── .gitignore
├── Neural Image Caption Generator.ipynb
├── README.md
└── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | data/
2 | checkpoint/
3 | .ipynb_checkpoints/
4 | Image-Captioning.html
5 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Neural Image Caption Generator
2 |
3 |
4 |
5 |
6 |
7 | The purpose of image captioning is to automatically generate text describing a picture. In the past few years, it has become a topic of increasing interest in machine learning, and advances in this field have led to models that, depending on the benchmark, can score even higher than humans.
8 |
9 |
10 | ## 1. Requirements
11 |
12 | ### 1.1. Recommended system
13 |
14 | I used a free Gradient notebook from [Paperspace](https://www.paperspace.com/) with:
15 |
16 | - 8 CPUs
17 | - 16 GB of GPU memory
18 | - 32 GB of RAM
19 |
20 | ### 1.2. Required libraries
21 |
22 | ```shell
23 | $ pip install -r requirements.txt
24 | $ python -m spacy download en
25 | $ python -m spacy download en_core_web_lg
26 | ```
27 |
28 | ## 2. Data
29 |
30 | The **Flickr8k dataset** consists of pairs of images and their corresponding captions (there are five captions for each image).
31 |
32 | It can be downloaded directly using the links below:
33 | - [images](https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip)
34 | - [captions](https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip)
35 |
36 | ## 3. Model structure
37 |
38 | We use the **encoder-decoder architecture** combined with an **attention mechanism**. The encoder extracts image representations, and the decoder generates image captions from them.
39 |
40 |
41 |
42 |
43 |
44 | ### 3.1. Encoder - Convolutional Neural Network
45 |
46 | A CNN is a class of deep, feed-forward artificial neural networks that has been applied successfully to analyzing visual imagery. A CNN consists of an input layer and an output layer, as well as multiple hidden layers. The hidden layers typically consist of convolutional layers, pooling layers, fully connected layers, and normalization layers. CNNs have many applications, such as image and video recognition, recommender systems, and natural language processing.
47 |
48 | We fine-tune a pretrained ResNet model with its last two layers removed, keeping the spatial feature map as the image representation.
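The snippet below is a minimal PyTorch sketch of such an encoder (not the notebook's exact code): it loads a pretrained ResNet, drops the last two layers (average pooling and the classification head), and returns one feature vector per image region. The ResNet variant (`resnet50`) and the feature dimensions are assumptions.

```python
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """Extract a grid of region features with a pretrained ResNet."""

    def __init__(self, fine_tune=True):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        # Drop the last two layers (adaptive average pooling and the
        # fully connected classifier) to keep the spatial feature map.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        for p in self.backbone.parameters():
            p.requires_grad = fine_tune  # fine-tune the whole backbone

    def forward(self, images):
        # images: (batch, 3, 224, 224) -> feature map: (batch, 2048, 7, 7)
        feats = self.backbone(images)
        # Flatten the 7x7 grid into 49 region vectors of size 2048.
        return feats.flatten(2).transpose(1, 2)
```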
49 |
50 |
51 |
52 |
53 |
54 | ### 3.2. Decoder - Long Short-Term Memory
55 |
56 | The LSTM is a recurrent neural network capable of learning long-term dependencies. An LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The hidden state is augmented with nonlinear gating mechanisms that allow the state to propagate without modification, be updated, or be reset, using simple learned gating functions. LSTMs work tremendously well on various problems, such as natural language text compression, handwriting recognition, and electric load forecasting.
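A minimal sketch of one decoder step is shown below. The embedding and hidden sizes are placeholder hyper-parameters, and `context` stands for the attention-weighted image summary described in section 3.3; the notebook's actual decoder may differ in detail.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """One-step LSTM decoder over caption tokens."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The LSTM cell sees the previous word embedding concatenated with
        # the attention context (the weighted image summary, section 3.3).
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def step(self, token, context, state):
        # token: (batch,) word indices; context: (batch, feat_dim); state: (h, c)
        x = torch.cat([self.embed(token), context], dim=1)
        h, c = self.lstm(x, state)
        logits = self.fc(h)  # unnormalised scores over the vocabulary
        return logits, (h, c)
```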
57 |
58 |
59 |
60 |
61 |
62 | ### 3.3. Attention Mechanism - Bahdanau style
63 |
64 | As the name suggests, at each step of the decoder the attention module uses a direct connection to the encoder to focus on a particular part of the source image.
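A minimal sketch of additive (Bahdanau-style) attention is given below; the feature and hidden dimensions are assumed to match the encoder and decoder sketches above.

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention over the encoder's region features."""

    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        # features: (batch, regions, feat_dim); hidden: (batch, hidden_dim)
        scores = self.v(torch.tanh(self.w_feat(features)
                                   + self.w_hidden(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)      # attention weight per region
        context = (alpha * features).sum(dim=1)   # weighted image summary
        return context, alpha.squeeze(-1)
```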
65 |
66 |
67 |
68 |
69 |
70 |
71 | ## 4. Training
72 |
73 | - Number of training parameters: 136,587,749
74 | - Learning rate: 3e-5
75 | - Teacher forcing ratio: 0. This means the decoder is not fed the ground-truth caption during training; instead, each step receives the token generated at the previous step.
76 | - Number of epochs: 15
77 | - Batch size: 32
78 | - Loss function: Cross-entropy
79 | - Optimizer: RMSProp
80 | - Metrics: Top-5 accuracy & BLEU (more below)
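The sketch below shows how these settings could fit together in a single training step. It reuses the hypothetical `Encoder`, `Decoder` and `BahdanauAttention` modules sketched in section 3; the vocabulary size and the zero-initialised LSTM state are placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical glue code: Encoder, Decoder and BahdanauAttention are the
# sketches from section 3, and the vocabulary size is a placeholder.
encoder, decoder, attention = Encoder(), Decoder(vocab_size=5000), BahdanauAttention()
params = [p for m in (encoder, decoder, attention) for p in m.parameters()]
optimizer = torch.optim.RMSprop(params, lr=3e-5)   # learning rate from the list above
criterion = nn.CrossEntropyLoss()                  # cross-entropy loss

def train_step(images, captions, hidden_dim=512):
    """One optimization step with teacher forcing ratio 0 (self-feeding)."""
    optimizer.zero_grad()
    features = encoder(images)                     # (batch, 49, 2048)
    h = torch.zeros(images.size(0), hidden_dim)
    c = torch.zeros(images.size(0), hidden_dim)
    token = captions[:, 0]                         # <sos> tokens
    loss = 0.0
    for t in range(1, captions.size(1)):
        context, _ = attention(features, h)
        logits, (h, c) = decoder.step(token, context, (h, c))
        loss = loss + criterion(logits, captions[:, t])
        token = logits.argmax(dim=1)               # feed back our own prediction
    loss.backward()
    optimizer.step()
    return loss.item()
```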
81 |
82 | ## 5. Evaluation - BLEU (Bilingual Evaluation Understudy)
83 |
84 | BLEU compares the machine-generated captions to one or several human-written reference caption(s) and computes a similarity score based on:
85 | - N-gram precision (we use 4-grams here)
86 | - Plus a brevity penalty for captions that are too short
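For illustration, the pinned torchtext version ships a corpus-level BLEU helper; the tokenised sentences below are made up, and a score of 1.0 would mean a perfect match.

```python
from torchtext.data.metrics import bleu_score

# Toy example of corpus-level BLEU-4: one generated caption scored against
# two (made-up) reference captions.
candidate = [["a", "dog", "runs", "on", "the", "grass"]]
references = [[
    ["a", "dog", "runs", "on", "the", "green", "grass"],
    ["a", "brown", "dog", "runs", "on", "grass"],
]]
print(bleu_score(candidate, references, max_n=4, weights=[0.25] * 4))
```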
87 |
88 | ## 6. Inference - Beam Search
89 | While training, we use greedy decoding by taking the argmax at each step. The problem with this method is that there is no way to undo a decision. At inference time, we instead use beam search: at each step of the decoder we keep track of the k most probable partial captions (which we call hypotheses), where k is the beam size (here 5). Nevertheless, beam search is not guaranteed to find the optimal solution.
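A minimal sketch of the procedure is given below; `step_fn` is a hypothetical callable returning log-probabilities over the vocabulary for the next word given a partial caption.

```python
def beam_search(step_fn, start_token, eos_token, beam_size=5, max_len=20):
    """Keep the `beam_size` most probable partial captions (hypotheses)."""
    beams = [([start_token], 0.0)]                 # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_token:            # finished hypotheses stay as they are
                candidates.append((tokens, score))
                continue
            log_probs = step_fn(tokens)            # log P(next word | partial caption)
            top_lp, top_ix = log_probs.topk(beam_size)
            for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                candidates.append((tokens + [ix], score + lp))
        # keep only the k best hypotheses
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(t[-1] == eos_token for t, _ in beams):
            break
    return beams[0][0]                             # best-scoring hypothesis
```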
90 |
91 | ## References
92 | - Olah, C., & Carter, S. (2016). Attention and Augmented Recurrent Neural Networks.
93 | - Papineni, K., Roukos, S., Ward, T., & Zhu, W. (2002). Bleu: a Method for Automatic Evaluation of Machine Translation. ACL.
94 | - Jay Alammar - Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) - [link](http://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)
95 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | joblib==0.13.2
2 | numpy==1.22.0
3 | Pillow==10.0.1
4 | spacy==2.2.4
5 | torch==1.13.1
6 | torchtext==0.6.0
7 | torchvision==0.4.0a0+d31eafa
8 | tqdm==4.46.1
9 |
--------------------------------------------------------------------------------