- Clone this repository:

  ```
  git clone https://github.com/lucasace/Image_Captioning.git
  ```
- Download the Flickr8k image and text datasets from here and here respectively
- Unzip both the image and text archives and place them inside the repository folder
To train the model, run the following (a full example invocation is shown after the argument list):

```
python3 main.py --type train --checkpoint_dir <checkpointdir> --cnnmodel <cnnmodel> --image_folder <imagefolder location> --caption_file <location of Flickr8k.token.txt> --feature_extraction <True or False>
```
- `checkpoint_dir` is where your model checkpoints will be saved
- `cnnmodel` is either `inception` or `vgg16`; the default is `inception`
- `image_folder` is the location of the folder containing all the images
- `caption_file` is the location of `Flickr8k.token.txt`
- `feature_extraction` is `True` or `False`; the default is `True`
  - `True` if you haven't extracted the image features yet
  - `False` if you have already extracted them; this saves time and memory when training again (a sketch of what feature extraction involves follows this list)
- `batch_size` is the batch size for training and validation; the default is 128
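For example, assuming the images were unzipped to `Flickr8k_Dataset/` and the token file sits in the repository root (both paths are assumptions about your local layout), a training run could look like:

```
python3 main.py --type train --checkpoint_dir ./checkpoints --cnnmodel inception --image_folder ./Flickr8k_Dataset --caption_file ./Flickr8k.token.txt --feature_extraction True
```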
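To illustrate what feature extraction means conceptually, here is a minimal, hypothetical sketch of extracting and caching InceptionV3 features with Keras. This is not the repository's actual code, and the file names are made up:

```python
import numpy as np
import tensorflow as tf

# InceptionV3 without its classification head, used as a fixed feature extractor.
encoder = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')

def extract_and_cache(image_path, cache_path):
    # InceptionV3 expects 299x299 RGB inputs.
    img = tf.keras.preprocessing.image.load_img(image_path, target_size=(299, 299))
    x = tf.keras.preprocessing.image.img_to_array(img)[np.newaxis, ...]
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    features = encoder.predict(x)   # shape: (1, 8, 8, 2048)
    np.save(cache_path, features)   # cache so a later run can skip extraction

# Hypothetical file names, for illustration only.
extract_and_cache('667626_18933d713e.jpg', '667626_18933d713e.npy')
```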
To test the model, run:

```
python3 main.py --type test --checkpoint_dir <checkpointdir> --cnnmodel <cnnmodel> --image_folder <imagefolder location> --caption_file <location of Flickr8k.token.txt> --feature_extraction <True or False>
```
- Download the checkpoints from here if your `cnnmodel` is `inception`, or from here if it is `vgg16`; alternatively, use your own trained checkpoints
- All arguments are the same as for training
To caption a new image, run:

```
python3 main.py --type caption --checkpoint_dir <checkpointdir> --cnnmodel <cnnmodel> --caption_file <location of Flickr8k.token.txt> --to_caption <image file path to caption>
```
- Download the checkpoints from here
  - Note: these are `inception` checkpoints; for `vgg16`, download from here
- `caption_file` is required to build the vocabulary (see the sketch below)
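As a rough illustration of why the caption file is needed even when only captioning, here is a hypothetical sketch of building a vocabulary from `Flickr8k.token.txt` (each line is `<image>#<n>` and a caption separated by a tab). The repository's actual tokenization may differ:

```python
from collections import Counter

def build_vocab(token_file, min_count=1):
    counts = Counter()
    with open(token_file) as f:
        for line in f:
            # Format: "1000268201_693b08cb0e.jpg#0\tA child in a pink dress ..."
            _, caption = line.rstrip('\n').split('\t', 1)
            counts.update(caption.lower().split())
    # Keep words seen at least min_count times; the word-to-index mapping
    # must match the one used during training.
    return sorted(w for w, c in counts.items() if c >= min_count)

vocab = build_vocab('Flickr8k.token.txt')
```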
If you want to train on a custom dataset, modify dataset.py to suit your dataset.
| Model Type | CNN Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR |
|---|---|---|---|---|---|---|
| Encoder-Decoder | Inception_V3 | 60.12 | 51.1 | 48.13 | 39.5 | 25.8 |
| Encoder-Decoder | VGG16 | 58.46 | 49.87 | 47.50 | 39.37 | 26.32 |
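The BLEU and METEOR columns can be reproduced in spirit with NLTK (which the references below credit for the METEOR score). This is a generic sketch on toy data, not the repository's evaluation script:

```python
import nltk
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.meteor_score import meteor_score

nltk.download('wordnet')  # METEOR relies on WordNet

# Toy data: one image, one reference caption, one generated caption (tokenized).
references = [[['a', 'dog', 'runs', 'across', 'the', 'grass']]]
hypotheses = [['a', 'dog', 'is', 'running', 'on', 'grass']]

bleu_1 = corpus_bleu(references, hypotheses, weights=(1, 0, 0, 0))
bleu_4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
# Recent NLTK versions expect pre-tokenized references and hypothesis.
meteor = meteor_score(references[0], hypotheses[0])
print(bleu_1, bleu_4, meteor)
```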
Some planned improvements:
- Beam search (a decoding sketch follows this list)
- Image captioning using soft and hard attention
- Image captioning using adversarial training
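For reference, here is a hedged sketch of what beam-search decoding could look like on top of a trained decoder. `step` is a hypothetical stand-in that maps a partial token sequence to log-probabilities over the vocabulary; it is not a function in this repository:

```python
import heapq
import numpy as np

def beam_search(step, start_id, end_id, beam_width=3, max_len=20):
    # Each beam is a (cumulative log-probability, token sequence) pair.
    beams = [(0.0, [start_id])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_id:      # finished beams pass through unchanged
                candidates.append((score, seq))
                continue
            log_probs = step(seq)      # shape: (vocab_size,)
            # Expand only the beam_width most probable next tokens.
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((score + float(log_probs[tok]), seq + [int(tok)]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda b: b[0])
    return max(beams, key=lambda b: b[0])[1]
```

Unlike greedy decoding, which commits to the single most probable word at every step, this keeps the `beam_width` best partial captions alive and returns the highest-scoring complete one.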
Any contributions are welcome. If there is any issue with the model or errors in the program, feel free to raise an issue or open a PR.
- O. Vinyals, A. Toshev, S. Bengio and D. Erhan, "Show and tell: A neural image caption generator," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 3156-3164, doi: 10.1109/CVPR.2015.7298935.
- TensorFlow documentation on image captioning
- Machine Learning Mastery, for the dataset
- NLTK documentation, for the METEOR score
- RNN lecture by Stanford University