A two-branch similarity neural network that performs image retrieval given a text query or caption on the Flickr30k dataset. The two branches encode the images and the captions into 128-dimensional embedding vectors, which are then compared using cosine similarity to rank the output results.

The model consists of two neural networks (one for the images and one for the captions) and was trained using the triplet loss function.
The Flickr30k dataset was used to train the models.
- The process for getting the dataset can be found here (the dataset is also available on Kaggle).
- The splits for train, val, and test were taken from here.

The dataset consists of 31,783 images, of which 1,000 each were used for validation and testing.
- The image branch (on the left) takes in an image of size 256 × 256.
- The Inception network has fixed weights and is not trainable. (The embedding for each image was generated beforehand using Inception V3 to save memory.)
- The caption branch (on the right) takes in stemmed captions.
- The tokenizer encodes the input into a one-hot array of size 13,388 (the vocabulary size).
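The forward pass of the two branches can be sketched as follows. This is a minimal NumPy sketch, not the actual model: the projection weights are random placeholders (the real model learns them), and the 2048-dimensional Inception feature size is an assumption.

```python
import numpy as np

EMB_DIM = 128          # shared embedding size
VOCAB_SIZE = 13_388    # one-hot vocabulary size
INCEPTION_DIM = 2048   # assumed Inception V3 pooled-feature size

rng = np.random.default_rng(0)

# Hypothetical projection weights; learned in the real model.
W_img = rng.standard_normal((INCEPTION_DIM, EMB_DIM)) * 0.01
W_txt = rng.standard_normal((VOCAB_SIZE, EMB_DIM)) * 0.01

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    return v / (np.linalg.norm(v) + eps)

def embed_image(inception_features):
    """Project precomputed Inception V3 features into the joint 128-d space."""
    return l2_normalize(inception_features @ W_img)

def embed_caption(one_hot_bag):
    """Project a one-hot caption vector into the joint 128-d space."""
    return l2_normalize(one_hot_bag @ W_txt)

img_emb = embed_image(rng.standard_normal(INCEPTION_DIM))
cap_emb = embed_caption(rng.integers(0, 2, VOCAB_SIZE).astype(float))
similarity = float(img_emb @ cap_emb)  # cosine similarity, in [-1, 1]
```

Because both embeddings are L2-normalized, ranking by dot product is the same as ranking by cosine similarity.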
- `m`: margin between the positive and negative similarities (parameter)
- `d`: distance function (cosine similarity is used)
- `xi`: training image
- `yp`: positive caption for image `xi`
- `yn`: negative caption for image `xi`
- `yi`: training caption
- `xp`: positive image for caption `yi`
- `xn`: negative image for caption `yi`
More details about the ranking loss can be found here.
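Using the symbols above, the bidirectional triplet loss can be sketched in plain Python/NumPy. This is a minimal sketch under the assumption that `d` is cosine similarity (higher is better), so each term penalizes the negative pair's similarity coming within `m` of the positive pair's; the real model computes this over batches.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, m=0.5, d=cosine):
    """max(0, m - d(anchor, positive) + d(anchor, negative))."""
    return max(0.0, m - d(anchor, positive) + d(anchor, negative))

def bidirectional_loss(xi, yp, yn, yi, xp, xn, m=0.5):
    """Image-to-caption term plus caption-to-image term."""
    return triplet_loss(xi, yp, yn, m) + triplet_loss(yi, xp, xn, m)
```

The loss is zero once the positive similarity exceeds the negative similarity by at least the margin `m`.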
- Close to 1 ~ high similarity
- Close to -1 ~ low similarity
- The training was done using the random sampling method with a batch size of 64 and a margin of 0.5.
- The trained models are available in the `models` folder.
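The random sampling step can be sketched as follows: for each anchor in a batch of 64, a negative is drawn uniformly from the other members of the same batch. This is a minimal sketch of one common way to do it; the exact sampling in the notebook may differ.

```python
import numpy as np

BATCH_SIZE = 64  # from the training setup above

def sample_negative_indices(batch_size, rng):
    """For each anchor i in the batch, pick a random index j != i
    to serve as the negative example."""
    # Draw from [0, batch_size - 2], then shift values at or above
    # the anchor's own index up by one, skipping the anchor itself.
    neg = rng.integers(0, batch_size - 1, size=batch_size)
    neg[neg >= np.arange(batch_size)] += 1
    return neg

rng = np.random.default_rng(42)
neg_idx = sample_negative_indices(BATCH_SIZE, rng)
```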
Ground truth image:
Query: "The surfer is in the wave."
Position in results: 7th
Output results:
Other results can be found in image_text_learning_one_hot.ipynb.
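How a result position like "7th" is computed can be sketched as below. This assumes L2-normalized embeddings (so the dot product equals cosine similarity); `retrieval_rank` is a hypothetical helper for illustration, not a function from the notebook.

```python
import numpy as np

def retrieval_rank(query_emb, image_embs, ground_truth_idx):
    """Rank all images by cosine similarity to the query and return
    the 1-based position of the ground-truth image in the results."""
    sims = image_embs @ query_emb        # one similarity per image
    order = np.argsort(-sims)            # highest similarity first
    return int(np.where(order == ground_truth_idx)[0][0]) + 1

# Toy example with three unit-length image embeddings.
query = np.array([1.0, 0.0, 0.0])
images = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.8, 0.6, 0.0]])
best_rank = retrieval_rank(query, images, ground_truth_idx=1)
```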
- Bryan A. Plummer, et al. "Flickr30K Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models". IJCV 123. 1(2017): 74-93.
- Peter Young, et al. "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions". TACL 2. (2014): 67–78.
- Sethu Hareesh Kolluru. "A neural architecture to learn image-text joint embedding". - link