Improving Image Captioning Accuracy Using Advanced Deep Learning Techniques
Navin Chandar Jacob, Kavitha Ganesh and Aakash Sethuraman
Abstract: Image captioning, the task of describing an image concisely and accurately, is a widely used and impactful application of deep learning. Researchers have adopted various strategies to build systems efficient enough for a wide range of real-life applications. The key challenges encountered are twofold: first, the need for a large volume of human-created images and their corresponding captions, and second, the computationally intensive training required to build the model. To tackle both challenges effectively, a novel architecture called the Stacked GAN and Gated Recurrent Units Image Caption generator (STAGRIC) is proposed. The architecture addresses the design concerns of building an efficient and accurate model with limited data. The first objective is accomplished by using a stacked GAN to synthesise images from captions, which are then used to augment the training datasets; this approach supports the generation of an accurate model even when little original data is available. The second objective, building a computationally less intensive model, is accomplished using a GRU-based visual attention mechanism to generate captions from images. The proposed STAGRIC model is tested on the MS COCO dataset, and its evaluation is performed using different combinations of image and caption datasets. The evaluation results demonstrate improved image-captioning metrics, with BLEU-1 scores rising above 75%, higher than those of similar models in this space. Prospective techniques to further improve model performance and produce higher evaluation scores are discussed in the concluding section.
Keywords: Deep Learning; Gated Recurrent Units; Generative Models; Image Captioning; Image Synthesis; Recurrent Neural Networks; Stacked Generative Adversarial Network.
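The GRU-based visual attention mechanism named in the abstract can be sketched at a high level: at each decoding step, image-region features are scored against the decoder's hidden state, their softmax-weighted sum forms a context vector, and a GRU cell consumes that context to update the state that drives the next word prediction. The following is a minimal pure-Python illustration of one such step, not the paper's implementation; the tied scalar gate weights (wz, wr, wh) and the dummy feature vectors are illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(features, hidden):
    """Score each image-region feature against the decoder state
    (dot product) and return the attention-weighted context vector
    together with the attention weights."""
    scores = [sum(fi * hi for fi, hi in zip(f, hidden)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[i] for w, f in zip(weights, features))
               for i in range(dim)]
    return context, weights

def gru_step(x, h, wz=1.0, wr=1.0, wh=1.0):
    """One GRU update with tied scalar gate weights (illustration only):
    update gate z, reset gate r, candidate state, then a convex blend
    of the old state and the candidate."""
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    z = [sig(wz * (xi + hi)) for xi, hi in zip(x, h)]
    r = [sig(wr * (xi + hi)) for xi, hi in zip(x, h)]
    h_cand = [math.tanh(wh * (xi + ri * hi)) for xi, ri, hi in zip(x, r, h)]
    return [(1 - zi) * hi + zi * hci for zi, hi, hci in zip(z, h, h_cand)]

# One decoding step: attend over two dummy image regions, then update state.
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
hidden = [0.5, 0.1, 0.0]
context, weights = attend(features, hidden)
hidden = gru_step(context, hidden)
```

In a full decoder this step would repeat per generated word, with the updated hidden state projected onto the vocabulary; the sketch only shows why attention plus a GRU is lighter than LSTM-based alternatives (two gates instead of three, no separate cell state).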