Semantic Enhancements in Image Captioning: Leveraging Neural Networks to Improve BLIP and GPT-2
Abstract
In the dynamic arena of automated image captioning, significant resources, including energy and manpower, are required to train state-of-the-art models. These models, though effective, necessitate frequent and costly retraining to maintain or enhance their performance. Our motivation in this thesis has been to explore alternative methods that improve caption accuracy, addressing the unsustainable need for constant retraining. This study assesses the performance of existing state-of-the-art models such as BLIP and GPT-2 on two key datasets, COCO and FLICKR, evaluating their effectiveness in generating captions and their potential biases across different image types using metrics such as BLEU, METEOR, and ROUGE. Our primary goal in this thesis was to develop innovative approaches that produce captions more akin to human-generated text, aiming to surpass existing models in quality and efficiency without the need for retraining. We introduced a technique called ‘Weighted Summarization,’ which combines artificial neural networks with strategic refinements to leverage the strengths of pre-trained models and set a new benchmark in automated image captioning. Our approach achieved scores on the COCO dataset (BLEU: 0.322, METEOR: 0.328, ROUGE-1 f: 0.452, ROUGE-2 f: 0.187, ROUGE-L f: 0.415) and on the FLICKR dataset (BLEU: 0.181, METEOR: 0.300, ROUGE-1 f: 0.348, ROUGE-2 f: 0.107, ROUGE-L f: 0.311), demonstrating enhanced performance over existing models and improved caption quality.
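As a point of reference for the metrics reported above, the following minimal sketch shows how a single generated caption might be scored against a human reference with BLEU, METEOR, and ROUGE. It is not the thesis evaluation pipeline; it assumes the nltk and rouge-score packages are installed (with nltk's wordnet data available for METEOR), and the two caption strings are purely hypothetical.

    # Minimal sketch of caption scoring with BLEU, METEOR, and ROUGE.
    # Assumes: nltk and rouge-score installed; captions below are hypothetical.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from nltk.translate.meteor_score import meteor_score
    from rouge_score import rouge_scorer

    reference = "a man riding a wave on top of a surfboard"    # human caption
    candidate = "a surfer riding a large wave on a surfboard"  # generated caption

    ref_tokens, cand_tokens = reference.split(), candidate.split()

    # BLEU with smoothing, since short captions often lack higher-order n-gram matches.
    bleu = sentence_bleu([ref_tokens], cand_tokens,
                         smoothing_function=SmoothingFunction().method1)

    # METEOR expects pre-tokenized reference and candidate lists in recent nltk versions.
    meteor = meteor_score([ref_tokens], cand_tokens)

    # ROUGE-1, ROUGE-2, and ROUGE-L F-scores, matching the metrics reported above.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, candidate)

    print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}")
    print(f"ROUGE-1 f: {rouge['rouge1'].fmeasure:.3f}  "
          f"ROUGE-2 f: {rouge['rouge2'].fmeasure:.3f}  "
          f"ROUGE-L f: {rouge['rougeL'].fmeasure:.3f}")

In practice such per-caption scores would be averaged over the full COCO or FLICKR test split to obtain dataset-level figures like those quoted above.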