Automated Medical Image Captioning Using Vision Transformer and Generative Pre-trained Transformer-2

Abstract
This paper proposes a deep learning (DL)-based framework for medical image captioning (IC) using a transformer-based model. The system integrates two main components: a Vision Transformer (VT) encoder that processes visual data and extracts features from images, and a decoder based on Generative Pre-trained Transformer 2 (GPT-2) that transforms these features into coherent, contextually meaningful natural language descriptions. Together, these components enable the automatic generation of textual reports corresponding to medical images. To ensure output quality and relevance, training is conducted on a carefully curated dataset of paired medical images and clinical descriptions. The model’s performance was evaluated on the IU X-Ray dataset, a publicly available and widely used benchmark in the medical imaging community. Automatically generating preliminary diagnostic reports can reduce labor costs, increase workflow efficiency, shorten reporting times, and improve consistency in medical records. Unlike conventional CNN–RNN designs, the transformer-based approach captures long-range dependencies in both the image and text domains, and the resulting context-aware, semantically rich captions improve diagnostic interpretability. We present an end-to-end medical image captioning model that is fully transformer-based and eliminates recurrent components such as LSTMs. The paper’s main goal is to simplify diagnostic reporting for radiologists and other medical practitioners. On standard evaluation metrics including BLEU, METEOR, and ROUGE, the model achieved competitive results (98.42% accuracy), suggesting significant potential for practical use.
Keywords: CNN, GPT-2, Medical Imaging, NLP, VT, X-Ray Dataset.

Author(s): Moloy Dhar, Mrinmoy Sen, Bidesh Chakraborty, Suparna Biswas*, Shubhajit Chatterjee
Volume: 7 Issue: 2 Pages: 793-811
DOI: https://doi.org/10.47857/irjms.2026.v07i02.08290