Chest X-rays are among the most widely used diagnostic imaging methods worldwide. However, interpreting
these images remains a complex and time-consuming task. This paper addresses the challenge
of automated chest X-ray report generation by leveraging recent advances in vision-language
models (VLMs). We propose a framework that integrates prompt-guided supervision and bias
mitigation techniques into a fine-tuned VLM (BLIP) to enhance both the accuracy and trustworthiness of generated medical reports. We investigate methods to enhance text coherence and mitigate shortcut bias in VLMs using pathological prompts, without relying on architecture modifications or multi-objective training, two directions that have been largely overlooked in the existing literature. We use pathology labels as natural-language prompts to guide the model, and we strengthen its robustness through curriculum learning that introduces controlled label noise during training. To address shortcut learning, where spurious visual-textual correlations (e.g., support devices) can mislead the model, we introduce a multi-modal bias mitigation strategy that combines visual artifact removal using a generative diffusion model with corresponding text modification via a large language model, encouraging more causally grounded representations.
We train and evaluate our approach on the newly released CheXpert Plus dataset, demonstrating improvements in report quality and robustness, with up to a 63% increase across evaluation metrics on the test set. Furthermore, our multi-modal shortcut bias mitigation method improves the clinical coherence of the generated reports,
while shifting the model's focus towards more relevant regions of the image. Our findings contribute to the development of safer and more trustworthy AI systems in radiology, offering a scalable strategy for enhancing vision-language models.
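To make the prompt-guided curriculum concrete, the following is a minimal sketch of how pathology labels could be turned into natural-language prompts with a growing amount of controlled label noise. The pathology list, the linear noise schedule, the `max_noise` ceiling, and the prompt template are illustrative assumptions, not the paper's actual configuration.

```python
import random

# Assumed subset of CheXpert pathology labels, for illustration only.
PATHOLOGIES = ["Cardiomegaly", "Edema", "Pleural Effusion", "Support Devices"]

def noisy_prompt(true_labels, epoch, max_epochs, max_noise=0.3, rng=random):
    """Build a natural-language pathology prompt, flipping each label with a
    probability that grows over training (an assumed linear curriculum)."""
    noise_rate = max_noise * epoch / max_epochs  # ramps from 0 to max_noise
    present_labels = []
    for pathology in PATHOLOGIES:
        present = pathology in true_labels
        if rng.random() < noise_rate:  # inject controlled label noise
            present = not present
        if present:
            present_labels.append(pathology)
    if present_labels:
        return "Findings suggest: " + ", ".join(present_labels)
    return "Findings suggest: no acute findings"

# Example: prompts start clean and become noisier as training progresses.
random.seed(0)
print(noisy_prompt({"Edema"}, epoch=0, max_epochs=10))  # always clean
print(noisy_prompt({"Edema"}, epoch=9, max_epochs=10))  # may flip labels
```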
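Similarly, the multi-modal bias mitigation step could look roughly like the sketch below, pairing diffusion-based inpainting of support-device regions with an LLM instruction that removes the matching sentences from the report so image and text stay consistent. The inpainting checkpoint, the mask source (e.g., a separate device segmentation model), and the `LLM_INSTRUCTION` template are assumptions for illustration, not the paper's actual components.

```python
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Assumed off-the-shelf inpainting checkpoint; the paper does not name
# the specific generative diffusion model used.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"
)

def remove_device(image: Image.Image, device_mask: Image.Image) -> Image.Image:
    """Inpaint the masked support-device region with plausible anatomy.
    The mask is assumed to come from an upstream segmentation step."""
    return pipe(
        prompt="chest X-ray, clear lung fields, no support devices",
        image=image,
        mask_image=device_mask,
    ).images[0]

# Hypothetical text-side counterpart: an instruction for any chat LLM that
# strips device mentions from the paired report and changes nothing else.
LLM_INSTRUCTION = (
    "Rewrite this radiology report, removing every sentence that mentions "
    "support devices (lines, tubes, pacemakers) and changing nothing else:\n"
    "{report}"
)
```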