Chest X-rays are among the most widely used diagnostic imaging methods worldwide. However, interpreting
these images remains a complex and time-consuming task. This paper addresses the challenge
of automated chest X-ray report generation by leveraging recent advances in vision-language
models (VLMs). We propose a framework that integrates prompt-guided supervision and bias
mitigation techniques into a fine-tuned VLM (BLIP) to enhance both the accuracy and trustworthiness of generated medical reports. We investigate methods to enhance text coherence and mitigate shortcut bias in VLMs using pathological prompts, without relying on architecture modifications or multi-objective training, two directions that have been largely overlooked in the existing literature. We use pathology labels as natural-language prompts to guide the model, and we strengthen its robustness through curriculum learning that introduces controlled label noise during training. To address shortcut learning, where spurious visual-textual correlations (e.g., support devices) can mislead the model, we introduce a multi-modal bias mitigation strategy that combines visual artifact removal using a generative diffusion model with corresponding text modification via a large language model, encouraging more causally grounded representations.
We train and evaluate our approach on the newly released CheXpert Plus dataset, demonstrating improvements in report quality and robustness, with up to a 63% increase across evaluation metrics on the test set. Furthermore, our multi-modal shortcut bias mitigation method improves the clinical coherence of the generated reports,
while shifting the model's focus towards more relevant regions of the image. Our findings contribute to the development of safer and more trustworthy AI systems in radiology, offering a scalable strategy for enhancing vision-language models.
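To make the prompt-guided curriculum concrete, the following is a minimal sketch of how pathology labels could be turned into natural-language prompts with a growing amount of controlled label noise. The pathology list, the linear noise schedule, the `max_noise` ceiling, and the prompt template are illustrative assumptions, not the paper's actual configuration.

```python
import random

# Assumed subset of CheXpert pathology labels, for illustration only.
PATHOLOGIES = ["Cardiomegaly", "Edema", "Pleural Effusion", "Support Devices"]

def noisy_prompt(true_labels, epoch, max_epochs, max_noise=0.3, rng=random):
    """Build a natural-language pathology prompt, flipping each label with a
    probability that grows over training (an assumed linear curriculum)."""
    noise_rate = max_noise * epoch / max_epochs  # ramps from 0 to max_noise
    present_labels = []
    for pathology in PATHOLOGIES:
        present = pathology in true_labels
        if rng.random() < noise_rate:  # inject controlled label noise
            present = not present
        if present:
            present_labels.append(pathology)
    if present_labels:
        return "Findings suggest: " + ", ".join(present_labels)
    return "Findings suggest: no acute findings"

# Example: prompts start clean and become noisier as training progresses.
random.seed(0)
print(noisy_prompt({"Edema"}, epoch=0, max_epochs=10))  # always clean
print(noisy_prompt({"Edema"}, epoch=9, max_epochs=10))  # may flip labels
```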
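Similarly, the multi-modal bias mitigation step could look roughly like the sketch below, pairing diffusion-based inpainting of support-device regions with an LLM instruction that removes the matching sentences from the report so image and text stay consistent. The inpainting checkpoint, the mask source (e.g., a separate device segmentation model), and the `LLM_INSTRUCTION` template are assumptions for illustration, not the paper's actual components.

```python
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Assumed off-the-shelf inpainting checkpoint; the paper does not name
# the specific generative diffusion model used.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"
)

def remove_device(image: Image.Image, device_mask: Image.Image) -> Image.Image:
    """Inpaint the masked support-device region with plausible anatomy.
    The mask is assumed to come from an upstream segmentation step."""
    return pipe(
        prompt="chest X-ray, clear lung fields, no support devices",
        image=image,
        mask_image=device_mask,
    ).images[0]

# Hypothetical text-side counterpart: an instruction for any chat LLM that
# strips device mentions from the paired report and changes nothing else.
LLM_INSTRUCTION = (
    "Rewrite this radiology report, removing every sentence that mentions "
    "support devices (lines, tubes, pacemakers) and changing nothing else:\n"
    "{report}"
)
```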