PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis

Authors

  • K Lokesh, Indian Institute of Technology Jodhpur
  • Abhirama Subramanyam Penamakuri, Indian Institute of Technology Jodhpur
  • Uday Agarwal, Indian Institute of Technology Jodhpur
  • Apoorva Challa, All India Institute of Medical Sciences, New Delhi
  • Shreya K Gowda, All India Institute of Medical Sciences, New Delhi
  • Somesh Gupta, All India Institute of Medical Sciences, New Delhi
  • Anand Mishra, Indian Institute of Technology Jodhpur

DOI:

https://doi.org/10.1609/aaai.v40i9.37688

Abstract

Traditionally, AI research in medical diagnosis has largely centered on image analysis. While this has led to notable advancements, the absence of patient-reported symptoms continues to hinder diagnostic accuracy. To address this, we propose a Pre-Consultation Dialogue Framework (PCDF) that mimics real-world diagnostic procedures, where doctors iteratively query patients before reaching a conclusion. Specifically, we simulate diagnostic dialogues between two vision–language models (VLMs): a DocVLM, which generates follow-up questions based on the image and dialogue history, and a PatientVLM, which responds using a symptom profile derived from the ground-truth diagnosis. We additionally conducted a small-scale clinical validation of the synthetic symptoms generated by our framework, with licensed clinicians confirming their clinical relevance, symptom coverage, and overall realism. These findings indicate that the resulting DocVLM–PatientVLM interactions form coherent, multi-turn consultations paired with images and diagnoses, which we then use to fine-tune the DocVLM. This dialogue-based supervision leads to substantial gains over image-only training, highlighting the value of realistic symptom elicitation for diagnosis.
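
The dialogue simulation described in the abstract can be pictured as a simple loop. The following is a minimal sketch, not the authors' implementation: the names `simulate_consultation`, `Consultation`, `toy_doc`, and `toy_patient` are hypothetical, the two models are replaced by toy stand-in functions, and the fixed turn count is an assumption. Whether the PatientVLM also receives the image is not stated here, so the stand-in patient sees only the symptom profile and the question.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (doctor question, patient answer)


@dataclass
class Consultation:
    """One simulated consultation paired with its image and diagnosis."""
    image_id: str
    diagnosis: str
    turns: List[Turn] = field(default_factory=list)


def simulate_consultation(
    image_id: str,
    diagnosis: str,
    symptom_profile: List[str],
    ask: Callable[[str, List[Turn]], str],
    answer: Callable[[List[str], str], str],
    max_turns: int = 3,  # assumption: a fixed number of turns
) -> Consultation:
    """Alternate doctor questions and patient answers for max_turns rounds."""
    record = Consultation(image_id=image_id, diagnosis=diagnosis)
    for _ in range(max_turns):
        # The doctor side conditions on the image and the dialogue so far.
        question = ask(image_id, record.turns)
        # The patient side conditions on the symptom profile.
        reply = answer(symptom_profile, question)
        record.turns.append((question, reply))
    return record


# Toy stand-ins for the two models (hypothetical, for illustration only).
QUESTIONS = ["Is the area itchy?", "Is there any pain?", "Have you had a fever?"]


def toy_doc(image_id: str, history: List[Turn]) -> str:
    # Pick the next canned question based on how many turns have passed.
    return QUESTIONS[len(history) % len(QUESTIONS)]


def toy_patient(symptom_profile: List[str], question: str) -> str:
    # Answer "Yes." only if a profiled symptom is mentioned in the question.
    q = question.lower()
    return "Yes." if any(s in q for s in symptom_profile) else "No."


record = simulate_consultation(
    image_id="img_001",
    diagnosis="example_condition",  # placeholder label, not from the paper
    symptom_profile=["itchy", "fever"],
    ask=toy_doc,
    answer=toy_patient,
)
for question, reply in record.turns:
    print(f"Doc: {question}\nPatient: {reply}")
```

Each resulting `Consultation` bundles the image reference, the dialogue turns, and the diagnosis, which is the kind of record the abstract describes using to fine-tune the DocVLM.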

Published

2026-03-14

How to Cite

Lokesh, K., Penamakuri, A. S., Agarwal, U., Challa, A., Gowda, S. K., Gupta, S., & Mishra, A. (2026). PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis. Proceedings of the AAAI Conference on Artificial Intelligence, 40(9), 7485–7493. https://doi.org/10.1609/aaai.v40i9.37688

Section

AAAI Technical Track on Computer Vision VI