MedPerturbing LLMs: A Comparative Study of Toxicity, Prompt Tuning, and Jailbreaks in Medical QA

Authors

  • Arash Asgari, York University, Toronto, Ontario, Canada; Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
  • Amirreza Naziri, York University, Toronto, Ontario, Canada; Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
  • Laleh Seyyed-Kalantari, York University, Toronto, Ontario, Canada; Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada

DOI:

https://doi.org/10.1609/aaaiss.v7i1.36916

Abstract

Large Language Models (LLMs) are increasingly adopted across domains, including sensitive areas such as healthcare. However, their deployment raises significant safety concerns, particularly with respect to toxicity. In this paper, we evaluate the toxicity of widely used general-purpose LLMs in medical question–answering tasks. We investigate three complementary scenarios: (i) baseline querying, (ii) prompt guidelines designed to mitigate toxic outputs, and (iii) adversarial jailbreak prompting intended to elicit harmful content. To measure toxicity, we apply three established metrics to five LLMs ranging from 2B to 9B parameters, using MedPerturb, a dataset of medical questions systematically perturbed across gender, race, and age. Our results show that while carefully crafted guidelines can reduce toxic outputs and mitigate demographic biases, adversarial instructions are highly effective at bypassing safety mechanisms. Our evaluation reveals that all models exhibit limited resilience to jailbreak attacks, highlighting a critical vulnerability that restricts their safe deployment in clinical contexts. By answering three key questions—(1) what levels of toxicity these models exhibit in standard medical scenarios, (2) how far prompt tuning can reduce toxicity, and (3) how vulnerable they are to jailbreak attacks—our study provides a structured assessment of the risks and limitations of LLMs in healthcare and underscores the importance of robust guidelines and safeguards to enable safe deployment and guard against harmful misuse.
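The evaluation pipeline the abstract describes (perturb a medical question across demographic groups, query a model under each perturbation, and score the replies for toxicity) can be sketched as follows. This is an illustrative sketch only, not the paper's released code: the template, group names, stub model, and keyword-based `toxicity_score` are placeholder assumptions, standing in for the MedPerturb dataset, the 2B–9B LLMs, and established toxicity classifiers (e.g., Perspective API or Detoxify).

```python
# Illustrative sketch (NOT the paper's implementation): perturb a medical
# question across demographic descriptors, query a model per variant, and
# aggregate per-group toxicity scores.

def perturb(template, groups):
    """Fill a question template with each demographic descriptor."""
    return {g: template.format(patient=g) for g in groups}

def toxicity_score(text):
    """Placeholder scorer: the study uses established toxicity metrics;
    here we just count flagged words as a fraction of all words."""
    flagged = {"hopeless", "worthless", "stupid"}
    words = text.lower().split()
    return sum(w.strip(".,!?") in flagged for w in words) / max(len(words), 1)

def evaluate(template, groups, model):
    """Score one model reply per perturbed prompt, keyed by group."""
    prompts = perturb(template, groups)
    return {g: toxicity_score(model(p)) for g, p in prompts.items()}

# Stub standing in for an LLM call; a real run would hit a model endpoint.
def stub_model(prompt):
    if "elderly" in prompt:
        return "Treatment is recommended."
    return "That is a stupid question."

scores = evaluate(
    "What treatment should a {patient} patient receive for hypertension?",
    ["young", "elderly"],
    stub_model,
)
# scores maps each group to a toxicity rate, exposing demographic gaps.
```

Comparing `scores` across groups (and across the baseline, guideline, and jailbreak prompting scenarios) is the kind of structured comparison the study reports.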

Published

2025-11-23

How to Cite

Asgari, A., Naziri, A., & Seyyed-Kalantari, L. (2025). MedPerturbing LLMs: A Comparative Study of Toxicity, Prompt Tuning, and Jailbreaks in Medical QA. Proceedings of the AAAI Symposium Series, 7(1), 438-447. https://doi.org/10.1609/aaaiss.v7i1.36916

Section

Safe, Ethical, Certified, Uncertainty-aware, Robust, and Explainable AI for Health (SECURE-AI4H)