What's in a Prompt?: A Large-Scale Experiment to Assess the Impact of Prompt Design on the Compliance and Accuracy of LLM-Generated Text Annotations

Authors

  • Shubham Atreja, University of Michigan School of Information
  • Joshua Ashkinaze, University of Michigan School of Information
  • Lingyao Li, University of Michigan School of Information
  • Julia Mendelsohn, University of Michigan School of Information
  • Libby Hemphill, University of Michigan School of Information

DOI:

https://doi.org/10.1609/icwsm.v19i1.35807

Abstract

Manually annotating data for computational social science tasks can be costly, time-consuming, and emotionally draining. While recent work suggests that LLMs can perform such annotation tasks in zero-shot settings, little is known about how prompt design impacts LLMs' compliance and accuracy. We conduct a large-scale multi-prompt experiment to test how model selection (GPT-4o, GPT-3.5, PaLM2, and Falcon7b) and prompt design features (definition inclusion, output type, explanation, and prompt length) impact the compliance and accuracy of LLM-generated annotations on four highly relevant and diverse CSS tasks (toxicity, sentiment, rumor stance, and news frames). Our results show that LLM compliance and accuracy are prompt-dependent. For instance, prompting for numerical scores instead of labels reduces all LLMs' compliance and accuracy. Concise prompts can significantly reduce prompting costs but also lead to lower accuracy on tasks like toxicity. Furthermore, minor prompt changes like asking for an explanation can cause large changes in the distribution of LLM-generated labels. By assessing the impact of prompt design on the quality and distribution of LLM-generated annotations, this work serves as both a practical guide and a warning for using LLMs in CSS research.
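
The multi-prompt design described in the abstract can be pictured as a grid of models, tasks, and prompt features. The sketch below is illustrative only: the model, task, and feature names come from the abstract, but the two levels assigned to each feature and the helper names (`prompt_variants`, `experiment_conditions`) are assumptions made here, not the authors' specification.

```python
from itertools import product

# Models and tasks as named in the abstract.
MODELS = ["GPT-4o", "GPT-3.5", "PaLM2", "Falcon7b"]
TASKS = ["toxicity", "sentiment", "rumor stance", "news frames"]

# ASSUMPTION: each prompt feature is treated as binary. The abstract names
# the four features but does not list their levels.
FEATURES = {
    "definition": [True, False],        # include a task definition or not
    "output_type": ["label", "score"],  # categorical label vs. numerical score
    "explanation": [True, False],       # ask for an explanation or not
    "length": ["concise", "detailed"],  # short vs. longer prompt
}


def prompt_variants(features=FEATURES):
    """Return every combination of feature levels as a list of dicts."""
    names = list(features)
    return [dict(zip(names, combo)) for combo in product(*features.values())]


def experiment_conditions():
    """Cross every model and task with every prompt variant."""
    return [
        {"model": model, "task": task, **variant}
        for model, task, variant in product(MODELS, TASKS, prompt_variants())
    ]


if __name__ == "__main__":
    print(len(prompt_variants()), "prompt variants under the assumed levels")
    print(len(experiment_conditions()), "model x task x prompt conditions")
```

Under these assumed binary levels the grid has 16 prompt variants and 256 conditions; the actual counts in the paper may differ if a feature has more levels or if some combinations were not run.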

Published

2025-06-07

How to Cite

Atreja, S., Ashkinaze, J., Li, L., Mendelsohn, J., & Hemphill, L. (2025). What’s in a Prompt?: A Large-Scale Experiment to Assess the Impact of Prompt Design on the Compliance and Accuracy of LLM-Generated Text Annotations. Proceedings of the International AAAI Conference on Web and Social Media, 19(1), 122–145. https://doi.org/10.1609/icwsm.v19i1.35807