UniAPO: Unified Multimodal Automated Prompt Optimization

Authors

  • Qipeng Zhu (ByteDance Inc.; Shanghai Key Laboratory of Intelligent Information Processing, College of Computer Science and Artificial Intelligence, Fudan University)
  • Yanzhe Chen (ByteDance Inc.; School of Computer, National University of Singapore)
  • Huasong Zhong (ByteDance Inc.)
  • Jie Chen (College of Computer and Data Science, Fuzhou University)
  • Yan Li (ByteDance Inc.)
  • Zhixin Zhang (ByteDance Inc.)
  • Junping Zhang (Shanghai Key Laboratory of Intelligent Information Processing, College of Computer Science and Artificial Intelligence, Fudan University)
  • Zhenheng Yang (ByteDance Inc.)

DOI:

https://doi.org/10.1609/aaai.v40i34.40151

Abstract

Prompting is fundamental to unlocking the full potential of large language models. To automate and enhance this process, automatic prompt optimization (APO) has been developed, demonstrating effectiveness primarily in text-only input scenarios. However, extending existing APO methods to multimodal tasks—such as video-language generation—introduces two core challenges: (i) visual token inflation, where long visual-token sequences restrict context capacity and result in insufficient feedback signals; (ii) a lack of process-level supervision, as existing methods focus on outcome-level supervision and overlook intermediate supervision, limiting prompt optimization. We present UniAPO: Unified Multimodal Automated Prompt Optimization, the first framework tailored for multimodal APO. UniAPO adopts an EM-inspired optimization process that decouples feedback modeling and prompt refinement, making the optimization more stable and goal-driven. To further address the aforementioned challenges, we introduce a short-long term memory mechanism: historical feedback mitigates context limitations, while historical prompts provide directional guidance for effective prompt optimization. UniAPO achieves consistent gains across text, image, and video benchmarks, establishing a unified framework for efficient and transferable prompt optimization.
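
The alternating scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no algorithmic detail, so the names `Memory`, `optimize`, `run_model`, `critique`, and `refine` are hypothetical, and the toy callables merely stand in for real model calls. The sketch only shows the general shape of a loop that separates a feedback step from a refinement step and keeps bounded histories of feedback and prompts.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input, expected output)


@dataclass
class Memory:
    """Hypothetical memory: bounded feedback history plus scored prompt history."""
    feedback: List[str] = field(default_factory=list)
    prompts: List[Tuple[str, float]] = field(default_factory=list)
    max_feedback: int = 8  # assumed cap so the feedback context stays short

    def add_feedback(self, items: List[str]) -> None:
        # Keep only the most recent entries.
        self.feedback = (self.feedback + items)[-self.max_feedback:]


def optimize(
    prompt: str,
    data: List[Example],
    run_model: Callable[[str, str], str],
    critique: Callable[[str, str, str, str], str],
    refine: Callable[[str, List[str], List[Tuple[str, float]]], str],
    rounds: int = 3,
) -> Tuple[str, float, Memory]:
    """Alternate a feedback step and a refinement step (assumed structure)."""
    memory = Memory()

    def score(p: str) -> float:
        return sum(run_model(p, x) == y for x, y in data) / len(data)

    best, best_score = prompt, score(prompt)
    memory.prompts.append((best, best_score))

    for _ in range(rounds):
        # Feedback step: describe each failure of the current best prompt.
        outputs = [(x, y, run_model(best, x)) for x, y in data]
        failures = [(x, y, out) for x, y, out in outputs if out != y]
        if not failures:
            break
        memory.add_feedback([critique(best, x, y, out) for x, y, out in failures])

        # Refinement step: propose a new prompt from stored feedback and past prompts.
        candidate = refine(best, memory.feedback, memory.prompts)
        candidate_score = score(candidate)
        memory.prompts.append((candidate, candidate_score))
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score

    return best, best_score, memory


# Toy stand-ins for model calls (purely illustrative).
def run_model(prompt: str, x: str) -> str:
    return x.upper() if "uppercase" in prompt else x


def critique(prompt: str, x: str, expected: str, got: str) -> str:
    return f"For input {x!r} the output should be uppercase, got {got!r}."


def refine(prompt: str, feedback: List[str], history: List[Tuple[str, float]]) -> str:
    tried = {p for p, _ in history}
    candidate = prompt + " Answer in uppercase."
    needs_fix = any("uppercase" in f for f in feedback)
    return candidate if needs_fix and candidate not in tried else prompt


if __name__ == "__main__":
    data = [("a", "A"), ("b", "B")]
    print(optimize("Echo the input.", data, run_model, critique, refine))
```

In a real setting the three callables would wrap calls to a multimodal model; keeping them as parameters means the loop itself does not depend on any particular API.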

Published

2026-03-14

How to Cite

Zhu, Q., Chen, Y., Zhong, H., Chen, J., Li, Y., Zhang, Z., … Yang, Z. (2026). UniAPO: Unified Multimodal Automated Prompt Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 29133–29141. https://doi.org/10.1609/aaai.v40i34.40151

Section

AAAI Technical Track on Machine Learning XI