RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data

Chenglong Wang; Yang Gan; Yifu Huo; Yongyu Mu; Murun Yang; Qiaozhi He; Tong Xiao; Chunliang Zhang; Tongran Liu; Jingbo Zhu

doi:10.1609/aaai.v39i24.34721

Authors

Chenglong Wang, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yang Gan, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yifu Huo, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yongyu Mu, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Murun Yang, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Qiaozhi He, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Tong Xiao, School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China
Chunliang Zhang, School of Computer Science and Engineering, Northeastern University, Shenyang, China
Tongran Liu, CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS, Beijing, China
Jingbo Zhu, School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China

Abstract

Large vision-language models (LVLMs) often fail to align with human preferences, leading to issues such as generating misleading content without proper visual context (also known as hallucination). A promising solution is to apply human-preference alignment techniques, such as best-of-n sampling and reinforcement learning. However, these techniques are limited by the scarcity of visual preference data, which is required to train a visual reward model (VRM). In this work, we continue this line of research and present a Robust Visual Reward Model (RoVRM) that improves human-preference alignment for LVLMs. RoVRM leverages auxiliary textual preference data through three-phase progressive training and optimal transport-based preference data selection to mitigate the scarcity of visual preference data. We evaluate RoVRM on commonly used vision-language tasks with the LLaVA-1.5-7B and -13B models. Experimental results demonstrate that RoVRM consistently outperforms traditional VRMs. Furthermore, our three-phase progressive training and preference data selection approaches can yield consistent performance gains over ranking-based alignment techniques, such as direct preference optimization.
