CP-Search: A Chain Progressive Search Training Framework Incentivizing the Cognitive Behaviors for Searching in LLMs

Zehua Wang; Shipeng Li; Buzhou Tang

doi:10.1609/aaai.v40i40.40666

Authors

Zehua Wang: Department of Computer Science, Harbin Institute of Technology (Shenzhen), Shenzhen, China; Guangdong Provincial Key Laboratory of Intelligent Information Processing
Shipeng Li: Department of Computer Science, Harbin Institute of Technology (Shenzhen), Shenzhen, China
Buzhou Tang: Department of Computer Science, Harbin Institute of Technology (Shenzhen), Shenzhen, China; Guangdong Provincial Key Laboratory of Intelligent Information Processing; Peng Cheng Laboratory, Shenzhen, China

DOI:

https://doi.org/10.1609/aaai.v40i40.40666

Abstract

Retrieval-Augmented Generation (RAG) has been demonstrated to effectively mitigate the knowledge recency issue in Large Language Models (LLMs) while significantly reducing hallucinations. However, existing RAG methods exhibit insufficient capability in modeling reasoning paths for complex multi-hop reasoning tasks. While Reinforcement Learning (RL) has demonstrated success in enhancing model reasoning ability, token-level RL frameworks exhibit inherent limitations in maintaining coherent reasoning trajectories. Such frameworks remain susceptible to the compounding accumulation of contextual errors during the retrieval process, which ultimately leads to erroneous outputs. To address this challenge, we propose Chain Progressive Search (CP-Search), a novel two-stage training framework designed to enhance the model's retrieval capability in complex scenarios. This framework models the entire retrieval process as a retrieval-level Markov Decision Process, systematically optimizing the model's retrieval behavior at each step of the chained retrieval. Specifically, CP-Search first constructs a retrieval-cognitive behavioral dataset and employs Supervised Fine-Tuning (SFT) to endow the model with cognitive behaviors for searching. More importantly, by introducing a dense progressive procedural reward in reinforcement learning training, CP-Search significantly improves the model's reasoning consistency and feedback correction ability in chained retrieval. Experiments conducted on multiple multi-hop datasets demonstrate that CP-Search significantly outperforms existing RAG methods in complex multi-hop reasoning tasks.
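
The abstract does not specify how the retrieval-level Markov Decision Process or the dense progressive reward is defined, so the following is only a toy sketch of the general idea, not the authors' method. It assumes that one action is a whole search query (rather than a single token) and that each retrieval step is rewarded by the number of gold facts it newly covers. The names `RetrievalState`, `step`, `progressive_reward`, and `rollout`, the dictionary-based retriever, and the reward rule itself are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalState:
    """Hypothetical MDP state: the question plus every document retrieved so far."""
    question: str
    evidence: list = field(default_factory=list)


def step(state, query, corpus):
    """One transition: the action is a full search query, not a single token."""
    # Assumed toy retriever: exact-key lookup in a dict; empty string on a miss.
    doc = corpus.get(query, "")
    return RetrievalState(state.question, state.evidence + [doc])


def progressive_reward(prev, curr, gold_facts):
    """Assumed dense reward: number of gold facts first covered by this step."""
    def covered(s):
        return {f for f in gold_facts if any(f in d for d in s.evidence)}
    return len(covered(curr) - covered(prev))


def rollout(question, queries, corpus, gold_facts):
    """Run a chain of retrievals and collect one reward per retrieval step."""
    state = RetrievalState(question)
    rewards = []
    for q in queries:
        nxt = step(state, q, corpus)
        rewards.append(progressive_reward(state, nxt, gold_facts))
        state = nxt
    return state, rewards
```

In this sketch an off-topic query in the middle of the chain receives zero reward at its own step, so the feedback is attached to the retrieval that caused it instead of being deferred to a single outcome reward at the end of the trajectory.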
