Improving Long-Context Summarization with Multi-Granularity Retrieval Optimization

Authors

  • Xueyu Chen, Tongji University
  • Kaitao Song, Microsoft Research Asia
  • Zifan Song, Tongji University
  • Dongsheng Li, Microsoft Research Asia
  • Cairong Zhao, Tongji University

DOI:

https://doi.org/10.1609/aaai.v40i36.40283

Abstract

Retrieval-Augmented Generation (RAG) is an effective way to overcome the limitations of Large Language Models (LLMs) in domain-specific knowledge and timely information updates. However, current RAG methods typically answer queries from isolated segments and cannot integrate information across the same document, which undermines performance on real-world tasks requiring coherent understanding of an entire document. Notably, the human brain naturally integrates and summarizes prior knowledge while reading a text, progressively forming a comprehensive understanding. Motivated by this cognitive process, we propose the Hierarchical Two-Stage Summarization-based Information Retrieval (HTSIR) method, which preprocesses the corpus before retrieval, summarizes contiguous texts to obtain integrated information, and constructs a retrieval tree with summaries at varying granularities. The retrieved information is then reranked with respect to the current question and supplied as context to the LLM. In addition, because single-step summarization is often imprecise in query-based summarization tasks, we further apply a Refinement module that lets the LLM reflect on and revise its output to produce the final result. Combining HTSIR with GPT-4o mini, we achieve state-of-the-art results on complex question tasks across four long-text datasets (NarrativeQA, QASPER, QuALITY, and QMSum), including an improvement of about 6 points on the Question Answering (QA) task on QuALITY-HARD.
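The abstract's core idea, building a retrieval tree whose levels hold summaries at increasingly coarse granularity and then retrieving candidates from every level before reranking them against the query, can be illustrated with a minimal sketch. This is not the authors' implementation: the `summarize` and `score` functions below are toy stand-ins for an LLM summarizer and a Reranker, and the tree layout and fan-out are assumptions for demonstration only.

```python
import re


def summarize(texts):
    """Toy stand-in for an LLM summarizer: keep the first sentence of each chunk."""
    return " ".join(t.split(". ")[0].rstrip(".") + "." for t in texts)


def build_summary_tree(chunks, fanout=2):
    """Build levels of increasingly coarse summaries over consecutive chunks.

    levels[0] holds the raw chunks; each later level summarizes `fanout`
    consecutive nodes of the level below, up to a single root summary.
    """
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([summarize(prev[i:i + fanout])
                       for i in range(0, len(prev), fanout)])
    return levels


def score(query, text):
    """Toy stand-in for a Reranker: word overlap between query and candidate."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t)


def retrieve(levels, query, top_k=3):
    """Gather candidates from every granularity, then rerank by the query."""
    candidates = [node for level in levels for node in level]
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:top_k]


chunks = [
    "The storm hit the coast at dawn. Winds reached 120 km/h.",
    "Evacuations began before the storm. Shelters opened inland.",
    "Damage to the coast was severe. Power was lost for days.",
    "Recovery took weeks. Aid arrived from neighboring regions.",
]
levels = build_summary_tree(chunks)
context = retrieve(levels, "How severe was the storm damage to the coast?")
```

Because candidates are drawn from all levels of the tree, the reranked context can mix fine-grained chunks with broader summaries, which is the property the paper relies on for questions that require integrating information across a whole document.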

Published

2026-03-14

How to Cite

Chen, X., Song, K., Song, Z., Li, D., & Zhao, C. (2026). Improving Long-Context Summarization with Multi-Granularity Retrieval Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36), 30315-30323. https://doi.org/10.1609/aaai.v40i36.40283

Section

AAAI Technical Track on Natural Language Processing I