Enhancing LLMs via High-Knowledge Data Selection

Feiyu Duan; Xuemiao Zhang; Sirui Wang; Haoran Que; Yuqi Liu; Wenge Rong; Xunliang Cai

doi:10.1609/aaai.v39i22.34555

Authors

Feiyu Duan: Sino-French Engineer School, Beihang University, Beijing, China; Meituan, Beijing, China
Xuemiao Zhang: Peking University, Beijing, China; Meituan, Beijing, China
Sirui Wang: Department of Automation, Tsinghua University, Beijing, China; Meituan, Beijing, China
Haoran Que: Sino-French Engineer School, Beihang University, Beijing, China
Yuqi Liu: Meituan, Beijing, China
Wenge Rong: School of Computer Science and Engineering, Beihang University, Beijing, China
Xunliang Cai: Meituan, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v39i22.34555

Abstract

The performance of Large Language Models (LLMs) is intrinsically linked to the quality of their training data. Although several studies have proposed methods for high-quality data selection, they do not consider the importance of knowledge richness in text corpora. In this paper, we propose a novel, gradient-free High-Knowledge Scorer (HKS) that selects high-quality data along the dimension of knowledge, alleviating the problem of knowledge scarcity in the pre-training corpus. We construct a comprehensive multi-domain knowledge element pool and introduce knowledge density and coverage as metrics to assess the knowledge content of a text. Based on these metrics, we propose a comprehensive knowledge scorer to select knowledge-intensive data; the scorer can also be used for domain-specific high-knowledge data selection by restricting the knowledge elements to a specific domain. We train models on a high-knowledge bilingual dataset, and experimental results demonstrate that our scorer improves model performance on knowledge-intensive and general comprehension tasks and is effective in enhancing both the generic and domain-specific capabilities of the model.
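
To make the two metrics named in the abstract concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes, purely for illustration, that knowledge density is the share of tokens that match an element of the pool, that coverage is the share of pool elements that appear in the text, and that the two are combined by a weighted average. The function names, the single-word matching, and the `alpha` weight are all hypothetical; the abstract does not give the actual formulas.

```python
import re

def tokenize(text):
    """Lowercase word tokenizer (a simplification; real pools may hold multi-word elements)."""
    return re.findall(r"\w+", text.lower())

def knowledge_density(tokens, pool):
    """Assumed definition: fraction of tokens that are knowledge elements."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in pool)
    return hits / len(tokens)

def knowledge_coverage(tokens, pool):
    """Assumed definition: fraction of distinct pool elements present in the text."""
    if not pool:
        return 0.0
    return len(set(tokens) & pool) / len(pool)

def knowledge_score(text, pool, alpha=0.5):
    """Hypothetical combination: weighted average of density and coverage."""
    tokens = tokenize(text)
    density = knowledge_density(tokens, pool)
    coverage = knowledge_coverage(tokens, pool)
    return alpha * density + (1 - alpha) * coverage

# Toy pool standing in for one domain of a knowledge element pool (illustrative only).
biology_pool = {"enzyme", "protein", "mitochondria", "genome"}

docs = [
    "The enzyme binds the protein and the protein folds",
    "Thanks for reading and see you next week",
]

# Rank documents from highest to lowest score; a selector would keep the top of this list.
ranked = sorted(docs, key=lambda d: knowledge_score(d, biology_pool), reverse=True)
```

Under these assumptions, passing a pool restricted to one domain (as `biology_pool` does here) is what turns the same scorer into a domain-specific selector, which mirrors the restriction described in the abstract.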
