CLNX: Bridging Code and Natural Language for C/C++ Vulnerability-Contributing Commits Identification
DOI:
https://doi.org/10.1609/aaai.v39i23.34689
Abstract
Large Language Models (LLMs) have shown great promise in vulnerability identification. Because C/C++ accounts for half of the open-source software (OSS) vulnerabilities reported over the past decade, and OSS updates mainly occur through commits, enhancing LLMs' ability to identify C/C++ Vulnerability-Contributing Commits (VCCs) is essential. However, current studies primarily focus on further pre-training LLMs on massive code datasets, which is resource-intensive and poses efficiency challenges. In this paper, we enhance the ability of BERT-based LLMs to identify C/C++ VCCs in a lightweight manner. We propose CodeLinguaNexus (CLNX) as a bridge facilitating communication between C/C++ programs and LLMs. Based on commits, CLNX efficiently converts the source code into a more natural representation while preserving key details. Specifically, CLNX first applies Structure-level Naturalization to decompose complex programs, followed by Token-level Naturalization to interpret complex symbols. We evaluate CLNX on public datasets of 25,872 C/C++ functions with their commits. The results demonstrate that CLNX substantially improves the ability of LLMs to detect C/C++ VCCs. Moreover, CLNX-equipped CodeBERT achieves new state-of-the-art performance and identifies 38 OSS vulnerabilities in the real world.
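As a rough illustration of the two-stage idea described in the abstract, the following is a toy sketch and not the authors' implementation. The abstract does not specify how CLNX decomposes programs or which symbols it rewrites, so the statement splitter, the SYMBOL_WORDS table, and all function names here are hypothetical assumptions chosen only to make the concept concrete.

```python
import re

# Hypothetical symbol-to-word table (an assumption for illustration);
# the abstract does not list the mapping CLNX actually uses.
SYMBOL_WORDS = {
    "->": "points to member",
    "!=": "is not equal to",
    "==": "is equal to",
    "&&": "and",
    "||": "or",
    "++": "increment",
}

# Try longer symbols first so multi-character operators are matched whole.
_SYMBOL_RE = re.compile(
    "|".join(re.escape(s) for s in sorted(SYMBOL_WORDS, key=len, reverse=True))
)


def split_statements(code: str) -> list[str]:
    """Toy stand-in for a structure-level step: one statement per entry."""
    parts = re.split(r"[;{}]", code)
    return [p.strip() for p in parts if p.strip()]


def naturalize_tokens(statement: str) -> str:
    """Toy stand-in for a token-level step: replace operators with words."""
    replaced = _SYMBOL_RE.sub(lambda m: f" {SYMBOL_WORDS[m.group(0)]} ", statement)
    return " ".join(replaced.split())  # collapse repeated whitespace


def naturalize(code: str) -> list[str]:
    """Apply the structure-level step, then the token-level step."""
    return [naturalize_tokens(s) for s in split_statements(code)]


if __name__ == "__main__":
    for line in naturalize("if (p != NULL && p->len == 0) { p->len++; }"):
        print(line)
```

In a setup like the one the abstract describes, text produced this way would be passed to a BERT-based model in place of the raw source; how CLNX actually performs each step is detailed in the paper itself.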
Published
2025-04-11
How to Cite
Qin, Z., Wu, Y., & Han, L. (2025). CLNX: Bridging Code and Natural Language for C/C++ Vulnerability-Contributing Commits Identification. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23), 25047–25055. https://doi.org/10.1609/aaai.v39i23.34689
Section
AAAI Technical Track on Natural Language Processing II