CLNX: Bridging Code and Natural Language for C/C++ Vulnerability-Contributing Commits Identification

Zeqing Qin; Yiwei Wu; Lansheng Han

doi:10.1609/aaai.v39i23.34689

Authors

Zeqing Qin: School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China; Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security
Yiwei Wu: School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
Lansheng Han: School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China; Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security; Wuhan JinYinHu Laboratory

DOI:

https://doi.org/10.1609/aaai.v39i23.34689

Abstract

Large Language Models (LLMs) have shown great promise in vulnerability identification. Because C/C++ accounts for half of the open-source software (OSS) vulnerabilities of the past decade, and OSS is updated mainly through commits, enhancing LLMs' ability to identify C/C++ Vulnerability-Contributing Commits (VCCs) is essential. However, current studies primarily focus on further pre-training LLMs on massive code datasets, which is resource-intensive and poses efficiency challenges. In this paper, we enhance the ability of BERT-based LLMs to identify C/C++ VCCs in a lightweight manner. We propose CodeLinguaNexus (CLNX) as a bridge that facilitates communication between C/C++ programs and LLMs. Working from commits, CLNX efficiently converts source code into a more natural representation while preserving key details. Specifically, CLNX first applies Structure-level Naturalization to decompose complex programs, then applies Token-level Naturalization to interpret complex symbols. We evaluate CLNX on public datasets of 25,872 C/C++ functions with their commits. The results demonstrate that CLNX substantially improves the ability of LLMs to detect C/C++ VCCs. Moreover, CLNX-equipped CodeBERT achieves new state-of-the-art performance and identifies 38 OSS vulnerabilities in the real world.
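
To make the idea of Token-level Naturalization concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify which symbols are rewritten or what wording is used, so the `SYMBOL_WORDS` table, the tokenizer regex, and the function name `naturalize_tokens` are all assumptions introduced here for illustration.

```python
import re

# Hypothetical symbol-to-phrase table. The abstract does not list the real
# mapping; these entries are assumptions chosen only to show the idea.
SYMBOL_WORDS = {
    "->": "points to field",
    "&&": "and",
    "||": "or",
    "!=": "is not equal to",
    "==": "is equal to",
    ">=": "is at least",
    "<=": "is at most",
}

# Multi-character operators are listed first so they win over single characters.
TOKEN_RE = re.compile(r"->|&&|\|\||[!=<>]=|[A-Za-z_]\w*|\d+|\S")


def naturalize_tokens(code: str) -> str:
    """Replace selected C/C++ operators with plain-English phrases."""
    tokens = TOKEN_RE.findall(code)
    # Tokens without a table entry (identifiers, parentheses, ...) are kept as-is.
    return " ".join(SYMBOL_WORDS.get(tok, tok) for tok in tokens)


if __name__ == "__main__":
    print(naturalize_tokens("if (p != NULL && p->len >= n)"))
    # prints: if ( p is not equal to NULL and p points to field len is at least n )
```

A rewrite of this kind keeps identifiers and control keywords intact, so the text handed to a BERT-based model stays close to the original code while reading more like ordinary language.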
