CLNX: Bridging Code and Natural Language for C/C++ Vulnerability-Contributing Commits Identification

Authors

  • Zeqing Qin, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China; Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security
  • Yiwei Wu, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
  • Lansheng Han, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China; Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security; Wuhan JinYinHu Laboratory

DOI:

https://doi.org/10.1609/aaai.v39i23.34689

Abstract

Large Language Models (LLMs) have shown great promise in vulnerability identification. Because C/C++ code accounts for half of the open-source software (OSS) vulnerabilities reported over the past decade, and OSS is updated mainly through commits, enhancing LLMs' ability to identify C/C++ Vulnerability-Contributing Commits (VCCs) is essential. However, current studies primarily focus on further pre-training LLMs on massive code datasets, which is resource-intensive and poses efficiency challenges. In this paper, we enhance the ability of BERT-based LLMs to identify C/C++ VCCs in a lightweight manner. We propose CodeLinguaNexus (CLNX) as a bridge that facilitates communication between C/C++ programs and LLMs. Working from commits, CLNX efficiently converts the source code into a more natural representation while preserving key details. Specifically, CLNX first applies Structure-level Naturalization to decompose complex programs, then applies Token-level Naturalization to interpret complex symbols. We evaluate CLNX on public datasets of 25,872 C/C++ functions with their commits. The results demonstrate that CLNX substantially improves the ability of LLMs to detect C/C++ VCCs. Moreover, CLNX-equipped CodeBERT achieves new state-of-the-art performance and identifies 38 OSS vulnerabilities in the real world.
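
To make the idea of Token-level Naturalization concrete, the following is a minimal Python sketch of what "interpreting complex symbols" could look like. It is an illustration only, not the authors' implementation: the `SYMBOL_WORDS` table, the `naturalize_tokens` function, and the chosen English phrases are all hypothetical, since the abstract does not specify the actual mapping. Structure-level Naturalization is not covered here.

```python
import re

# Hypothetical symbol-to-phrase table (an assumption for illustration;
# the abstract does not list the mapping CLNX actually uses).
SYMBOL_WORDS = {
    "->": "points to",
    "!=": "is not equal to",
    "==": "is equal to",
    ">=": "is at least",
    "<=": "is at most",
    "&&": "and",
    "||": "or",
    "++": "increment",
    "--": "decrement",
}

# Try the known symbols first (longest first), then identifiers,
# then integer literals, then any other single non-space character.
_SYMBOL_PATTERN = "|".join(
    re.escape(s) for s in sorted(SYMBOL_WORDS, key=len, reverse=True)
)
_TOKEN_RE = re.compile(rf"{_SYMBOL_PATTERN}|[A-Za-z_]\w*|\d+|\S")


def naturalize_tokens(code: str) -> str:
    """Replace known C/C++ symbols with English phrases; keep other tokens."""
    tokens = _TOKEN_RE.findall(code)
    return " ".join(SYMBOL_WORDS.get(tok, tok) for tok in tokens)


if __name__ == "__main__":
    print(naturalize_tokens("if (p != NULL && p->len >= n)"))
    # prints: if ( p is not equal to NULL and p points to len is at least n )
```

The sketch keeps identifiers and literals unchanged, so details such as variable names survive the rewrite, while operators that a natural-language model may rarely have seen are spelled out in words.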

Published

2025-04-11

How to Cite

Qin, Z., Wu, Y., & Han, L. (2025). CLNX: Bridging Code and Natural Language for C/C++ Vulnerability-Contributing Commits Identification. Proceedings of the AAAI Conference on Artificial Intelligence, 39(23), 25047–25055. https://doi.org/10.1609/aaai.v39i23.34689

Issue

Vol. 39 No. 23 (2025)

Section

AAAI Technical Track on Natural Language Processing II