Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting

Authors

  • Tong Ye, Zhejiang University
  • Yangkai Du, Zhejiang University
  • Tengfei Ma, State University of New York at Stony Brook
  • Lingfei Wu, Anytime AI
  • Xuhong Zhang, Zhejiang University
  • Shouling Ji, Zhejiang University
  • Wenhai Wang, Zhejiang University

DOI:

https://doi.org/10.1609/aaai.v39i1.32082

Abstract

Large Language Models (LLMs) have demonstrated remarkable proficiency in generating code. However, the misuse of LLM-generated (synthetic) code has raised concerns in both educational and industrial contexts, underscoring the urgent need for synthetic code detectors. Existing methods for detecting synthetic content are primarily designed for general text and struggle with code due to the unique grammatical structure of programming languages and the presence of numerous "low-entropy" tokens. Motivated by these challenges, we propose a novel zero-shot synthetic code detector based on the similarity between the original code and its LLM-rewritten variants. Our method is based on the observation that differences between LLM-rewritten and original code tend to be smaller when the original code is synthetic. We utilize self-supervised contrastive learning to train a code similarity model and evaluate our approach on two synthetic code detection benchmarks. Our results demonstrate a significant improvement over existing SOTA synthetic content detectors, delivering notable gains in both performance and robustness on the APPS and MBPP benchmarks.
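The detection idea described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `similarity` function below is a simple character-level stand-in (difflib) for the paper's contrastively trained code similarity model, `ask_llm_to_rewrite` is a hypothetical hook for the rewriting LLM, and the threshold value is an arbitrary placeholder.

```python
import difflib
from statistics import mean

def similarity(a: str, b: str) -> float:
    # Stand-in for the paper's learned code-similarity model;
    # here, a plain character-level sequence-match ratio.
    return difflib.SequenceMatcher(None, a, b).ratio()

def detection_score(original: str, rewrites: list[str]) -> float:
    # Average similarity between the original code and several
    # LLM-produced rewrites of it. Per the paper's observation,
    # synthetic code tends to change less under rewriting, so
    # higher scores suggest the original is LLM-generated.
    return mean(similarity(original, r) for r in rewrites)

def is_synthetic(original: str, rewrites: list[str],
                 threshold: float = 0.8) -> bool:
    # Threshold is a placeholder; in practice it would be
    # calibrated on a validation set.
    return detection_score(original, rewrites) >= threshold
```

In use, `rewrites` would come from prompting an LLM to rewrite the candidate code several times (e.g. `rewrites = [ask_llm_to_rewrite(code) for _ in range(4)]`, with `ask_llm_to_rewrite` left unspecified here), and the score compared against a calibrated threshold.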

Published

2025-04-11

How to Cite

Ye, T., Du, Y., Ma, T., Wu, L., Zhang, X., Ji, S., & Wang, W. (2025). Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting. Proceedings of the AAAI Conference on Artificial Intelligence, 39(1), 968–976. https://doi.org/10.1609/aaai.v39i1.32082

Section

AAAI Technical Track on Application Domains