Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs

Authors

  • Lei Zhang: Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University
  • Yunshui Li: Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
  • Jiaming Li: Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
  • Xiaobo Xia: School of Computing, National University of Singapore; University of Science and Technology of China
  • Jiaxi Yang: Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
  • Run Luo: Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
  • Minzheng Wang: University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
  • Longze Chen: Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
  • Junhao Liu: University of California, Irvine
  • Qiang Qu: Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
  • Min Yang: Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University

DOI:

https://doi.org/10.1609/aaai.v39i24.34782

Abstract

Several recently released Code Large Language Models (Code LLMs) have been trained on repository-level code data, enabling them to perceive repository structure and use cross-file code information. This capability makes it possible to concatenate the contents of repository code files directly into the prompt to achieve repository-level code completion. However, in real development scenarios, concatenating all repository files into a prompt can easily exceed the context window of Code LLMs, leading to a significant decline in completion performance. Overly long prompts also increase completion latency, degrading the user experience. In this study, we conducted extensive experiments, including completion error analysis, topological dependency analysis, and cross-file content analysis, to investigate the factors affecting repository-level code completion. Based on the conclusions drawn from these preliminary experiments, we propose a strategy called Hierarchical Context Pruning (HCP) to construct high-quality completion prompts. We applied HCP to six Code LLMs and evaluated them on the CrossCodeEval dataset. The experimental results show that, compared with previous methods, prompts constructed with our HCP strategy achieved higher completion accuracy on five of the six Code LLMs. In addition, HCP keeps the prompt length to around 8k tokens (whereas the full repository code totals approximately 50k tokens), significantly improving completion throughput. Our code and data will be publicly available.
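To make the idea behind the abstract concrete, the sketch below illustrates one plausible reading of hierarchical context pruning: rank repository files by their dependency distance to the file being completed, keep nearby files in full, reduce distant files to coarse signatures, and stop adding context once a token budget (roughly 8k tokens in the paper) is reached. All function names, the distance metric, and the token estimate here are illustrative assumptions, not the authors' implementation.

from collections import deque

def dependency_distance(repo_imports, target, source):
    """BFS hop count from `target` to `source` over the repository's import graph."""
    seen, queue = {target}, deque([(target, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == source:
            return dist
        for nxt in repo_imports.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")  # unreachable files are treated as least relevant

def estimate_tokens(text):
    """Crude token estimate; a real system would use the model's tokenizer."""
    return len(text.split())

def signatures_only(code):
    """Keep only def/class lines as a coarsely pruned view of a distant file."""
    return "\n".join(line for line in code.splitlines()
                     if line.lstrip().startswith(("def ", "class ")))

def build_prompt(repo_files, repo_imports, target_file, budget=8000):
    """Concatenate cross-file context, nearest files first, within a token budget."""
    others = [f for f in repo_files if f != target_file]
    others.sort(key=lambda f: dependency_distance(repo_imports, target_file, f))
    parts, used = [], 0
    for f in others:
        # Near files keep full content; farther files are pruned to signatures.
        if dependency_distance(repo_imports, target_file, f) <= 1:
            body = repo_files[f]
        else:
            body = signatures_only(repo_files[f])
        cost = estimate_tokens(body)
        if used + cost > budget:
            continue
        parts.append(f"# file: {f}\n{body}")
        used += cost
    parts.append(f"# file: {target_file}\n{repo_files[target_file]}")
    return "\n\n".join(parts)

if __name__ == "__main__":
    # Toy repository for demonstration only.
    files = {
        "utils.py": "def helper(x):\n    return x * 2\n",
        "models.py": "class Model:\n    def forward(self, x):\n        return x\n",
        "main.py": "from utils import helper\n\ndef run():\n    return helper(",
    }
    imports = {"main.py": ["utils.py"], "models.py": []}
    print(build_prompt(files, imports, "main.py", budget=200))

In this toy example, utils.py (imported by main.py) is included in full, while models.py, which is unrelated to the completion target, contributes only its signatures, mirroring the paper's goal of keeping prompts short without discarding relevant cross-file context.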

Published

2025-04-11

How to Cite

Zhang, L., Li, Y., Li, J., Xia, X., Yang, J., Luo, R., … Yang, M. (2025). Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 39(24), 25886–25894. https://doi.org/10.1609/aaai.v39i24.34782

Section

AAAI Technical Track on Natural Language Processing III