Multi-Branch Self-Drafting for LLM Inference Acceleration

Authors

  • Zipeng Gao University of Science and Technology of China
  • Qingrong Xia Huawei Cloud
  • Tong Xu University of Science and Technology of China
  • Xinyu Duan Huawei Cloud
  • Zhi Zheng University of Science and Technology of China
  • Zhefeng Wang Huawei Cloud
  • Enhong Chen University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v39i22.34567

Abstract

The autoregressive decoding paradigm endows large language models (LLMs) with superior language generation capabilities; however, its step-by-step decoding process inherently limits decoding speed. To mitigate this constraint, the prevalent "draft and validation" strategy validates candidate drafts in parallel, allowing LLMs to decode multiple tokens in a single forward pass. However, existing methods for obtaining drafts often incur additional overhead, such as communication costs, extra training, or statistical biases inherited from the corpus. To this end, we propose an innovative draft generation and maintenance approach that leverages the capabilities of the LLM itself. Specifically, we extend the autoregressive decoding paradigm to a multi-branch drafting procedure, which efficiently generates draft sequences without any additional models or training, while preserving the quality of the generated content by keeping the LLM's parameters unchanged. Experiments across various open-source benchmarks show that our method generates 2.0 to 3.2 tokens per forward step and achieves roughly a 2x improvement in end-to-end throughput over autoregressive decoding.
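To illustrate the general "draft and validation" loop the abstract refers to (not the authors' specific multi-branch algorithm), the sketch below uses a toy deterministic "model": a cheap drafter proposes `k` tokens, one batched verification step computes the target model's token at every draft position, and the longest agreeing prefix (plus the target's correction token) is accepted. The function names and the toy transition rule are illustrative assumptions.

```python
def target_next(token):
    # Toy deterministic "LLM": the next token is a fixed function of the last.
    return (token * 31 + 7) % 101

def draft_branch(token, k):
    # Hypothetical cheap drafter producing k candidate tokens; here it stands
    # in for draft sequences generated by the LLM's own extra branches.
    out = []
    for _ in range(k):
        token = target_next(token)
        out.append(token)
    return out

def verify(last_token, draft):
    # One "parallel forward pass": compute the target token at each draft
    # position, then accept the longest prefix where draft == target.
    accepted = []
    cur = last_token
    for d in draft:
        t = target_next(cur)
        accepted.append(t)   # the target's own token is always usable
        if t != d:           # first mismatch ends acceptance
            break
        cur = t
    return accepted

def generate(seed, length, k=4):
    tokens = [seed]
    steps = 0
    while len(tokens) - 1 < length:
        draft = draft_branch(tokens[-1], k)
        tokens += verify(tokens[-1], draft)
        steps += 1
    return tokens[1 : length + 1], steps
```

Because acceptance is gated by the target model's own predictions, the output is identical to plain autoregressive decoding; the speedup comes from accepting several tokens per verification step (here, a perfect drafter yields `k` tokens per step, matching the paper's reported 2.0 to 3.2 tokens per forward step for a real model).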

Published

2025-04-11

How to Cite

Gao, Z., Xia, Q., Xu, T., Duan, X., Zheng, Z., Wang, Z., & Chen, E. (2025). Multi-Branch Self-Drafting for LLM Inference Acceleration. Proceedings of the AAAI Conference on Artificial Intelligence, 39(22), 23942-23950. https://doi.org/10.1609/aaai.v39i22.34567

Section

AAAI Technical Track on Natural Language Processing I