Multi-Branch Self-Drafting for LLM Inference Acceleration

Authors

  • Zipeng Gao University of Science and Technology of China
  • Qingrong Xia Huawei Cloud
  • Tong Xu University of Science and Technology of China
  • Xinyu Duan Huawei Cloud
  • Zhi Zheng University of Science and Technology of China
  • Zhefeng Wang Huawei Cloud
  • Enhong Chen University of Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v39i22.34567

Abstract

The autoregressive decoding paradigm endows large language models (LLMs) with superior language generation capabilities; however, its step-by-step decoding process inherently limits decoding speed. To mitigate this constraint, the prevalent "draft and validation" strategy validates candidate drafts in parallel, allowing LLMs to decode multiple tokens in a single forward pass. However, existing methods for obtaining drafts often incur additional overhead, such as communication costs, extra training, or statistical biases inherited from the corpus. To this end, we propose an innovative draft generation and maintenance approach that leverages the capabilities of the LLM itself. Specifically, we extend the autoregressive decoding paradigm to a multi-branch drafting procedure, which efficiently generates draft sequences without any additional models or training, while preserving the quality of the generated content by keeping the LLM's parameters unchanged. Experiments across various open-source benchmarks show that our method generates 2.0 to 3.2 tokens per forward step and achieves roughly a 2x improvement in end-to-end throughput over autoregressive decoding.
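To illustrate the general "draft and validation" loop the abstract refers to (not the authors' specific multi-branch algorithm), the sketch below uses a toy deterministic "model": a cheap drafter proposes `k` tokens, one batched verification step computes the target model's token at every draft position, and the longest agreeing prefix (plus the target's correction token) is accepted. The function names and the toy transition rule are illustrative assumptions.

```python
def target_next(token):
    # Toy deterministic "LLM": the next token is a fixed function of the last.
    return (token * 31 + 7) % 101

def draft_branch(token, k):
    # Hypothetical cheap drafter producing k candidate tokens; here it stands
    # in for draft sequences generated by the LLM's own extra branches.
    out = []
    for _ in range(k):
        token = target_next(token)
        out.append(token)
    return out

def verify(last_token, draft):
    # One "parallel forward pass": compute the target token at each draft
    # position, then accept the longest prefix where draft == target.
    accepted = []
    cur = last_token
    for d in draft:
        t = target_next(cur)
        accepted.append(t)   # the target's own token is always usable
        if t != d:           # first mismatch ends acceptance
            break
        cur = t
    return accepted

def generate(seed, length, k=4):
    tokens = [seed]
    steps = 0
    while len(tokens) - 1 < length:
        draft = draft_branch(tokens[-1], k)
        tokens += verify(tokens[-1], draft)
        steps += 1
    return tokens[1 : length + 1], steps
```

Because acceptance is gated by the target model's own predictions, the output is identical to plain autoregressive decoding; the speedup comes from accepting several tokens per verification step (here, a perfect drafter yields `k` tokens per step, matching the paper's reported 2.0 to 3.2 tokens per forward step for a real model).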

Published

2025-04-11

How to Cite

Gao, Z., Xia, Q., Xu, T., Duan, X., Zheng, Z., Wang, Z., & Chen, E. (2025). Multi-Branch Self-Drafting for LLM Inference Acceleration. Proceedings of the AAAI Conference on Artificial Intelligence, 39(22), 23942-23950. https://doi.org/10.1609/aaai.v39i22.34567

Section

AAAI Technical Track on Natural Language Processing I