DIAA: A Decoding-Efficient Inference Acceleration Approach for On-Device Large Language Models

Authors

  • Hao Tian Nanjing University
  • Sheng Lu Nanjing University
  • Fuwen Tian Nanjing University
  • Guangming Cui Nanjing University of Information Science and Technology
  • Zheng Li Nanjing University
  • Xuyun Zhang Macquarie University
  • Quan Z. Sheng Macquarie University
  • Wanchun Dou Nanjing University

DOI:

https://doi.org/10.1609/aaai.v40i31.39789

Abstract

Large Language Models (LLMs) have revolutionized intelligent interaction, enabling mobile applications such as personal assistants that execute locally on edge devices. Speculative decoding (SD) has emerged as a promising paradigm for accelerating LLM inference without compromising generation quality, operating in a draft-then-verify manner. However, given the constrained computing and memory resources of edge devices, existing SD works heavily rely on an auxiliary draft model, which incurs an additional memory burden and hinders adaptability, as well as on static token trees that yield suboptimal inference performance. To this end, we propose DIAA, a Decoding-efficient Inference Acceleration Approach for on-device LLMs. DIAA achieves plug-and-play, model-agnostic inference speedup with memory and computation efficiency on edge devices. Specifically, a pair of lightweight look-up tables (LUTs) is constructed via Top-K token sampling to cache historical tokens and their probabilities for rapid candidate drafting. DIAA integrates a dynamic token tree, built from the LUTs and updated during decoding to adapt to the online context, that enables parallel verification. A computation-overlap mechanism then pipelines the update operations of the token tree, LUTs, and KV cache to improve computational efficiency. Finally, in extensive experiments on the NVIDIA Jetson edge platform, DIAA outperforms existing baselines in generation speed and inference wall-clock time while incurring minimal memory overhead.
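To make the LUT-based drafting idea concrete, the following is a minimal sketch of the caching scheme the abstract describes: one table maps a context token to its recently observed Top-K successor tokens, and a companion table stores their probabilities, so candidate drafts can be proposed without invoking the full model. All class and method names here are illustrative assumptions, not the paper's actual implementation, which additionally maintains a dynamic token tree and overlaps updates with decoding.

```python
class TopKLUT:
    """Hypothetical sketch of a pair of draft look-up tables (names assumed,
    not taken from the DIAA paper). token_lut caches the model's recent Top-K
    successor tokens per context token; prob_lut caches their probabilities."""

    def __init__(self, k=3):
        self.k = k
        self.token_lut = {}  # context token -> list of candidate next tokens
        self.prob_lut = {}   # context token -> list of matching probabilities

    def update(self, context_token, topk_tokens, topk_probs):
        # After each real decoding step, cache the model's latest Top-K
        # output distribution for this context token.
        self.token_lut[context_token] = list(topk_tokens[: self.k])
        self.prob_lut[context_token] = list(topk_probs[: self.k])

    def draft(self, context_token):
        # Rapidly propose candidates from the cache, without a model call.
        return (self.token_lut.get(context_token, []),
                self.prob_lut.get(context_token, []))

    def draft_path(self, context_token, depth):
        # Chain greedy LUT hits into a multi-token draft; in DIAA the drafted
        # candidates would then be verified in parallel by the target model.
        path, cur = [], context_token
        for _ in range(depth):
            candidates, _probs = self.draft(cur)
            if not candidates:
                break  # cache miss: fall back to ordinary decoding
            path.append(candidates[0])
            cur = candidates[0]
        return path
```

The cache-then-draft loop avoids the auxiliary draft model of conventional SD: drafting costs only dictionary lookups, and memory overhead is bounded by K entries per cached context token.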

Published

2026-03-14

How to Cite

Tian, H., Lu, S., Tian, F., Cui, G., Li, Z., Zhang, X., … Dou, W. (2026). DIAA: A Decoding-Efficient Inference Acceleration Approach for On-Device Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(31), 25896–25904. https://doi.org/10.1609/aaai.v40i31.39789

Section

AAAI Technical Track on Machine Learning VIII