DIAA: A Decoding-Efficient Inference Acceleration Approach for On-Device Large Language Models

Authors

  • Hao Tian Nanjing University
  • Sheng Lu Nanjing University
  • Fuwen Tian Nanjing University
  • Guangming Cui Nanjing University of Information Science and Technology
  • Zheng Li Nanjing University
  • Xuyun Zhang Macquarie University
  • Quan Z. Sheng Macquarie University
  • Wanchun Dou Nanjing University

DOI:

https://doi.org/10.1609/aaai.v40i31.39789

Abstract

Large Language Models (LLMs) have revolutionized intelligent interaction, enabling mobile applications such as personal assistants that execute locally on edge devices. Speculative decoding (SD) has emerged as a promising paradigm for accelerating LLM inference without compromising generation quality, operating in a draft-then-verify manner. However, given the constrained computing and memory resources of edge devices, existing SD works heavily rely on an auxiliary draft model, which incurs an additional memory burden and hinders adaptability, as well as on static token trees that yield suboptimal inference performance. To this end, we propose DIAA, a Decoding-efficient Inference Acceleration Approach for on-device LLMs. DIAA achieves plug-and-play, model-agnostic inference speedup with memory and computation efficiency on edge devices. Specifically, a pair of lightweight look-up tables (LUTs) is constructed via Top-K token sampling to cache historical tokens and their probabilities for rapid candidate drafting. DIAA integrates a dynamic token tree, built from the LUTs and updated during decoding to adapt to the online context, that enables parallel verification. A computation-overlap mechanism then pipelines the update operations of the token tree, LUTs, and KV cache to improve computational efficiency. Finally, in extensive experiments on the NVIDIA Jetson edge platform, DIAA outperforms existing baselines in generation speed and inference wall-clock time while incurring minimal memory overhead.
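To make the LUT-based drafting idea concrete, the following is a minimal sketch of the caching scheme the abstract describes: one table maps a context token to its recently observed Top-K successor tokens, and a companion table stores their probabilities, so candidate drafts can be proposed without invoking the full model. All class and method names here are illustrative assumptions, not the paper's actual implementation, which additionally maintains a dynamic token tree and overlaps updates with decoding.

```python
class TopKLUT:
    """Hypothetical sketch of a pair of draft look-up tables (names assumed,
    not taken from the DIAA paper). token_lut caches the model's recent Top-K
    successor tokens per context token; prob_lut caches their probabilities."""

    def __init__(self, k=3):
        self.k = k
        self.token_lut = {}  # context token -> list of candidate next tokens
        self.prob_lut = {}   # context token -> list of matching probabilities

    def update(self, context_token, topk_tokens, topk_probs):
        # After each real decoding step, cache the model's latest Top-K
        # output distribution for this context token.
        self.token_lut[context_token] = list(topk_tokens[: self.k])
        self.prob_lut[context_token] = list(topk_probs[: self.k])

    def draft(self, context_token):
        # Rapidly propose candidates from the cache, without a model call.
        return (self.token_lut.get(context_token, []),
                self.prob_lut.get(context_token, []))

    def draft_path(self, context_token, depth):
        # Chain greedy LUT hits into a multi-token draft; in DIAA the drafted
        # candidates would then be verified in parallel by the target model.
        path, cur = [], context_token
        for _ in range(depth):
            candidates, _probs = self.draft(cur)
            if not candidates:
                break  # cache miss: fall back to ordinary decoding
            path.append(candidates[0])
            cur = candidates[0]
        return path
```

The cache-then-draft loop avoids the auxiliary draft model of conventional SD: drafting costs only dictionary lookups, and memory overhead is bounded by K entries per cached context token.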

Published

2026-03-14

How to Cite

Tian, H., Lu, S., Tian, F., Cui, G., Li, Z., Zhang, X., … Dou, W. (2026). DIAA: A Decoding-Efficient Inference Acceleration Approach for On-Device Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(31), 25896–25904. https://doi.org/10.1609/aaai.v40i31.39789

Section

AAAI Technical Track on Machine Learning VIII