TokenPowerBench: Benchmarking the Power Consumption of LLM Inference

Authors

  • Chenxu Niu Texas Tech University
  • Wei Zhang Texas Advanced Computing Center
  • Jie Li Texas Tech University
  • Yongjian Zhao Texas Tech University
  • Tongyang Wang Texas Tech University
  • Xi Wang Southeast University
  • Yong Chen Texas Tech University

DOI:

https://doi.org/10.1609/aaai.v40i38.40535

Abstract

Large language model (LLM) services now answer billions of queries per day, and industry reports show that inference, not training, accounts for more than 90% of total power consumption. However, existing benchmarks focus either on training/fine-tuning or on inference performance, and provide little support for measuring and analyzing the power consumption of inference. We introduce TokenPowerBench, the first lightweight and extensible benchmark designed for LLM-inference power consumption studies. The benchmark combines a declarative configuration interface covering model choice, prompt set, and inference engine; a measurement layer that captures GPU-, node-, and system-level power without specialized power meters; and a phase-aligned metrics pipeline that attributes energy to the prefill and decode stages of every request. These elements make it straightforward to explore the power consumed by an LLM inference run; furthermore, by varying batch size, context length, parallelism strategy, and quantization, users can quickly assess how each setting affects joules per token and other energy-efficiency metrics. We evaluate TokenPowerBench on four of the most widely used model series (Llama, Falcon, Qwen, and Mistral). Our experiments cover models from 1 billion parameters up to the frontier-scale Llama3-405B model. Furthermore, we release TokenPowerBench as open source to help users measure power consumption, forecast operating expenses, and meet sustainability targets when deploying LLM services.
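To illustrate the phase-aligned energy attribution the abstract describes, the sketch below integrates timestamped power samples over a time window and reports joules per token. This is a minimal illustration, not the benchmark's actual implementation: the function names (`integrate_energy`, `joules_per_token`) and the assumption that phase boundaries coincide with sample timestamps are ours. On NVIDIA GPUs, real samples could be gathered with NVML (e.g. `pynvml.nvmlDeviceGetPowerUsage`, which reports milliwatts); here the samples are supplied directly so the logic stands alone.

```python
from typing import List, Optional, Tuple

# A sample is (timestamp in seconds, instantaneous power in watts),
# e.g. polled from NVML or a node-level power counter.
Sample = Tuple[float, float]


def integrate_energy(samples: List[Sample],
                     t_start: Optional[float] = None,
                     t_end: Optional[float] = None) -> float:
    """Trapezoidal integration of power samples over [t_start, t_end],
    returning energy in joules. For simplicity this sketch assumes the
    phase boundaries (e.g. end of prefill) fall on sample timestamps;
    a production tool would interpolate at the window edges."""
    pts = [(t, p) for t, p in samples
           if (t_start is None or t >= t_start)
           and (t_end is None or t <= t_end)]
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(pts, pts[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules


def joules_per_token(samples: List[Sample], tokens: int) -> float:
    """Energy-efficiency metric: total joules divided by tokens produced."""
    return integrate_energy(samples) / tokens


# Example: constant 100 W during a 1 s prefill, ramping to 300 W
# over a 1 s decode, with prefill ending at t = 1.0 s.
trace = [(0.0, 100.0), (1.0, 100.0), (2.0, 300.0)]
prefill_j = integrate_energy(trace, t_end=1.0)    # 100 J
decode_j = integrate_energy(trace, t_start=1.0)   # 200 J
```

Splitting the integral at the prefill/decode boundary is what lets per-phase costs be compared across batch sizes and context lengths, since prefill energy grows with prompt length while decode energy grows with generated tokens.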

Published

2026-03-14

How to Cite

Niu, C., Zhang, W., Li, J., Zhao, Y., Wang, T., Wang, X., & Chen, Y. (2026). TokenPowerBench: Benchmarking the Power Consumption of LLM Inference. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 32582–32590. https://doi.org/10.1609/aaai.v40i38.40535

Section

AAAI Technical Track on Natural Language Processing III