PocketLLM: Ultimate Compression of Large Language Models via Meta Networks

Authors

  • Ye Tian, Huawei Noah's Ark Lab
  • Chengcheng Wang, University of Sydney
  • Jing Han, Beijing University of Posts and Telecommunications
  • Yehui Tang, Huawei Noah's Ark Lab
  • Kai Han, Huawei Noah's Ark Lab

DOI:

https://doi.org/10.1609/aaai.v40i39.40610

Abstract

As Large Language Models (LLMs) continue to grow in size, storing and transmitting them on edge devices becomes increasingly challenging. Traditional methods like quantization and pruning struggle to achieve extreme compression of LLMs without sacrificing accuracy. In this paper, we introduce PocketLLM, a novel approach to compress LLMs in a latent space via meta-networks. A simple encoder network is proposed to project the weights of LLMs into discrete latent vectors, which are then represented using a compact codebook. A lightweight decoder network is employed to map the codebook's representative vectors back to the original weight space. The compressed representation consists solely of a small decoder, a compact codebook, and an index, enabling significant compression of the large weights in LLMs. Extensive experiments show that PocketLLM achieves superior performance even at significantly high compression ratios, e.g., compressing Llama 2-7B by 10x with a negligible drop in accuracy.
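To make the codebook-plus-index idea concrete, the sketch below quantizes a weight matrix into a small codebook of block vectors using plain k-means (Lloyd's algorithm) and reconstructs it by index lookup. This is only an illustrative toy in NumPy, not the paper's actual meta-network encoder/decoder; the block size, codebook size, and k-means routine are assumptions for the example.

```python
import numpy as np

def compress_weights(w, codebook_size=16, block=4, iters=10):
    """Toy vector quantization of a weight matrix (k-means over weight blocks).

    Returns a small codebook and, per block, the index of its nearest code vector.
    """
    vecs = w.reshape(-1, block)                       # split weights into blocks
    rng = np.random.default_rng(0)
    # Initialize the codebook from randomly chosen weight blocks.
    codebook = vecs[rng.choice(len(vecs), codebook_size, replace=False)].copy()
    for _ in range(iters):                            # Lloyd iterations
        # Squared distance from every block to every code vector.
        d = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(1)                             # assign each block to a code
        for k in range(codebook_size):                # update codes to cluster means
            mask = idx == k
            if mask.any():
                codebook[k] = vecs[mask].mean(0)
    return codebook, idx

def decompress(codebook, idx, shape):
    """Reconstruct the weight matrix by codebook lookup."""
    return codebook[idx].reshape(shape)
```

Storing only the codebook (`codebook_size × block` floats) and the per-block indices is what yields the compression: e.g., with 4-element blocks and a 16-entry codebook, each block of four 32-bit floats is replaced by a 4-bit index.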

Published

2026-03-14

How to Cite

Tian, Y., Wang, C., Han, J., Tang, Y., & Han, K. (2026). PocketLLM: Ultimate Compression of Large Language Models via Meta Networks. Proceedings of the AAAI Conference on Artificial Intelligence, 40(39), 33250–33258. https://doi.org/10.1609/aaai.v40i39.40610

Section

AAAI Technical Track on Natural Language Processing IV