HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models

Authors

  • Ziqin Zhou University of Adelaide
  • Yifan Yang Microsoft Research Asia
  • Yuqing Yang Microsoft Research Asia
  • Tianyu He Microsoft Research Asia
  • Houwen Peng Microsoft Research Asia
  • Kai Qiu Microsoft Research Asia
  • Qi Dai Microsoft Research Asia
  • Lili Qiu Microsoft Research Asia
  • Chong Luo Microsoft Research Asia
  • Lingqiao Liu University of Adelaide

DOI:

https://doi.org/10.1609/aaai.v40i16.38399

Abstract

Text-to-video generation poses significant challenges due to the inherent complexity of video data, which spans both temporal and spatial dimensions. Video introduces additional redundancy, abrupt variations, and a domain gap between language and vision tokens during generation. Addressing these challenges requires an effective video tokenizer that can efficiently encode video data while preserving essential semantic and spatiotemporal information, serving as a critical bridge between text and vision. Inspired by observations from VQ-VAE-2, we propose HiTVideo, a novel approach for text-to-video generation with hierarchical tokenizers. It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks. Higher layers capture semantic information at higher compression, while lower layers focus on fine-grained spatiotemporal details, striking a balance between compression efficiency and reconstruction quality. Our approach efficiently encodes longer video sequences (e.g., 8 seconds, 64 frames), reducing bits per pixel (bpp) by approximately 70% compared to previous tokenizers, while maintaining competitive reconstruction quality. We explore the trade-offs between compression and reconstruction, emphasizing the advantages of highly compressed semantic tokens in text-to-video tasks. HiTVideo addresses the limitations of existing video tokenizers in text-to-video generation, striving for higher compression ratios, improved token quality, and simplified LLM modeling under language guidance, offering a scalable and promising framework for advancing text-to-video generation.
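The hierarchical idea in the abstract (coarse semantic tokens at high compression, finer detail tokens at lower compression) can be illustrated with a toy sketch. This is not the paper's actual 3D causal VAE: the `quantize` and `hierarchical_tokenize` functions, the pooling-by-stride scheme, and all shapes and codebook sizes below are hypothetical simplifications to show how one latent sequence can yield multiple token layers of different rates.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (L2 distance)."""
    # latents: (N, D), codebook: (K, D) -> token ids: (N,)
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = d2.argmin(axis=1)
    return ids, codebook[ids]

def hierarchical_tokenize(latents, codebooks, strides):
    """Encode one latent sequence into multiple discrete token layers.

    Larger strides give fewer, coarser ("semantic") tokens; smaller
    strides keep more tokens for fine spatiotemporal detail.
    """
    layers = []
    for codebook, stride in zip(codebooks, strides):
        # Average-pool along the sequence axis to reach this layer's rate.
        n = latents.shape[0] // stride
        pooled = latents[: n * stride].reshape(n, stride, -1).mean(axis=1)
        ids, _ = quantize(pooled, codebook)
        layers.append(ids)
    return layers

# Toy example: 64 latent vectors of dim 8, two codebooks of 32 entries.
rng = np.random.default_rng(0)
latents = rng.normal(size=(64, 8))
codebooks = [rng.normal(size=(32, 8)), rng.normal(size=(32, 8))]
layers = hierarchical_tokenize(latents, codebooks, strides=[8, 2])
print([layer.shape for layer in layers])  # semantic layer is 4x shorter
```

In this sketch the stride-8 layer emits 8 tokens and the stride-2 layer emits 32, mirroring the compression/detail trade-off the abstract describes; an autoregressive LLM could then model the short semantic layer first and condition the detail layer on it.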

Published

2026-03-14

How to Cite

Zhou, Z., Yang, Y., Yang, Y., He, T., Peng, H., Qiu, K., … Liu, L. (2026). HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(16), 13898–13906. https://doi.org/10.1609/aaai.v40i16.38399

Section

AAAI Technical Track on Computer Vision XIII