Towards Long-window Anchoring in Vision-Language Model Distillation

Authors

  • Haoyi Zhou School of Software, Beihang University, Beijing, China
  • Shuo Li SKLCCSE, School of Computer Science and Engineering, Beihang University, Beijing, China
  • Tianyu Chen SKLCCSE, School of Computer Science and Engineering, Beihang University, Beijing, China
  • Qi Song School of Software, Beihang University, Beijing, China
  • Chonghan Gao SKLCCSE, School of Computer Science and Engineering, Beihang University, Beijing, China
  • Jianxin Li SKLCCSE, School of Computer Science and Engineering, Beihang University, Beijing, China; Zhongguancun Laboratory, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v40i34.40131

Abstract

While large vision-language models (VLMs) demonstrate impressive long-context understanding, their prevalent small counterparts fail at vision-language alignment due to limited window sizes. We find that knowledge distillation improves a student's capability, complementary to Rotary Position Embeddings (RoPE), at certain window sizes (anchored from large models). Building on this insight, we propose LAid, which explicitly targets the transfer of long-range attention mechanisms through two complementary components: (1) progressive distance-weighted attention matching, which dynamically emphasizes longer position differences during training, and (2) learnable RoPE response gain modulation, which selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2× longer effective context windows than baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis further suggests that LAid preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insights into how positional understanding emerges and transfers during distillation.
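The first component described above, progressive distance-weighted attention matching, can be illustrated with a minimal sketch. The function name, the linear interpolation schedule, and the mean-squared-error form below are assumptions for illustration, not the paper's actual implementation; the abstract specifies only that longer position differences are emphasized more as training progresses.

```python
import numpy as np

def distance_weighted_attn_loss(attn_teacher, attn_student, progress):
    """Hypothetical sketch of progressive distance-weighted attention
    matching. `progress` in [0, 1] shifts emphasis toward longer
    position differences as training advances.

    attn_teacher, attn_student: (seq_len, seq_len) attention maps.
    """
    seq_len = attn_teacher.shape[0]
    idx = np.arange(seq_len)
    # Normalized |i - j| position distance for every query/key pair.
    dist = np.abs(idx[:, None] - idx[None, :]) / max(seq_len - 1, 1)
    # Early in training, weight all distances evenly; later, up-weight
    # long-range entries (a simple linear interpolation schedule).
    weights = (1.0 - progress) + progress * dist
    # Weighted mean-squared error between the two attention maps.
    return float(np.mean(weights * (attn_teacher - attn_student) ** 2))
```

At `progress = 0` this reduces to a plain mean-squared error over the attention maps; at `progress = 1` each entry's penalty scales with its query-key distance, so mismatches between distant positions dominate the loss.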

Published

2026-03-14

How to Cite

Zhou, H., Li, S., Chen, T., Song, Q., Gao, C., & Li, J. (2026). Towards Long-window Anchoring in Vision-Language Model Distillation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 28955–28963. https://doi.org/10.1609/aaai.v40i34.40131

Section

AAAI Technical Track on Machine Learning XI