Towards Long-window Anchoring in Vision-Language Model Distillation
DOI:
https://doi.org/10.1609/aaai.v40i34.40131
Abstract
While large vision-language models (VLMs) demonstrate impressive long-context understanding, their prevalent smaller counterparts fail at language-image alignment due to limited window size. We discover that knowledge distillation improves a student's capability, complementary to Rotary Position Embeddings (RoPE), at certain window sizes anchored from large models. Building on this insight, we propose LAid, which explicitly targets the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2× longer effective context windows than baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis further suggests that LAid preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insights into how positional understanding emerges and transfers during distillation.
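To make the two components concrete, here is a minimal PyTorch sketch, assuming a progressive distance-weighted MSE between student and teacher attention maps and a per-frequency learnable gain on the RoPE rotation angles. The names (distance_weighted_attn_loss, GatedRoPE) and hyperparameters (alpha, the linear weighting schedule) are hypothetical illustrations, not the paper's actual formulation.

```python
# Minimal sketch of the two components named in the abstract.
# All names and formulations here are assumptions for illustration;
# they are not LAid's published implementation.
import torch
import torch.nn as nn


def distance_weighted_attn_loss(student_attn, teacher_attn, step, total_steps, alpha=2.0):
    """Progressive distance-weighted attention matching (assumed form).

    student_attn, teacher_attn: (batch, heads, seq, seq) attention maps.
    The weight on long-range entries grows with training progress, so
    longer position differences are emphasized later in training.
    """
    seq_len = student_attn.size(-1)
    pos = torch.arange(seq_len, device=student_attn.device)
    # normalized |i - j| distance matrix, values in [0, 1]
    dist = (pos[None, :] - pos[:, None]).abs().float() / max(seq_len - 1, 1)
    progress = step / total_steps  # 0 -> 1 over training
    weight = 1.0 + alpha * progress * dist
    return (weight * (student_attn - teacher_attn).pow(2)).mean()


class GatedRoPE(nn.Module):
    """RoPE with a learnable per-frequency response gain (assumed form).

    The gain rescales each rotation frequency, letting distillation
    amplify position sensitivity where the teacher's attention needs it.
    """

    def __init__(self, head_dim, base=10000.0):
        super().__init__()
        inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
        self.register_buffer("inv_freq", inv_freq)
        self.gain = nn.Parameter(torch.ones(head_dim // 2))  # learnable modulation

    def forward(self, x, positions):
        # x: (batch, seq, head_dim); positions: (seq,) integer positions
        angles = positions[:, None].float() * (self.gain * self.inv_freq)[None, :]
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]  # interleaved pairs
        rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
        return rotated.flatten(-2)
```

Under these assumptions, a distillation step could sum distance_weighted_attn_loss over matched student/teacher layers alongside the usual task loss, advancing step with the optimizer so long-range attention entries gain weight as training progresses.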
Published
2026-03-14
How to Cite
Zhou, H., Li, S., Chen, T., Song, Q., Gao, C., & Li, J. (2026). Towards Long-window Anchoring in Vision-Language Model Distillation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 28955–28963. https://doi.org/10.1609/aaai.v40i34.40131
Issue
Section
AAAI Technical Track on Machine Learning XI