Towards Long-window Anchoring in Vision-Language Model Distillation
DOI:
https://doi.org/10.1609/aaai.v40i34.40131
Abstract
While large vision-language models (VLMs) demonstrate impressive long-context understanding, their prevalent smaller counterparts fail at language-image alignment due to limited window size. We discover that knowledge distillation improves a student's capability, complementary to Rotary Position Embeddings (RoPE), at certain window sizes anchored from large models. Building on this insight, we propose LAid, which explicitly targets the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2× longer effective context windows than baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis further suggests that LAid preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insights into how positional understanding emerges and transfers during distillation.
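To make the two components concrete, here is a minimal PyTorch sketch, assuming a progressive distance-weighted MSE between student and teacher attention maps and a per-frequency learnable gain on the RoPE rotation angles. The names (distance_weighted_attn_loss, GatedRoPE) and hyperparameters (alpha, the linear weighting schedule) are hypothetical illustrations, not the paper's actual formulation.

```python
# Minimal sketch of the two components named in the abstract.
# All names and formulations here are assumptions for illustration;
# they are not LAid's published implementation.
import torch
import torch.nn as nn


def distance_weighted_attn_loss(student_attn, teacher_attn, step, total_steps, alpha=2.0):
    """Progressive distance-weighted attention matching (assumed form).

    student_attn, teacher_attn: (batch, heads, seq, seq) attention maps.
    The weight on long-range entries grows with training progress, so
    longer position differences are emphasized later in training.
    """
    seq_len = student_attn.size(-1)
    pos = torch.arange(seq_len, device=student_attn.device)
    # normalized |i - j| distance matrix, values in [0, 1]
    dist = (pos[None, :] - pos[:, None]).abs().float() / max(seq_len - 1, 1)
    progress = step / total_steps  # 0 -> 1 over training
    weight = 1.0 + alpha * progress * dist
    return (weight * (student_attn - teacher_attn).pow(2)).mean()


class GatedRoPE(nn.Module):
    """RoPE with a learnable per-frequency response gain (assumed form).

    The gain rescales each rotation frequency, letting distillation
    amplify position sensitivity where the teacher's attention needs it.
    """

    def __init__(self, head_dim, base=10000.0):
        super().__init__()
        inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
        self.register_buffer("inv_freq", inv_freq)
        self.gain = nn.Parameter(torch.ones(head_dim // 2))  # learnable modulation

    def forward(self, x, positions):
        # x: (batch, seq, head_dim); positions: (seq,) integer positions
        angles = positions[:, None].float() * (self.gain * self.inv_freq)[None, :]
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]  # interleaved pairs
        rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
        return rotated.flatten(-2)
```

Under these assumptions, a distillation step could sum distance_weighted_attn_loss over matched student/teacher layers alongside the usual task loss, advancing step with the optimizer so long-range attention entries gain weight as training progresses.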
Published
2026-03-14
How to Cite
Zhou, H., Li, S., Chen, T., Song, Q., Gao, C., & Li, J. (2026). Towards Long-window Anchoring in Vision-Language Model Distillation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(34), 28955–28963. https://doi.org/10.1609/aaai.v40i34.40131
Issue
Section
AAAI Technical Track on Machine Learning XI