Learning to LEAP: Efficient Dense Point Tracking by Focusing Where It Matters

Authors

  • Chenzhi Zhao, Beijing University of Posts and Telecommunications
  • Wufan Wang, Beijing University of Posts and Telecommunications
  • Bo Zhang, Beijing University of Posts and Telecommunications
  • Wendong Wang, Beijing University of Posts and Telecommunications

DOI:

https://doi.org/10.1609/aaai.v40i15.38311

Abstract

Tracking Any Point (TAP) is a foundational task in computer vision with broad applicability. The state-of-the-art self-supervised TAP method leverages a global matching transformer and contrastive random walks to learn point correspondences. However, its dense all-pairs attention and correlation volume computation tend to introduce irrelevant features and produce less informative training signals, degrading both learning efficiency and tracking accuracy. To address these limitations, we introduce LEAP-Track, a self-supervised TAP approach that computes the attention matrices and correlation volume over adaptively selected sparse pairs. It consists of two core designs: (1) Curriculum-based Sparse Attention (CSA), which dynamically focuses on the most relevant keys, promoting the learning of discriminative features; and (2) Progressive k-NN Transition (PkT), which reformulates the contrastive random walk to operate on an increasingly sparse k-NN affinity graph to reinforce the learning of the most informative correspondences. By integrating the above two designs into a two-stage training paradigm, LEAP-Track is shown both theoretically and empirically to effectively boost learning efficiency, achieving superior tracking accuracy over existing self-supervised TAP methods.
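To make the Progressive k-NN Transition idea concrete, the sketch below shows how a dense affinity matrix can be sparsified to its top-k entries per row and row-normalized into a stochastic transition matrix, as used in a contrastive random walk. This is a minimal illustrative example, not the authors' implementation; the function name and the toy data are hypothetical, and details such as temperature scaling and the curriculum schedule for k are omitted.

```python
import numpy as np

def knn_transition(affinity, k):
    """Sparsify an affinity matrix to its top-k entries per row,
    then row-normalize into a random-walk transition matrix.
    (Illustrative sketch only; not the paper's actual code.)"""
    n = affinity.shape[0]
    sparse = np.zeros_like(affinity)
    for i in range(n):
        # indices of the k largest affinities in row i
        idx = np.argpartition(affinity[i], -k)[-k:]
        sparse[i, idx] = affinity[i, idx]
    # normalize each row so it sums to 1 (a valid transition distribution)
    row_sums = sparse.sum(axis=1, keepdims=True)
    return sparse / np.clip(row_sums, 1e-12, None)

# toy example: 4 nodes with random pairwise affinities
rng = np.random.default_rng(0)
A = rng.random((4, 4))
P = knn_transition(A, k=2)  # each row keeps only its 2 strongest edges
```

In a progressive schedule, k would start large (near-dense transitions) and shrink over training, concentrating the walk, and hence the contrastive learning signal, on the most informative correspondences.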

Published

2026-03-14

How to Cite

Zhao, C., Wang, W., Zhang, B., & Wang, W. (2026). Learning to LEAP: Efficient Dense Point Tracking by Focusing Where It Matters. Proceedings of the AAAI Conference on Artificial Intelligence, 40(15), 13108–13116. https://doi.org/10.1609/aaai.v40i15.38311

Section

AAAI Technical Track on Computer Vision XII