Diffusion-Assisted Progressive Learning for Weakly Supervised Phrase Localization

Authors

  • Pengyue Lin, Beijing University of Posts and Telecommunications
  • Yanyang Hu, Beijing University of Posts and Telecommunications
  • Xinjing Liu, Beijing University of Posts and Telecommunications
  • Wenqi Jia, Beijing University of Posts and Telecommunications
  • Fangxiang Feng, Beijing University of Posts and Telecommunications
  • Ruifan Li, Beijing University of Posts and Telecommunications

DOI:

https://doi.org/10.1609/aaai.v40i38.40473

Abstract

Weakly supervised phrase localization (WSPL) aims to localize the visual objects mentioned by given phrases without learning from human-annotated bounding boxes. Previous works struggle in multi-object scenarios, where background objects often appear alongside the target objects. To address this, we propose a Diffusion-Assisted PrOgressive learning framework (DAPO) for the WSPL task. Specifically, we score the difficulty of training samples based on the number of objects and the level of semantic alignment. These samples are then used progressively during training, in order of their difficulty scores. To address the sample imbalance problem, we propose a Generation-Assisted Tuning (GAT) method for the grounding network. First, to enrich the samples from few-object scenarios, we leverage Stable Diffusion (SD) to generate images from the phrases. Second, we introduce an attention-driven scheme to direct SD's attention to the mentioned objects. Finally, we design a diffusion-guided loss, which helps the grounding network learn the objects' layouts. Extensive experiments show that our DAPO framework outperforms strong baselines on benchmark datasets.
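
The difficulty-ordered training schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` fields, the linear scoring formula, the `weight` and `max_objects` parameters, and the use of cumulative training pools are all assumptions made for illustration, since the abstract only states that difficulty depends on object quantity and semantic alignment.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    phrase: str
    num_objects: int   # assumed: number of objects detected in the image
    alignment: float   # assumed: phrase-image alignment score in [0, 1]


def difficulty(sample: Sample, max_objects: int = 10, weight: float = 0.5) -> float:
    """Hypothetical score: more objects and weaker alignment mean harder."""
    crowding = min(sample.num_objects, max_objects) / max_objects
    return weight * crowding + (1.0 - weight) * (1.0 - sample.alignment)


def progressive_stages(samples, num_stages, max_objects=10):
    """Return cumulative training pools, with the easiest samples first."""
    ordered = sorted(samples, key=lambda s: difficulty(s, max_objects))
    stages = []
    for k in range(1, num_stages + 1):
        # Ceiling division so the final stage always covers every sample.
        cutoff = (len(ordered) * k + num_stages - 1) // num_stages
        stages.append(ordered[:cutoff])
    return stages
```

Under these assumptions, each stage extends the previous one with harder samples, so a training loop would iterate over `progressive_stages(samples, num_stages)` and train on each pool in turn.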

Published

2026-03-14

How to Cite

Lin, P., Hu, Y., Liu, X., Jia, W., Feng, F., & Li, R. (2026). Diffusion-Assisted Progressive Learning for Weakly Supervised Phrase Localization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38), 32024–32032. https://doi.org/10.1609/aaai.v40i38.40473

Section

AAAI Technical Track on Natural Language Processing III