Localize, Assemble, and Predicate: Contextual Object Proposal Embedding for Visual Relation Detection


  • Ruihai Wu Peking University
  • Kehan Xu Peking University
  • Chenchen Liu Peking University
  • Nan Zhuang Peking University
  • Yadong Mu Peking University




Visual relation detection (VRD) aims to describe all interacting objects in an image using subject-predicate-object triplets. Critically, the number of valid relations grows combinatorially as O(C^2 R) for C object categories and R relationships. The frequencies of relation triplets follow a long-tailed distribution, which inevitably biases the learned VRD model towards popular visual relations. To address this problem, we propose the localize-assemble-predicate network (LAP-Net), which decomposes VRD into three sub-tasks: localizing individual objects, assembling them into subject-object pairs, and predicting the relationship of each pair. In the first stage of LAP-Net, a Region Proposal Network (RPN) generates a few class-agnostic object proposals. Next, these proposals are assembled into subject-object pairs by a second network, the Pair Proposal Network (PPN), for which we propose a novel contextual embedding scheme. The inner product between embedded representations faithfully reflects the compatibility of a pair of proposals, without estimating the subject or object class. Top-ranked pairs from stage two are fed into a third sub-network, which precisely estimates the relationship. The whole pipeline except for the last stage is object-category-agnostic when localizing relationships in an image, which alleviates the bias towards popular relations induced by the training data. LAP-Net can be trained end to end. We demonstrate that LAP-Net achieves state-of-the-art performance on the VRD benchmark while maintaining high inference speed.
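
The pair-assembly idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each proposal already has a hypothetical "subject-role" and "object-role" embedding vector (how the PPN computes its contextual embeddings is not described here), scores every ordered pair by inner product, and keeps the top-ranked pairs. The function names `inner`, `top_k_pairs`, and `count_triplets`, and all numbers, are illustrative assumptions.

```python
def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_pairs(subj_emb, obj_emb, k):
    """Rank ordered (subject, object) proposal pairs by inner-product score.

    subj_emb[i] and obj_emb[i] are assumed embeddings of proposal i in the
    subject role and the object role. No object class is used anywhere.
    """
    scored = [
        (inner(s, o), i, j)
        for i, s in enumerate(subj_emb)
        for j, o in enumerate(obj_emb)
        if i != j  # a proposal is not paired with itself
    ]
    # Highest compatibility first; only the k best pairs go to the next stage.
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(i, j, score) for score, i, j in scored[:k]]


def count_triplets(num_categories, num_relations):
    """Size of the triplet label space, C^2 * R."""
    return num_categories ** 2 * num_relations


# Illustrative data: three proposals with 2-dimensional embeddings.
subj_emb = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
obj_emb = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.1]]

for i, j, score in top_k_pairs(subj_emb, obj_emb, k=2):
    print(f"subject proposal {i} -> object proposal {j}: score {score:.2f}")

# With illustrative values C = 100 and R = 70, the label space is large.
print(count_triplets(100, 70))  # 700000
```

Scoring pairs this way needs only one inner product per ordered pair and never touches class labels, which matches the category-agnostic behaviour the abstract claims for the first two stages.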




How to Cite

Wu, R., Xu, K., Liu, C., Zhuang, N., & Mu, Y. (2020). Localize, Assemble, and Predicate: Contextual Object Proposal Embedding for Visual Relation Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07), 12297-12304. https://doi.org/10.1609/aaai.v34i07.6913



AAAI Technical Track: Vision