Localize, Assemble, and Predicate: Contextual Object Proposal Embedding for Visual Relation Detection
Visual relation detection (VRD) aims to describe all interacting objects in an image using subject-predicate-object triplets. Critically, valid relations combinatorially grow in O(C2R) for C object categories and R relationships. The frequencies of relation triplets exhibit a long-tailed distribution, which inevitably leads to bias towards popular visual relations in the learned VRD model. To address this problem, we propose localize-assemble-predicate network (LAP-Net), which decomposes VRD into three sub-tasks: localizing individual objects, assembling and predicting the subject-object pairs. In the first stage of LAP-Net, Region Proposal Network (RPN) is used to generate a few class-agnostic object proposals. Next, these proposals are assembled to form subject-object pairs via a second Pair Proposal Network (PPN), in which we propose a novel contextual embedding scheme. The inner product between embedded representations faithfully reflects the compatibility between a pair of proposals, without estimating object and subject class. Top-ranked pairs from stage two are fed into a third sub-network, which precisely estimates the relationship. The whole pipeline except for the last stage is object-category-agnostic in localizing relationships in an image, alleviating the bias in popular relations induced by training data. Our LAP-Net can be trained in an end-to-end fashion. We demonstrate that LAP-Net achieves state-of-the-art performance on the VRD benchmark while maintaining high speed in inference.