Person Tube Retrieval via Language Description
This paper focuses on the problem of person tube (a sequence of bounding boxes which encloses a person in a video) retrieval using a natural language query. Different from images in person re-identification (re-ID) or person search, besides appearance, person tube contains abundant action and information. We exploit a 2D and a 3D residual networks (ResNets) to extract the appearance and action representation, respectively. To transform tubes and descriptions into a shared latent space where data from the two different modalities can be compared directly, we propose a Multi-Scale Structure Preservation (MSSP) approach. MSSP splits a person tube into several element-tubes on average, whose features are extracted by the two ResNets. Any number of consecutive element-tubes forms a sub-tube. MSSP considers the following constraints for sub-tubes and descriptions in the shared space. 1) Bidirectional ranking. Matching sub-tubes (resp. descriptions) should get ranked higher than incorrect ones for each description (resp. sub-tube). 2) External structure preservation. Sub-tubes (resp. descriptions) from different persons should stay away from each other. 3) Internal structure preservation. Sub-tubes (resp. descriptions) from the same person should be close to each other. Experimental results on person tube retrieval via language description and other two related tasks demonstrate the efficacy of MSSP.