TY  - JOUR
AU  - Cao, Shuqiang
AU  - Wang, Bairui
AU  - Zhang, Wei
AU  - Ma, Lin
PY  - 2022/06/28
Y2  - 2024/03/28
TI  - Visual Consensus Modeling for Video-Text Retrieval
JF  - Proceedings of the AAAI Conference on Artificial Intelligence
JA  - AAAI
VL  - 36
IS  - 1
SE  - AAAI Technical Track on Computer Vision I
DO  - 10.1609/aaai.v36i1.19891
UR  - https://ojs.aaai.org/index.php/AAAI/article/view/19891
SP  - 167-175
AB  - In this paper, we propose a novel method to mine the commonsense knowledge shared between the video and text modalities for video-text retrieval, namely visual consensus modeling. Different from the existing works, which learn the video and text representations and their complicated relationships solely based on the pairwise video-text data, we make the first attempt to model the visual consensus by mining the visual concepts from videos and exploiting their co-occurrence patterns within the video and text modalities with no reliance on any additional concept annotations. Specifically, we build a shareable and learnable graph as the visual consensus, where the nodes denoting the mined visual concepts and the edges connecting the nodes representing the co-occurrence relationships between the visual concepts. Extensive experimental results on the public benchmark datasets demonstrate that our proposed method, with the ability to effectively model the visual consensus, achieves state-of-the-art performances on the bidirectional video-text retrieval task. Our code is available at https://github.com/sqiangcao99/VCM.
ER  - 