Combining Multiple Supervision for Robust Zero-Shot Dense Retrieval

Authors

  • Yan Fang, Quan Cheng Laboratory & DCST, Tsinghua University & Zhongguancun Laboratory
  • Qingyao Ai, Quan Cheng Laboratory & DCST, Tsinghua University & Zhongguancun Laboratory
  • Jingtao Zhan, Quan Cheng Laboratory & DCST, Tsinghua University & Zhongguancun Laboratory
  • Yiqun Liu, Quan Cheng Laboratory & DCST, Tsinghua University & Zhongguancun Laboratory
  • Xiaolong Wu, Huawei Poisson Lab
  • Zhao Cao, Huawei Poisson Lab

DOI:

https://doi.org/10.1609/aaai.v38i16.29755

Keywords:

NLP: Applications, DMKM: Web, NLP: Information Extraction

Abstract

Recently, dense retrieval (DR) models, which represent queries and documents with fixed-width vectors and retrieve relevant ones via nearest neighbor search, have drawn increasing attention from the IR community. However, previous studies have shown that the effectiveness of DR critically relies on sufficient training signals, which leads to severe performance degradation in out-of-domain scenarios, where large-scale training data are usually unavailable. To address this problem, existing studies adopt a data-augmentation-plus-joint-training paradigm that constructs weak/pseudo supervision on the target domain and combines it with large-scale human-annotated data on the source domain to train the DR models. However, they do not explicitly distinguish the data and the supervision signals during training, and simply assume that the DR models are powerful enough to capture and memorize different domain knowledge and relevance matching patterns without guidance, which, as shown in this paper, is not true. Based on this observation, we propose a Robust Multi-Supervision Combining strategy (RMSC) that decouples the domain and supervision signals by explicitly telling the DR models how the domain data and supervision signals are combined in the training data, using specially designed soft tokens. With the extra soft tokens storing domain-specific and supervision-specific knowledge, RMSC allows the DR models to conduct retrieval based on human-like relevance matching patterns and the target-specific language distribution on the target domain, without human annotations. Extensive experiments on zero-shot DR benchmarks show that RMSC significantly improves ranking performance on the target domain compared to strong DR baselines and domain adaptation methods, while remaining stable during training and compatible with query generation or second-stage pre-training.
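The core idea described in the abstract, conditioning the encoder on explicit domain and supervision markers via prepended soft tokens, can be sketched as follows. This is a minimal illustration only: the token names, embedding dimension, and mean pooling are assumptions for clarity, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding width; real DR models use hundreds of dimensions

# Hypothetical lookup tables of learned soft-token embeddings: one token per
# domain and one per supervision type (names here are illustrative).
domain_tokens = {"source": rng.normal(size=DIM), "target": rng.normal(size=DIM)}
supervision_tokens = {"human": rng.normal(size=DIM), "pseudo": rng.normal(size=DIM)}

def encode(token_embeddings: np.ndarray, domain: str, supervision: str) -> np.ndarray:
    """Prepend the domain and supervision soft tokens to the input's token
    embeddings, then pool into one fixed-width vector (mean pooling for brevity)."""
    prefixed = np.vstack([
        domain_tokens[domain],
        supervision_tokens[supervision],
        token_embeddings,
    ])
    vec = prefixed.mean(axis=0)
    return vec / np.linalg.norm(vec)  # unit-normalize for dot-product retrieval

# Toy "query" of three token embeddings; at inference on the target domain,
# the model is told to use human-like supervision patterns without annotations.
query = rng.normal(size=(3, DIM))
q_vec = encode(query, domain="target", supervision="human")
print(q_vec.shape)  # (8,)
```

The point of the sketch is that the same encoder weights serve all domain/supervision combinations; only the prepended soft tokens change, which is what lets the model apply source-domain relevance patterns to target-domain text at test time.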

Published

2024-03-24

How to Cite

Fang, Y., Ai, Q., Zhan, J., Liu, Y., Wu, X., & Cao, Z. (2024). Combining Multiple Supervision for Robust Zero-Shot Dense Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 17994-18002. https://doi.org/10.1609/aaai.v38i16.29755

Section

AAAI Technical Track on Natural Language Processing I