ConTextual Masked Auto-Encoder for Dense Passage Retrieval

Authors

  • Xing Wu (Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Kuaishou Technology)
  • Guangyuan Ma (Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences)
  • Meng Lin (Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences)
  • Zijia Lin (Kuaishou Technology)
  • Zhongyuan Wang (Kuaishou Technology)
  • Songlin Hu (Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences)

DOI:

https://doi.org/10.1609/aaai.v37i4.25598

Keywords:

DMKM: Web Search & Information Retrieval, SNLP: Language Models

Abstract

Dense passage retrieval aims to retrieve the passages relevant to a query from a large corpus based on dense representations (i.e., vectors) of the query and the passages. Recent studies have explored improving pre-trained language models to boost dense retrieval performance. This paper proposes CoT-MAE (ConTextual Masked Auto-Encoder), a simple yet effective generative pre-training method for dense passage retrieval. CoT-MAE employs an asymmetric encoder-decoder architecture that learns to compress the sentence semantics into a dense vector through self-supervised and context-supervised masked auto-encoding. Specifically, self-supervised masked auto-encoding learns to model the semantics of the tokens inside a text span, and context-supervised masked auto-encoding learns to model the semantic correlation between the text spans. We conduct experiments on large-scale passage retrieval benchmarks and show considerable improvements over strong baselines, demonstrating the effectiveness of CoT-MAE. Our code is available at https://github.com/caskcsg/ir/tree/main/cotmae.
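
The retrieval setting described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes that queries and passages have already been encoded into vectors by some encoder and that relevance is scored by inner product, which is a common choice but one the abstract does not state. The function names `dot` and `rank_passages` and the toy vectors are hypothetical.

```python
from typing import Dict, List, Tuple


def dot(u: List[float], v: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_passages(
    query_vec: List[float],
    passage_vecs: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (passage_id, score) pairs, highest score first.

    Assumption: relevance is the inner product of the query vector and
    the passage vector; the vectors themselves come from an encoder
    that is outside the scope of this sketch.
    """
    scored = [(pid, dot(query_vec, vec)) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy, hand-written vectors standing in for encoder outputs.
passages = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.1],
    "p3": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

top = rank_passages(query, passages)
print([pid for pid, _ in top])
```

In a real system the exhaustive loop over passages would typically be replaced by an approximate nearest-neighbor index; the sketch only shows the scoring step that dense representations make possible.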

Published

2023-06-26

How to Cite

Wu, X., Ma, G., Lin, M., Lin, Z., Wang, Z., & Hu, S. (2023). ConTextual Masked Auto-Encoder for Dense Passage Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 37(4), 4738-4746. https://doi.org/10.1609/aaai.v37i4.25598

Section

AAAI Technical Track on Data Mining and Knowledge Management