Sublinear Time Approximation of Text Similarity Matrices

Authors

  • Archan Ray, University of Massachusetts Amherst
  • Nicholas Monath, University of Massachusetts Amherst
  • Andrew McCallum, University of Massachusetts Amherst
  • Cameron Musco, University of Massachusetts Amherst

DOI:

https://doi.org/10.1609/aaai.v36i7.20779

Keywords:

Machine Learning (ML)

Abstract

We study algorithms for approximating pairwise similarity matrices that arise in natural language processing. Generally, computing a similarity matrix for n data points requires Ω(n²) similarity computations. This quadratic scaling is a significant bottleneck, especially when similarities are computed via expensive functions, e.g., via transformer models. Approximation methods reduce this quadratic complexity, often by using a small subset of exactly computed similarities to approximate the remainder of the complete pairwise similarity matrix. Significant work focuses on the efficient approximation of positive semidefinite (PSD) similarity matrices, which arise, e.g., in kernel methods. However, much less is understood about indefinite (non-PSD) similarity matrices, which often arise in NLP. Motivated by the observation that many of these matrices are still somewhat close to PSD, we introduce a generalization of the popular Nyström method to the indefinite setting. Our algorithm can be applied to any similarity matrix and runs in sublinear time in the size of the matrix, producing a rank-s approximation with just O(ns) similarity computations. We show that our method, along with a simple variant of CUR decomposition, performs very well in approximating a variety of similarity matrices arising in NLP tasks. We demonstrate high accuracy of the approximated similarity matrices in tasks of document classification, sentence similarity, and cross-document coreference.
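To make the abstract's cost claim concrete, the classical Nyström construction it generalizes can be sketched as below: sample s landmark points, evaluate similarities only between all n points and the landmarks (O(ns) evaluations), and reconstruct the full n × n matrix from those blocks. This is a minimal illustration of the standard method, not the paper's indefinite-setting algorithm; the uniform landmark sampling and the pseudoinverse-based reconstruction are generic assumptions.

```python
import numpy as np

def nystrom_approx(similarity, X, s, seed=None):
    """Rank-<=s Nystrom approximation of the n x n similarity matrix
    for the points in X, using only O(n*s) calls to `similarity(a, b)`.

    Illustrative sketch: uniform landmark sampling plus a pseudoinverse
    of the landmark block (the pinv tolerates rank-deficient or
    indefinite landmark blocks, but this is not the paper's method).
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    landmarks = rng.choice(n, size=s, replace=False)
    # C: similarities between all n points and the s landmarks (n*s evaluations).
    C = np.array([[similarity(X[i], X[j]) for j in landmarks] for i in range(n)])
    # W: the s x s landmark-landmark block (rows of C at the landmark indices).
    W = C[landmarks, :]
    # Reconstruct: A_hat = C W^+ C^T, a rank-<=s approximation of A.
    return C @ np.linalg.pinv(W) @ C.T
```

When the true similarity matrix has rank at most s (e.g., a linear kernel over low-dimensional points), this reconstruction is exact for landmarks in generic position; otherwise it is an approximation whose quality depends on how well the landmarks cover the data.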

Published

2022-06-28

How to Cite

Ray, A., Monath, N., McCallum, A., & Musco, C. (2022). Sublinear Time Approximation of Text Similarity Matrices. Proceedings of the AAAI Conference on Artificial Intelligence, 36(7), 8072-8080. https://doi.org/10.1609/aaai.v36i7.20779

Section

AAAI Technical Track on Machine Learning II