On Finding Hubs in High Dimensions with Sampling

Authors

  • Huiwen Dong Beijing Normal University
  • Linghan Zeng University of Auckland
  • Zhiwen Zhao Beijing Normal University
  • Francesco Silvestri University of Padua
  • Ninh Pham University of Auckland

DOI:

https://doi.org/10.1609/aaai.v39i11.33261

Abstract

Hubs are a few points that frequently appear in the k-nearest neighbors (kNN) of many other points in a high-dimensional data set. The hubs' effects, called the hubness phenomenon, degrade the performance of kNN based models in high dimensions. We present SamHub, a simple sampling approach to efficiently identify hubs with theoretical guarantees. Apart from previous works based on approximate kNN indexes, SamHub is generic and applicable to any distance measure with negligible additional memory footprint. Empirically, by sampling only 10% of points, SamHub runs significantly faster and offers higher accuracy than existing hub detection methods on many real-world data sets with dot product, L1, L2, and dynamic time warping distances. Our ablation studies of SamHub on improving kNN-based classification show potential for other high-dimensional data analysis tasks.

Published

2025-04-11

How to Cite

Dong, H., Zeng, L., Zhao, Z., Silvestri, F., & Pham, N. (2025). On Finding Hubs in High Dimensions with Sampling. Proceedings of the AAAI Conference on Artificial Intelligence, 39(11), 11590–11597. https://doi.org/10.1609/aaai.v39i11.33261

Issue

Section

AAAI Technical Track on Data Mining & Knowledge Management I