Intermediate N-Gramming: Deterministic and Fast N-Grams for Large N and Large Datasets

Authors

  • Ryan R. Curtin, Booz Allen Hamilton
  • Fred Lu, University of Maryland, Baltimore County
  • Edward Raff, CrowdStrike; University of Maryland, Baltimore County
  • Priyanka Ranade, Laboratory for Physical Sciences

DOI:

https://doi.org/10.1609/aaai.v40i17.38482

Abstract

The number of n-gram features grows exponentially in n, making it computationally demanding to compute the most frequent n-grams even for n as small as 3. Motivated by our production machine learning system built on n-gram features, we ask: is it possible to accurately, deterministically, and quickly recover the top-k most frequent n-grams? We devise a multi-pass algorithm called Intergrams that constructs candidate n-grams from the preceding (n-1)-grams. By designing this algorithm with hardware in mind, our approach yields more than an order of magnitude speedup (up to 33x!) over the next known fastest algorithm, even when similar optimizations are applied to that algorithm. Using the empirical power-law distribution over n-grams, we also provide theory to inform the efficacy of our multi-pass approach.
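The multi-pass idea in the abstract, building candidate n-grams only from surviving (n-1)-grams, can be sketched as follows. This is a minimal illustrative sketch of the candidate-generation strategy, not the paper's hardware-optimized implementation: the function name, the byte-window representation of n-grams, and the prefix-pruning rule are assumptions for illustration, and the actual Intergrams algorithm and its accuracy guarantees are detailed in the paper itself.

```python
from collections import Counter

def top_k_ngrams(corpus, n, k):
    """Multi-pass top-k byte n-gram counting (illustrative sketch).

    Pass m counts only those m-grams whose (m-1)-byte prefix survived
    the previous pass, so the candidate set stays small instead of
    growing exponentially in m. `corpus` is a list of byte strings.
    """
    # Pass 1: exact counts of all 1-grams (at most 256 for bytes).
    counts = Counter(
        doc[i:i + 1] for doc in corpus for i in range(len(doc))
    )
    survivors = {g for g, _ in counts.most_common(k)}

    # Passes 2..n: prune candidates via their surviving prefix.
    for m in range(2, n + 1):
        counts = Counter()
        for doc in corpus:
            for i in range(len(doc) - m + 1):
                gram = doc[i:i + m]
                if gram[:-1] in survivors:
                    counts[gram] += 1
        survivors = {g for g, _ in counts.most_common(k)}

    return counts.most_common(k)
```

For example, `top_k_ngrams([b"abababab"], n=2, k=2)` keeps only 2-grams starting with a top-k 1-gram and returns `ab` and `ba` with their counts. Note that pruning strictly by top-k prefix survival is a simplification here; the paper's contribution is making such a multi-pass scheme both deterministic in its recovery and fast on real hardware.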

Published

2026-03-14

How to Cite

Curtin, R. R., Lu, F., Raff, E., & Ranade, P. (2026). Intermediate N-Gramming: Deterministic and Fast N-Grams for Large N and Large Datasets. Proceedings of the AAAI Conference on Artificial Intelligence, 40(17), 14639–14647. https://doi.org/10.1609/aaai.v40i17.38482

Section

AAAI Technical Track on Data Mining & Knowledge Management I