Intermediate N-Gramming: Deterministic and Fast N-Grams for Large N and Large Datasets

Authors

  • Ryan R. Curtin, Booz Allen Hamilton
  • Fred Lu, University of Maryland, Baltimore County
  • Edward Raff, CrowdStrike; University of Maryland, Baltimore County
  • Priyanka Ranade, Laboratory for Physical Sciences

DOI:

https://doi.org/10.1609/aaai.v40i17.38482

Abstract

The number of n-gram features grows exponentially in n, making it computationally demanding to compute the most frequent n-grams even for n as small as 3. Motivated by our production machine learning system built on n-gram features, we ask: is it possible to accurately, deterministically, and quickly recover the top-k most frequent n-grams? We devise a multi-pass algorithm called Intergrams that constructs candidate n-grams from the preceding (n-1)-grams. By designing this algorithm with hardware in mind, our approach yields more than an order of magnitude speedup (up to 33x!) over the next known fastest algorithm, even when similar optimizations are applied to that algorithm. Using the empirical power-law distribution over n-grams, we also provide theory to inform the efficacy of our multi-pass approach.
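The multi-pass idea in the abstract, building candidate n-grams only from surviving (n-1)-grams, can be sketched as follows. This is a minimal illustrative sketch of the candidate-generation strategy, not the paper's hardware-optimized implementation: the function name, the byte-window representation of n-grams, and the prefix-pruning rule are assumptions for illustration, and the actual Intergrams algorithm and its accuracy guarantees are detailed in the paper itself.

```python
from collections import Counter

def top_k_ngrams(corpus, n, k):
    """Multi-pass top-k byte n-gram counting (illustrative sketch).

    Pass m counts only those m-grams whose (m-1)-byte prefix survived
    the previous pass, so the candidate set stays small instead of
    growing exponentially in m. `corpus` is a list of byte strings.
    """
    # Pass 1: exact counts of all 1-grams (at most 256 for bytes).
    counts = Counter(
        doc[i:i + 1] for doc in corpus for i in range(len(doc))
    )
    survivors = {g for g, _ in counts.most_common(k)}

    # Passes 2..n: prune candidates via their surviving prefix.
    for m in range(2, n + 1):
        counts = Counter()
        for doc in corpus:
            for i in range(len(doc) - m + 1):
                gram = doc[i:i + m]
                if gram[:-1] in survivors:
                    counts[gram] += 1
        survivors = {g for g, _ in counts.most_common(k)}

    return counts.most_common(k)
```

For example, `top_k_ngrams([b"abababab"], n=2, k=2)` keeps only 2-grams starting with a top-k 1-gram and returns `ab` and `ba` with their counts. Note that pruning strictly by top-k prefix survival is a simplification here; the paper's contribution is making such a multi-pass scheme both deterministic in its recovery and fast on real hardware.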

Published

2026-03-14

How to Cite

Curtin, R. R., Lu, F., Raff, E., & Ranade, P. (2026). Intermediate N-Gramming: Deterministic and Fast N-Grams for Large N and Large Datasets. Proceedings of the AAAI Conference on Artificial Intelligence, 40(17), 14639–14647. https://doi.org/10.1609/aaai.v40i17.38482

Section

AAAI Technical Track on Data Mining & Knowledge Management I