Efficient Multi-Sample Approximate Computing for Scalable Analysis of Massive Distributed Datasets on Resource-Constrained Clusters

Authors

  • Alladoumbaye Ngueilbaye, National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, China; Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, 518060, China
  • Joshua Zhexue Huang, National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, China; Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, 518060, China; Guangdong Laboratory of Artificial Intelligence and Digital Economy, Shenzhen, 518107, China
  • Yongda Cai, National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, China; Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, 518060, China
  • Xudong Sun, College of Management, Shenzhen University, China

DOI:

https://doi.org/10.1609/aaaiss.v6i1.36030

Abstract

The explosive growth of data generated by modern AI applications has created new challenges and opportunities for industry, driving demand for scalable methods to analyze massive datasets stored in distributed systems. However, resource-constrained clusters often struggle to process such datasets because of limited memory and the computational overhead of distributed AI algorithms. This paper proposes efficient multi-sample approximate computing (EMSAC), a novel approach that enables scalable analysis of massive distributed datasets on small clusters with limited memory. EMSAC leverages multiple small random samples, processed in parallel with sequential algorithms, to approximate the analysis of the entire dataset. The approach is implemented in Spark using the LOGO computing framework and addresses three key challenges: (1) efficiently generating multiple small random samples from a massive distributed dataset; (2) converting these data block samples into a partial RSP data model and executing sequential algorithms in parallel on that model to mine frequent itemsets; and (3) aggregating the per-block results to produce the approximate set of frequent itemsets of the full dataset D. To guarantee the quality of the random data block samples, we provide a theoretical bound on the number of data blocks that must be selected from the distributed data file. Empirical evaluations on synthetic and real-world datasets demonstrate that EMSAC outperforms traditional distributed and sampling-based approaches in scalability, accuracy, and computational efficiency. The findings show that EMSAC is well suited to processing massive distributed data and generating accurate approximate frequent itemsets on resource-constrained clusters.
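The three-step pipeline the abstract describes (sample data blocks, mine frequent itemsets per block, aggregate the per-block results) can be sketched in plain Python. This is a minimal conceptual illustration, not the paper's Spark/LOGO implementation: the function names, the restriction to 1- and 2-itemsets, and the majority-vote aggregation rule are all illustrative assumptions.

```python
import random
from collections import Counter
from itertools import combinations

def mine_frequent_itemsets(transactions, min_support):
    """Mine 1- and 2-itemsets whose support (fraction of transactions)
    meets min_support within a single data block sample."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for item in items:
            counts[(item,)] += 1
        for pair in combinations(items, 2):
            counts[pair] += 1
    return {iset: c / n for iset, c in counts.items() if c / n >= min_support}

def emsac_approximate(dataset_blocks, num_samples, min_support, seed=0):
    """Illustrative EMSAC-style pipeline: (1) draw a random sample of data
    blocks, (2) mine each block independently (parallelizable), (3) keep
    itemsets frequent in a majority of samples, averaging their supports."""
    rng = random.Random(seed)
    sampled = rng.sample(dataset_blocks, num_samples)
    per_sample = [mine_frequent_itemsets(b, min_support) for b in sampled]
    votes, support_sums = Counter(), Counter()
    for result in per_sample:
        for iset, s in result.items():
            votes[iset] += 1
            support_sums[iset] += s
    return {iset: support_sums[iset] / votes[iset]
            for iset in votes if votes[iset] > num_samples / 2}
```

In the actual system each per-block call would run as an independent Spark task on one RSP data block, so the cluster never has to materialize the whole dataset in memory at once.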

Published

2025-08-01

How to Cite

Ngueilbaye, A., Huang, J. Z., Cai, Y., & Sun, X. (2025). Efficient Multi-Sample Approximate Computing for Scalable Analysis of Massive Distributed Datasets on Resource-Constrained Clusters. Proceedings of the AAAI Symposium Series, 6(1), 66–66. https://doi.org/10.1609/aaaiss.v6i1.36030

Section

AI in Business: Intelligent Transformation and Management