Efficient Multi-Sample Approximate Computing for Scalable Analysis of Massive Distributed Datasets on Resource-Constrained Clusters

Authors

  • Alladoumbaye Ngueilbaye, National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, China; Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, 518060, China
  • Joshua Zhexue Huang, National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, China; Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, 518060, China; Guangdong Laboratory of Artificial Intelligence and Digital Economy, Shenzhen, 518107, China
  • Yongda Cai, National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, China; Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, 518060, China
  • Xudong Sun, College of Management, Shenzhen University, China

DOI:

https://doi.org/10.1609/aaaiss.v6i1.36030

Abstract

The explosive growth of data generated by modern AI applications has created new challenges and opportunities for industry, driving demand for scalable methods to analyze massive datasets stored in distributed systems. However, resource-constrained clusters often struggle to process such datasets because of limited memory and the computational overhead of distributed AI algorithms. This paper proposes efficient multi-sample approximate computing (EMSAC), a novel approach that enables scalable analysis of massive distributed datasets on small clusters with limited memory. EMSAC leverages multiple small random samples, processed in parallel with sequential algorithms, to approximate the analysis of the entire dataset. The approach is implemented in Spark using the LOGO computing framework and addresses three key challenges: (1) efficiently generating multiple small random samples from a massive distributed dataset; (2) converting these data block samples into a partial RSP data model and executing sequential algorithms in parallel on that model to mine frequent itemsets; and (3) aggregating the per-block results to produce the approximate set of frequent itemsets of the full dataset D. To guarantee the quality of the random data block samples, we provide a theoretical bound on the number of data blocks that must be selected from the distributed data file. Empirical evaluations on synthetic and real-world datasets demonstrate that EMSAC outperforms traditional distributed and sampling-based approaches in scalability, accuracy, and computational efficiency. The findings show that EMSAC is well suited to processing massive distributed data and generating accurate approximate frequent itemsets on resource-constrained clusters.
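The three-step pipeline the abstract describes (sample data blocks, mine frequent itemsets per block, aggregate the per-block results) can be sketched in plain Python. This is a minimal conceptual illustration, not the paper's Spark/LOGO implementation: the function names, the restriction to 1- and 2-itemsets, and the majority-vote aggregation rule are all illustrative assumptions.

```python
import random
from collections import Counter
from itertools import combinations

def mine_frequent_itemsets(transactions, min_support):
    """Mine 1- and 2-itemsets whose support (fraction of transactions)
    meets min_support within a single data block sample."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for item in items:
            counts[(item,)] += 1
        for pair in combinations(items, 2):
            counts[pair] += 1
    return {iset: c / n for iset, c in counts.items() if c / n >= min_support}

def emsac_approximate(dataset_blocks, num_samples, min_support, seed=0):
    """Illustrative EMSAC-style pipeline: (1) draw a random sample of data
    blocks, (2) mine each block independently (parallelizable), (3) keep
    itemsets frequent in a majority of samples, averaging their supports."""
    rng = random.Random(seed)
    sampled = rng.sample(dataset_blocks, num_samples)
    per_sample = [mine_frequent_itemsets(b, min_support) for b in sampled]
    votes, support_sums = Counter(), Counter()
    for result in per_sample:
        for iset, s in result.items():
            votes[iset] += 1
            support_sums[iset] += s
    return {iset: support_sums[iset] / votes[iset]
            for iset in votes if votes[iset] > num_samples / 2}
```

In the actual system each per-block call would run as an independent Spark task on one RSP data block, so the cluster never has to materialize the whole dataset in memory at once.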

Published

2025-08-01

How to Cite

Ngueilbaye, A., Huang, J. Z., Cai, Y., & Sun, X. (2025). Efficient Multi-Sample Approximate Computing for Scalable Analysis of Massive Distributed Datasets on Resource-Constrained Clusters. Proceedings of the AAAI Symposium Series, 6(1), 66–66. https://doi.org/10.1609/aaaiss.v6i1.36030

Section

AI in Business: Intelligent Transformation and Management