Genomics Data Lossless Compression with (S, K)-Mer Encoding and Deep Neural Networks
DOI:
https://doi.org/10.1609/aaai.v39i12.33371
Abstract
Learning-based compression shows competitive compression ratios for genomics data. Such compressors typically fall into three types: static, adaptive, and semi-adaptive. However, existing compressors suffer from inferior compression ratios or throughput, and adaptive compressors also face model cold-start problems. To address these issues, we propose DeepGeCo, a novel genomics data lossless adaptive compression framework with (s,k)-mer encoding and deep neural networks, involving three compression modes (MINI for static, PLUS for adaptive, ULTRA for semi-adaptive) to meet flexible requirements on compression ratio or throughput. In DeepGeCo, (1) we develop BiGRU and Transformer as the backbone to build Warm-Start and Supporter models that address the cold-start problem. (2) We introduce (s,k)-mer encoding to pre-process genomics data before feeding it into the DNN model to improve model throughput, and we propose a new metric, Ranking of Throughput and Compression Ratio (RTCR), for effective selection of encoding parameters. (3) We design a threshold controller and a probabilistic mixer within the backbone to balance compression ratio and model throughput. Experiments on 10 real-world datasets show that DeepGeCo's three compression modes achieve up to a 22.949X improvement in average throughput and up to a 31.095% improvement in average compression ratio while occupying low CPU or GPU memory.
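As a rough illustration of the pre-processing idea in point (2), the sketch below tokenizes a nucleotide string into k-mers. The abstract does not define (s,k)-mer encoding, so this assumes s is the stride between windows and k is the window length. The function names `build_vocab` and `sk_mer_encode` are hypothetical and are not taken from DeepGeCo.

```python
from itertools import product


def build_vocab(k, alphabet="ACGT"):
    """Map every possible k-mer over the alphabet to an integer id (lexicographic order)."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def sk_mer_encode(seq, s, k, vocab):
    """Slide a window of length k over seq with stride s and emit one id per window.

    Assumption: s = stride, k = window length. Trailing bases that do not
    fill a complete window are ignored here and would need separate handling
    in a lossless pipeline.
    """
    if s < 1 or k < 1:
        raise ValueError("s and k must be positive integers")
    return [vocab[seq[i:i + k]] for i in range(0, len(seq) - k + 1, s)]


vocab = build_vocab(2)
non_overlapping = sk_mer_encode("ACGTAC", s=2, k=2, vocab=vocab)  # 3 tokens
overlapping = sk_mer_encode("ACGTAC", s=1, k=2, vocab=vocab)      # 5 tokens
```

Under this assumption, a larger stride yields fewer tokens per sequence, so the model takes fewer prediction steps, while a larger k enlarges the vocabulary to 4^k symbols. This is the kind of throughput versus compression ratio trade-off that a selection metric such as RTCR would need to weigh.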
Published
2025-04-11
How to Cite
Sun, H., Yi, L., Ma, H., Sun, Y., Zheng, Y., Cui, W., … Liu, X. (2025). Genomics Data Lossless Compression with (S, K)-Mer Encoding and Deep Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence, 39(12), 12577–12585. https://doi.org/10.1609/aaai.v39i12.33371
Issue
Section
AAAI Technical Track on Data Mining & Knowledge Management II