Genomics Data Lossless Compression with (S, K)-Mer Encoding and Deep Neural Networks

Authors

  • Hui Sun Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China.
  • Liping Yi Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China.
  • Huidong Ma Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China.
  • Yongxia Sun Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China.
  • Yingfeng Zheng Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China.
  • Wenwen Cui Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China.
  • Meng Yan Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China.
  • Gang Wang Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China.
  • Xiaoguang Liu Nankai-Baidu Joint Laboratory (NBJL), College of Computer Science, Nankai University (NKU), Tianjin 300350, China.

DOI:

https://doi.org/10.1609/aaai.v39i12.33371

Abstract

Learning-based compression achieves competitive compression ratios on genomics data. Such compressors fall into three types: static, adaptive, and semi-adaptive. However, existing compressors suffer from inferior compression ratios or throughput, and adaptive compressors also face model cold-start problems. To address these issues, we propose DeepGeCo, a novel lossless adaptive compression framework for genomics data with (s,k)-mer encoding and deep neural networks. It offers three compression modes (MINI for static, PLUS for adaptive, ULTRA for semi-adaptive) to meet flexible requirements on compression ratio or throughput. In DeepGeCo, (1) we use BiGRU and Transformer as the backbone to build Warm-Start and Supporter models that address the cold-start problem. (2) We introduce (s,k)-mer encoding to pre-process genomics data before feeding it into the DNN model to improve model throughput, and we propose a new metric - Ranking of Throughput and Compression Ratio (RTCR) - to select encoding parameters effectively. (3) We design a threshold controller and a probabilistic mixer within the backbone to balance compression ratio and model throughput. Experiments on 10 real-world datasets show that DeepGeCo's three compression modes improve average throughput by up to 22.949X and average compression ratio by up to 31.095%, while occupying little CPU or GPU memory.
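
The abstract does not define (s,k)-mer encoding, so the following is only an illustrative sketch under one plausible reading: a window of k bases slides over the sequence with stride s, and each window is mapped to a single integer token. The function name `sk_mer_encode`, the A/C/G/T-to-integer table, and the base-4 packing are assumptions made for illustration, not the paper's implementation.

```python
from typing import List

# Hypothetical symbol table: an assumption, not taken from the paper.
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}


def sk_mer_encode(seq: str, s: int, k: int) -> List[int]:
    """Map each length-k window, taken every s bases, to a base-4 integer.

    Trailing bases that do not fill a complete window are dropped here;
    a lossless codec would have to store them separately.
    """
    if s <= 0 or k <= 0:
        raise ValueError("s and k must be positive")
    tokens = []
    for start in range(0, len(seq) - k + 1, s):
        value = 0
        for base in seq[start:start + k]:
            # Treat the k-mer as a base-4 number, most significant base first.
            value = value * 4 + BASE_TO_INT[base]
        tokens.append(value)
    return tokens


# Non-overlapping 2-mers (s == k): six bases become three tokens.
print(sk_mer_encode("ACGTAC", s=2, k=2))
```

Under this reading, choosing s equal to k shortens the token sequence by a factor of k, so a sequence model would take fewer prediction steps per base. That would be consistent with the throughput motivation stated above, although the trade-off against a larger vocabulary (4^k symbols) is presumably what a selection metric such as RTCR is meant to weigh.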

Published

2025-04-11

How to Cite

Sun, H., Yi, L., Ma, H., Sun, Y., Zheng, Y., Cui, W., … Liu, X. (2025). Genomics Data Lossless Compression with (S, K)-Mer Encoding and Deep Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence, 39(12), 12577–12585. https://doi.org/10.1609/aaai.v39i12.33371

Section

AAAI Technical Track on Data Mining & Knowledge Management II