Levenshtein Distance Embedding with Poisson Regression for DNA Storage

Authors

  • Xiang Wei, Center for Applied Mathematics, Tianjin University, No. 92, Weijin Road, Tianjin, 300072, China
  • Alan J.X. Guo, Center for Applied Mathematics, Tianjin University, No. 92, Weijin Road, Tianjin, 300072, China
  • Sihan Sun, Center for Applied Mathematics, Tianjin University, No. 92, Weijin Road, Tianjin, 300072, China
  • Mengyi Wei, Center for Applied Mathematics, Tianjin University, No. 92, Weijin Road, Tianjin, 300072, China
  • Wei Yu, China Mobile Research Institute, No. 32, Xuanwumen West Street, Beijing, 100053, China

DOI:

https://doi.org/10.1609/aaai.v38i14.29509

Keywords:

ML: Optimization, APP: Natural Sciences, CSO: Applications, ML: Clustering

Abstract

Efficient computation or approximation of the Levenshtein distance, a widely used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps sequences to vectors so that the Levenshtein distance is approximated by a conventional distance between the embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, Poisson regression is introduced by assuming that the Levenshtein distance between sequences of fixed length follows a Poisson distribution, which naturally aligns with the definition of Levenshtein distance. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log-likelihood of the chi-squared distribution and helps remove its skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.
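
The two quantities the abstract relates can be illustrated with a minimal Python sketch. It computes the exact Levenshtein distance by dynamic programming and evaluates a Poisson negative log-likelihood in which the predicted rate is the squared Euclidean distance between two embedding vectors. The function names, the choice of squared Euclidean distance as the rate, and the `eps` clamp are assumptions made for illustration; the abstract does not specify the paper's network, distance, or exact loss.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Exact Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def squared_euclidean(u, v) -> float:
    """Assumed embedding distance: squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def poisson_nll(rate: float, target: int, eps: float = 1e-8) -> float:
    """Negative log-likelihood of `target` under Poisson(rate).

    `eps` (an illustrative choice) keeps the logarithm finite when the
    predicted rate is zero.
    """
    rate = max(rate, eps)
    return rate - target * math.log(rate) + math.lgamma(target + 1)


# Hypothetical embeddings of two DNA reads, used only to show the call pattern.
u, v = [0.0, 1.0, 2.0], [1.0, 1.0, 3.0]
loss = poisson_nll(squared_euclidean(u, v), levenshtein("ACGTAC", "ACTTAG"))
```

For a fixed target distance, this loss is smallest when the predicted rate equals the target, which is why minimizing it pulls the embedding distance toward the Levenshtein distance.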

Published

2024-03-24

How to Cite

Wei, X., Guo, A. J., Sun, S., Wei, M., & Yu, W. (2024). Levenshtein Distance Embedding with Poisson Regression for DNA Storage. Proceedings of the AAAI Conference on Artificial Intelligence, 38(14), 15796-15804. https://doi.org/10.1609/aaai.v38i14.29509

Issue

Vol. 38 No. 14 (2024)

Section

AAAI Technical Track on Machine Learning V