SALSA: Semantically-Aware Latent Space Autoencoder

Authors

  • Kathryn E. Kirchoff, Department of Computer Science, UNC Chapel Hill
  • Travis Maxfield, Eshelman School of Pharmacy, UNC Chapel Hill
  • Alexander Tropsha, Eshelman School of Pharmacy, UNC Chapel Hill
  • Shawn M. Gomez, Department of Pharmacology, UNC Chapel Hill; Joint Department of Biomedical Engineering at UNC Chapel Hill and NC State University

DOI:

https://doi.org/10.1609/aaai.v38i12.29221

Keywords:

ML: Representation Learning, ML: Unsupervised & Self-Supervised Learning, ML: Deep Generative Models & Autoencoders, ML: Applications

Abstract

In deep learning for drug discovery, molecular representations are often based on sequences, known as SMILES, which allow for straightforward implementation of natural language processing methodologies, one being the sequence-to-sequence autoencoder. However, we observe that training an autoencoder solely on SMILES is insufficient to learn molecular representations that are semantically meaningful, where semantics are specified by the structural (graph-to-graph) similarities between molecules. We demonstrate by example that SMILES-based autoencoders may map structurally similar molecules to distant codes, resulting in an incoherent latent space that does not necessarily respect the semantic similarities between molecules. To address this shortcoming, we propose the Semantically-Aware Latent Space Autoencoder (SALSA) for molecular representations: a SMILES-based transformer autoencoder modified with a contrastive task aimed at learning graph-to-graph similarities between molecules. To accomplish this, we develop a novel dataset composed of sets of structurally similar molecules and opt for a supervised contrastive loss that is able to incorporate full sets of positive samples. We evaluate the semantic awareness of SALSA representations by comparing SALSA to its ablated counterparts, and show empirically that SALSA learns representations that maintain 1) structural awareness, 2) physicochemical awareness, 3) biological awareness, and 4) semantic continuity.
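
The supervised contrastive loss mentioned in the abstract can be illustrated with a minimal NumPy sketch. This follows the general supervised contrastive formulation, in which every other sample sharing an anchor's set label counts as a positive. It is not the authors' implementation: the function name `supervised_contrastive_loss`, the `temperature` default, and the toy embeddings in the usage note are all assumptions made for illustration, and the sketch omits the autoencoder's reconstruction term entirely.

```python
import numpy as np

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Illustrative supervised contrastive loss over one batch.

    z:      (n, d) array of latent codes (hypothetical encoder outputs).
    labels: (n,) set ids; rows sharing an id are treated as positives,
            e.g. members of one set of structurally similar molecules.
    """
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    n = z.shape[0]

    # Cosine similarity between all pairs, scaled by the temperature.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    # Each anchor is contrasted against every sample except itself.
    not_self = ~np.eye(n, dtype=bool)
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    exp_sim = np.exp(sim) * not_self
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))

    # Positives: same set id, excluding the anchor itself.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    if not has_pos.any():
        return 0.0  # no anchor has a positive in this batch

    # Average log-probability over the full set of positives per anchor.
    pos_log_prob = (log_prob * positives).sum(axis=1)
    mean_log_prob_pos = pos_log_prob[has_pos] / n_pos[has_pos]
    return float(-mean_log_prob_pos.mean())
```

As a usage example, with toy codes `z = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]`, the labels `[0, 0, 1, 1]` (nearby codes share a set) give a lower loss than `[0, 1, 0, 1]` (distant codes share a set), which is the behavior a latent space that respects set membership should produce.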

Published

2024-03-24

How to Cite

Kirchoff, K. E., Maxfield, T., Tropsha, A., & Gomez, S. M. (2024). SALSA: Semantically-Aware Latent Space Autoencoder. Proceedings of the AAAI Conference on Artificial Intelligence, 38(12), 13211-13219. https://doi.org/10.1609/aaai.v38i12.29221

Section

AAAI Technical Track on Machine Learning III