SALSA: Semantically-Aware Latent Space Autoencoder

Kathryn E. Kirchoff; Travis Maxfield; Alexander Tropsha; Shawn M. Gomez

doi:10.1609/aaai.v38i12.29221

Authors

Kathryn E. Kirchoff, Department of Computer Science, UNC Chapel Hill
Travis Maxfield, Eshelman School of Pharmacy, UNC Chapel Hill
Alexander Tropsha, Eshelman School of Pharmacy, UNC Chapel Hill
Shawn M. Gomez, Department of Pharmacology, UNC Chapel Hill; Joint Department of Biomedical Engineering at UNC Chapel Hill and NC State University

DOI:

https://doi.org/10.1609/aaai.v38i12.29221

Keywords:

ML: Representation Learning, ML: Unsupervised & Self-Supervised Learning, ML: Deep Generative Models & Autoencoders, ML: Applications

Abstract

In deep learning for drug discovery, molecular representations are often based on sequences, known as SMILES, which allow for straightforward implementation of natural language processing methodologies, one being the sequence-to-sequence autoencoder. However, we observe that training an autoencoder solely on SMILES is insufficient to learn molecular representations that are semantically meaningful, where semantics are specified by the structural (graph-to-graph) similarities between molecules. We demonstrate by example that SMILES-based autoencoders may map structurally similar molecules to distant codes, resulting in an incoherent latent space that does not necessarily respect the semantic similarities between molecules. To address this shortcoming, we propose the Semantically-Aware Latent Space Autoencoder (SALSA) for molecular representations: a SMILES-based transformer autoencoder modified with a contrastive task aimed at learning graph-to-graph similarities between molecules. To accomplish this, we develop a novel dataset composed of sets of structurally similar molecules and opt for a supervised contrastive loss that is able to incorporate full sets of positive samples. We evaluate the semantic awareness of SALSA representations by comparing SALSA to its ablated counterparts, and show empirically that SALSA learns representations that maintain 1) structural awareness, 2) physicochemical awareness, 3) biological awareness, and 4) semantic continuity.
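
The abstract does not give the exact form of the contrastive objective. As an illustration only, the following is a minimal sketch of a commonly used supervised contrastive loss in which every other member of an anchor's set counts as a positive. The function names (normalize, supcon_loss), the temperature value, the use of cosine similarity, and the toy two-dimensional codes are assumptions made for this example and are not taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, set_ids, temperature=0.1):
    """Supervised contrastive loss over a batch of latent codes.

    embeddings: list of equal-length float vectors (hypothetical latent codes).
    set_ids:    one id per embedding; codes sharing an id are treated as
                structurally similar molecules (positives for each other).
    Assumption: this follows the widely used supervised contrastive form,
    not necessarily the exact objective used by SALSA.
    """
    z = [normalize(v) for v in embeddings]
    n = len(z)
    losses = []
    for i in range(n):
        # All other batch members from the same set are positives.
        positives = [p for p in range(n) if p != i and set_ids[p] == set_ids[i]]
        if not positives:
            continue  # an anchor with no positives contributes nothing
        # Temperature-scaled similarity to every other batch member.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability over the full set of positives.
        losses.append(-sum(sims[p] - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two sets of two molecules each.
set_ids = [0, 0, 1, 1]
# Codes where members of the same set sit close together.
coherent = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Codes where members of the same set sit far apart.
incoherent = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_coherent = supcon_loss(coherent, set_ids)
loss_incoherent = supcon_loss(incoherent, set_ids)
```

In this toy batch the loss is lower when members of the same set are mapped to nearby codes, which is the behavior a contrastive task of this kind is meant to encourage in the latent space.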
