Bridging Molecular Graphs and Large Language Models

Authors

  • Runze Wang, Dalian University of Technology
  • Mingqi Yang, National University of Singapore
  • Yanming Shen, Dalian University of Technology

DOI:

https://doi.org/10.1609/aaai.v39i20.35422

Abstract

While Large Language Models (LLMs) have shown exceptional generalization capabilities, their ability to process graph data, such as molecular structures, remains limited. To bridge this gap, this paper proposes Graph2Token, an efficient solution that aligns graph tokens to LLM tokens. The key idea is to represent a graph token with the LLM token vocabulary, without fine-tuning the LLM backbone. To achieve this goal, we first construct a molecule-text paired dataset from multiple sources, including CHEBI and HMDB, to train a graph structure encoder, which reduces the distance between graph and text representations in the feature space. Then, we propose a novel alignment strategy that associates a graph token with LLM tokens. To further unleash the potential of LLMs, we collect molecular IUPAC name identifiers, which are incorporated into the LLM prompts. By aligning molecular graphs as special tokens, we can activate LLMs' generalization ability for molecular few-shot learning. Extensive experiments on molecular classification and regression tasks demonstrate the effectiveness of our proposed Graph2Token.
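The core alignment idea from the abstract, representing a graph token as a combination of frozen LLM vocabulary embeddings, can be sketched as below. This is a minimal illustration only: the encoder output `g`, the dot-product scoring rule, and all dimensions are assumptions for exposition, not the paper's exact alignment strategy.

```python
import numpy as np

rng = np.random.default_rng(0)

d, vocab_size = 64, 1000
E = rng.normal(size=(vocab_size, d))  # frozen LLM token embedding table (hypothetical sizes)
g = rng.normal(size=(d,))             # graph token from a pretrained graph structure encoder

# Score the graph token against every vocabulary embedding, then take a
# softmax so the graph token becomes a convex combination of LLM tokens.
scores = E @ g / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# The aligned token lives in the LLM embedding space and can be spliced
# into a prompt without fine-tuning the LLM backbone.
aligned_token = weights @ E
assert aligned_token.shape == (d,)
```

Because the aligned token is built entirely from existing vocabulary embeddings, the LLM backbone stays frozen; only the graph encoder and the alignment weights would need training.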

Published

2025-04-11

How to Cite

Wang, R., Yang, M., & Shen, Y. (2025). Bridging Molecular Graphs and Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(20), 21234–21242. https://doi.org/10.1609/aaai.v39i20.35422

Section

AAAI Technical Track on Machine Learning VI