FoldToken: Learning Protein Language via Vector Quantization and Beyond

Authors

  • Zhangyang Gao, Westlake University; Zhejiang University
  • Cheng Tan, Westlake University; Zhejiang University
  • Jue Wang, Westlake University; Zhejiang University
  • Yufei Huang, Westlake University; Zhejiang University
  • Lirong Wu, Westlake University; Zhejiang University
  • Stan Z. Li, Westlake University

DOI:

https://doi.org/10.1609/aaai.v39i1.31998

Abstract

Is there a foreign language describing protein sequences and structures simultaneously? Protein structures, represented by continuous 3D points, have long posed a modeling challenge because their continuous nature contrasts with the discrete paradigm of sequences. We introduce FoldTokenizer, which represents protein sequence-structure as discrete symbols by projecting residue types and structures into a discrete space, guided by a reconstruction loss for information preservation. We name the learned discrete symbols FoldTokens, and a sequence of FoldTokens serves as a new protein language, transforming protein sequence-structure into a unified modality. We apply this protein language to the general backbone inpainting task, building the first GPT-style model (FoldGPT) for sequence-structure co-generation, with promising results. Key to our success is a substantial enhancement of the vector quantization module: Soft Conditional Vector Quantization (SoftCVQ).
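The abstract names the key module but not its mechanics. As a rough illustration only, the sketch below shows the generic soft vector quantization idea in PyTorch: during training, the encoder output attends softly over a learnable codebook so gradients stay dense; at inference, it snaps to the nearest code, yielding a discrete token id per residue. The class name, shapes, and temperature parameter are assumptions for illustration, not the authors' released SoftCVQ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftVectorQuantizer(nn.Module):
    """Minimal sketch of soft vector quantization (illustrative, not the
    paper's exact SoftCVQ): soft codebook attention while training, hard
    nearest-code assignment at inference."""

    def __init__(self, num_codes: int = 1024, dim: int = 128, temperature: float = 1.0):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # learnable code vectors
        self.temperature = temperature

    def forward(self, z: torch.Tensor):
        # z: (batch, dim) continuous encoder outputs.
        # Negative squared distance to every code serves as attention logits.
        logits = -torch.cdist(z, self.codebook.weight) ** 2 / self.temperature
        tokens = logits.argmax(dim=-1)  # nearest code id (the discrete token)
        if self.training:
            # Soft assignment: convex combination of all codes, so gradients
            # flow to the whole codebook without a straight-through estimator.
            weights = F.softmax(logits, dim=-1)
            z_q = weights @ self.codebook.weight
        else:
            # Hard assignment: quantize to the single nearest code vector.
            z_q = self.codebook(tokens)
        return z_q, tokens
```

The soft-to-hard split is a standard way to train through a discrete bottleneck; how SoftCVQ conditions and schedules the quantization is detailed in the paper itself.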

Published

2025-04-11

How to Cite

Gao, Z., Tan, C., Wang, J., Huang, Y., Wu, L., & Li, S. Z. (2025). FoldToken: Learning Protein Language via Vector Quantization and Beyond. Proceedings of the AAAI Conference on Artificial Intelligence, 39(1), 219–227. https://doi.org/10.1609/aaai.v39i1.31998

Section

AAAI Technical Track on Application Domains