Sequence-Free for Compound Protein Interaction Prediction

Hongzhi Zhang; Jiameng Chen; Kun Li; Yida Xiong; Xiantao Cai; Wenbin Hu; Jia Wu

doi:10.1609/aaai.v40i19.38666

Authors

Hongzhi Zhang School of Computer Science, Wuhan University, Wuhan, China
Jiameng Chen School of Computer Science, Wuhan University, Wuhan, China
Kun Li School of Computer Science, Wuhan University, Wuhan, China
Yida Xiong School of Computer Science, Wuhan University, Wuhan, China
Xiantao Cai School of Computer Science, Wuhan University, Wuhan, China
Wenbin Hu Shenzhen Research Institute, Wuhan University, Shenzhen, China School of Computer Science, Wuhan University, Wuhan, China
Jia Wu Department of Computing, Macquarie University, Sydney, Australia

DOI:

https://doi.org/10.1609/aaai.v40i19.38666

Abstract

The prediction of compound–protein interactions (CPIs) is crucial for drug discovery. Most existing CPI prediction models require protein sequences as input. However, in early-stage drug development, particularly in phenotype-driven studies and compound-response analyses, proteins are often annotated only with functional labels while their sequences remain undetermined, so current methods cannot be applied. Furthermore, our experiments show that the predictive performance of existing models does not decline significantly even when large-scale perturbations are applied to the protein sequences, which suggests that the high cost of sequencing may not bring commensurate returns. To address these issues, we propose BioText-CPI, an inexpensive, protein-sequencing-free framework that predicts CPIs from biomedical textual descriptions of proteins. First, in the pre-training stage, we use contrastive learning to align the protein text and sequence modalities. Next, we add biomedical text descriptions of proteins to the existing public CPI dataset to construct a new CPI dataset. Finally, in the CPI prediction stage, the sequence and the biomedical text description of a protein can be used as input either separately or together, to meet the requirements of different application scenarios. Experiments demonstrate that BioText-CPI achieves performance comparable to traditional methods when only the biomedical description of the protein is provided. Moreover, when both modalities are provided, BioText-CPI achieves state-of-the-art performance across multiple scenarios.
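
The contrastive alignment step mentioned in the abstract can be illustrated with a small sketch. The abstract does not state which loss BioText-CPI uses, so the symmetric InfoNCE-style objective below is an assumption, and the names `symmetric_contrastive_loss`, `cross_entropy_diag`, and the `temperature` value of 0.07 are hypothetical choices made for illustration only. The toy embeddings stand in for the outputs of a text encoder and a sequence encoder.

```python
import math

def normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(text_emb, seq_emb, temperature=0.07):
    """Assumed InfoNCE-style loss: row i of text_emb and row i of seq_emb
    describe the same protein and should be closer than any other pair."""
    t = [normalize(v) for v in text_emb]
    s = [normalize(v) for v in seq_emb]
    # Cosine similarity between every text and every sequence, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(ti, sj)) / temperature for sj in s]
              for ti in t]
    logits_t = [list(col) for col in zip(*logits)]  # sequence-to-text direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy embeddings for three proteins (illustrative values, not real data).
seq_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_text = [row[:] for row in seq_emb]      # each text matches its own sequence
shuffled_text = seq_emb[1:] + seq_emb[:1]       # each text matches the wrong sequence

loss_aligned = symmetric_contrastive_loss(aligned_text, seq_emb)
loss_shuffled = symmetric_contrastive_loss(shuffled_text, seq_emb)
```

In this sketch the loss is close to zero when each text embedding coincides with its paired sequence embedding and large when the pairs are mismatched, which is the property a contrastive pre-training stage would exploit to let text descriptions stand in for sequences at prediction time.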
