Scalable Vision-Language Understanding and Generation

Authors

  • Linchao Zhu, Zhejiang University

DOI:

https://doi.org/10.1609/aaai.v39i27.35130

Abstract

Recent advances in vision-language models have shown remarkable potential, yet creating scalable systems that can effectively understand and generate across modalities remains challenging. This talk will present our contributions to advancing scalable vision-language systems, focusing on three key themes: (1) efficient vision-language understanding, including our work on temporal perceiving video-language pre-training and knowledge-enhanced zero-shot retrieval; (2) scalable generation frameworks, encompassing our innovations in zero-shot captioning and co-speech gesture generation; and (3) practical applications and deployments of these technologies. We will discuss how these advances have enabled both better performance and improved efficiency in real-world scenarios, and explore future directions for scalable multimodal systems.

Published

2025-04-11

How to Cite

Zhu, L. (2025). Scalable Vision-Language Understanding and Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(27), 28738–28738. https://doi.org/10.1609/aaai.v39i27.35130