L-Man: A Large Multi-modal Model Unifying Human-centric Tasks
DOI:
https://doi.org/10.1609/aaai.v39i10.33206
Abstract
Large language models (LLMs) have recently shown notable progress in unifying various visual tasks in an open-ended form. However, when transferred to human-centric tasks, despite their remarkable multi-modal understanding in general domains, they lack human-related domain knowledge and show unsatisfactory performance. Meanwhile, current human-centric unified models are mostly restricted to pre-defined output forms and lack open-ended task capability. It is therefore necessary to build a large multi-modal model that leverages LLMs to unify various human-centric tasks. We forge ahead along this path from two aspects: dataset and model. Specifically, we first construct a large-scale language-image instruction-following dataset named HumanIns from 20 existing open datasets spanning 6 diverse downstream tasks, which provides sufficient and diverse data for multi-modal training. We then design a model named L-Man, which includes a query adapter that extracts multi-grained image semantics and aligns cross-modal information between image and text. In practice, we adopt a two-stage training strategy: the first stage extracts generic, text-relevant visual information, and the second stage maps the visual features into the embedding space of the LLM. After tuning on HumanIns, our model significantly outperforms existing large multi-modal models on human-centric tasks, and even surpasses the respective task-specific models on downstream datasets.
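The abstract's query adapter and second-stage mapping can be sketched as a set of query vectors that cross-attend to image patch features and are then projected into the LLM's embedding space. The sketch below is a minimal NumPy illustration of that idea only; all dimensions, names, and the single-head attention form are assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class QueryAdapter:
    """Hypothetical sketch: learnable queries attend over image patch
    features (stage 1: extract text-relevant visual information), and a
    linear projection maps the result into the LLM embedding space
    (stage 2). Dimensions are illustrative assumptions."""

    def __init__(self, num_queries=32, vis_dim=256, llm_dim=512, seed=0):
        rng = np.random.default_rng(seed)
        # Learnable query vectors (would be trained in practice).
        self.queries = rng.standard_normal((num_queries, vis_dim)) * 0.02
        # Stage-2 projection into the LLM's embedding space.
        self.proj = rng.standard_normal((vis_dim, llm_dim)) * 0.02

    def __call__(self, patch_feats):
        # patch_feats: (num_patches, vis_dim) features from a frozen image encoder.
        scores = self.queries @ patch_feats.T / np.sqrt(patch_feats.shape[-1])
        attended = softmax(scores) @ patch_feats   # (num_queries, vis_dim)
        return attended @ self.proj                # (num_queries, llm_dim) soft tokens

adapter = QueryAdapter()
patches = np.random.default_rng(1).standard_normal((196, 256))  # e.g. 14x14 patches
tokens = adapter(patches)
print(tokens.shape)  # (32, 512)
```

The fixed number of query tokens keeps the visual prefix length constant regardless of image resolution, which is one common motivation for adapter designs of this kind.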
Published
2025-04-11
How to Cite
Zuo, J., Nie, Y., Guo, T., Zhang, H., Hong, J., Sang, N., Gao, C., & Han, K. (2025). L-Man: A Large Multi-modal Model Unifying Human-centric Tasks. Proceedings of the AAAI Conference on Artificial Intelligence, 39(10), 11095-11103. https://doi.org/10.1609/aaai.v39i10.33206
Section
AAAI Technical Track on Computer Vision IX