L-Man: A Large Multi-modal Model Unifying Human-centric Tasks
DOI:
https://doi.org/10.1609/aaai.v39i10.33206
Abstract
Large language models (LLMs) have recently shown notable progress in unifying various visual tasks in an open-ended form. However, when transferred to human-centric tasks, despite their remarkable multi-modal understanding in general domains, they lack human-related domain knowledge and show unsatisfactory performance. Meanwhile, current human-centric unified models are mostly restricted to pre-defined output forms and lack open-ended task capability. It is therefore necessary to build a large multi-modal model that leverages LLMs to unify various human-centric tasks. We forge ahead along this path from two aspects: dataset and model. Specifically, we first construct a large-scale language-image instruction-following dataset named HumanIns from 20 existing open datasets spanning 6 diverse downstream tasks, which provides sufficient and diverse data for multi-modal training. We then design a model named L-Man, which includes a query adapter that extracts multi-grained image semantics and aligns cross-modal information between image and text. In practice, we adopt a two-stage training strategy: the first stage extracts generic, text-relevant visual information, and the second stage maps the visual features into the embedding space of the LLM. After tuning on HumanIns, our model significantly outperforms existing large multi-modal models on human-centric tasks, and even surpasses the respective task-specific models on downstream datasets.
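The abstract's query adapter and second-stage mapping can be sketched as a set of query vectors that cross-attend to image patch features and are then projected into the LLM's embedding space. The sketch below is a minimal NumPy illustration of that idea only; all dimensions, names, and the single-head attention form are assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class QueryAdapter:
    """Hypothetical sketch: learnable queries attend over image patch
    features (stage 1: extract text-relevant visual information), and a
    linear projection maps the result into the LLM embedding space
    (stage 2). Dimensions are illustrative assumptions."""

    def __init__(self, num_queries=32, vis_dim=256, llm_dim=512, seed=0):
        rng = np.random.default_rng(seed)
        # Learnable query vectors (would be trained in practice).
        self.queries = rng.standard_normal((num_queries, vis_dim)) * 0.02
        # Stage-2 projection into the LLM's embedding space.
        self.proj = rng.standard_normal((vis_dim, llm_dim)) * 0.02

    def __call__(self, patch_feats):
        # patch_feats: (num_patches, vis_dim) features from a frozen image encoder.
        scores = self.queries @ patch_feats.T / np.sqrt(patch_feats.shape[-1])
        attended = softmax(scores) @ patch_feats   # (num_queries, vis_dim)
        return attended @ self.proj                # (num_queries, llm_dim) soft tokens

adapter = QueryAdapter()
patches = np.random.default_rng(1).standard_normal((196, 256))  # e.g. 14x14 patches
tokens = adapter(patches)
print(tokens.shape)  # (32, 512)
```

The fixed number of query tokens keeps the visual prefix length constant regardless of image resolution, which is one common motivation for adapter designs of this kind.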
Published
2025-04-11
How to Cite
Zuo, J., Nie, Y., Guo, T., Zhang, H., Hong, J., Sang, N., Gao, C., & Han, K. (2025). L-Man: A Large Multi-modal Model Unifying Human-centric Tasks. Proceedings of the AAAI Conference on Artificial Intelligence, 39(10), 11095-11103. https://doi.org/10.1609/aaai.v39i10.33206
Section
AAAI Technical Track on Computer Vision IX