HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment

Authors

  • Ruijia Wu 1,2
  • Ping Chen 1,2
  • Fei Shen 3
  • Shaoan Zhao 1,2
  • Qiang Hui 1,2
  • Huanlin Gao 1,2
  • Ting Lu 1,2
  • Zhaoxiang Liu 1,2
  • Fang Zhao 1,2
  • Kai Wang 1,2
  • Shiguo Lian 1,2

Affiliations

  1. Data Science & Artificial Intelligence Research Institute, China Unicom
  2. Unicom Data Intelligence, China Unicom
  3. National University of Singapore

DOI:

https://doi.org/10.1609/aaai.v40i32.39910

Abstract

Contrastive vision-language models like CLIP have achieved impressive results in image-text retrieval by aligning image and text representations in a shared embedding space. However, these models often treat text as flat sequences, limiting their ability to handle complex, compositional, and long-form descriptions. In particular, they fail to capture two essential properties of language: semantic hierarchy, which reflects the multi-level compositional structure of text, and semantic monotonicity, whereby richer descriptions should align more strongly with the visual content they describe. To address these limitations, we propose HiMo-CLIP, a representation-level framework that enhances CLIP-style models without modifying the encoder architecture. HiMo-CLIP introduces two key components. First, a hierarchical decomposition (HiDe) module extracts latent semantic components from long-form text via in-batch PCA, enabling flexible, batch-aware alignment across different semantic granularities. Second, a monotonicity-aware contrastive loss (MoLo) jointly aligns global and component-level representations, encouraging the model to internalize semantic ordering and to scale alignment strength with textual completeness. Together, these components produce structured, cognitively aligned cross-modal representations. Experiments on multiple image-text retrieval benchmarks show that HiMo-CLIP consistently outperforms strong baselines, particularly on long or compositional descriptions.
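
The abstract is the only technical description in this record, so the sketch below is a minimal, hypothetical reading of the two components: in-batch PCA standing in for HiDe, and a CLIP-style symmetric InfoNCE loss with an added monotonicity hinge standing in for MoLo. The function names (`hide_components`, `molo_loss`), the rank-1 component projection, and the hinge-with-margin form are all assumptions on our part, not the authors' implementation; consult the paper at the DOI above for the actual method.

```python
import torch
import torch.nn.functional as F

def hide_components(text_emb: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Hypothetical HiDe sketch: find k latent semantic directions of the
    batch's text embeddings via in-batch PCA, then project each embedding
    onto each direction to get component-level representations.

    text_emb: (B, D) text embeddings for one batch, with k <= B assumed.
    Returns:  (B, k, D) per-text component representations.
    """
    # Top-k principal directions of the batch; pca_lowrank centers the
    # data by default and returns V with shape (D, k).
    _, _, V = torch.pca_lowrank(text_emb, q=k)
    coeffs = text_emb @ V                    # (B, k) projection coefficients
    comps = coeffs.unsqueeze(-1) * V.T       # (B, k, D) rank-1 components
    return F.normalize(comps, dim=-1)

def molo_loss(img_emb, txt_emb, comps, tau=0.07, margin=0.05):
    """Hypothetical MoLo sketch: standard symmetric InfoNCE on the global
    embeddings, plus a monotonicity hinge asking each full caption to match
    its image at least `margin` better than any of its partial components.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # Global contrastive term (CLIP-style symmetric InfoNCE).
    logits = img_emb @ txt_emb.T / tau
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    l_global = 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.T, targets))

    # Monotonicity term: sim(image, full text) >= sim(image, component) + margin.
    s_full = (img_emb * txt_emb).sum(-1, keepdim=True)    # (B, 1)
    s_comp = torch.einsum('bd,bkd->bk', img_emb, comps)   # (B, k)
    l_mono = F.relu(margin + s_comp - s_full).mean()

    return l_global + l_mono
```

Read this way, the hinge term encodes the monotonicity intuition directly: removing semantic content from a caption should never increase its similarity to the paired image, so partial (component-level) views are pushed below the full caption's alignment score.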

Published

2026-03-14

How to Cite

Wu, R., Chen, P., Shen, F., Zhao, S., Hui, Q., Gao, H., … Lian, S. (2026). HiMo-CLIP: Modeling Semantic Hierarchy and Monotonicity in Vision-Language Alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 40(32), 26974–26982. https://doi.org/10.1609/aaai.v40i32.39910

Section

AAAI Technical Track on Machine Learning IX