Points Meet Pixels: Bridging 2D Vision-Language Model and 3D Perception Gaps for Point Cloud Quality Assessment

Authors

  • Mingxuan Li, Beijing Institute of Technology
  • Zihao Huang, Beijing Institute of Technology
  • Xiaohui Chu, Beijing Institute of Technology
  • Fazhan Zhang, Beijing Institute of Technology
  • Bohan Fu, Beijing Institute of Technology
  • Runze Hu, Beijing Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v40i8.37561

Abstract

Vision-Language Models (VLMs) have demonstrated significant progress in quality assessment tasks. However, a fundamental paradox arises when applying them to Point Cloud Quality Assessment (PCQA). Existing VLMs, designed for image-text pairs, are inherently incompatible with 3D point cloud data due to the modality gap. While some PCQA research attempts to adapt point clouds to VLMs via 2D projection, this approach inevitably sacrifices crucial spatial structure information essential for accurate quality assessment. Conversely, directly integrating a dedicated 3D branch into a VLM-based PCQA framework introduces feature space misalignment and an influx of quality-insensitive information. To bridge these fundamental conflicts hindering VLMs' adaptation to PCQA, we propose the PMP-PCQA framework, which leverages the inherent mapping relationship between points and pixels to seamlessly apply VLMs to PCQA. Our approach introduces three key innovations: a Spatial Awareness Enhancer (SAE) module that enriches image features with spatial coordinate clues to reinforce geometric awareness in 2D visual representations; a Fine-to-coarse Consistency Alignment (FCA) module that bridges the gap between 2D and 3D modalities by leveraging point-pixel correspondences to construct bridging features; and a Text-Guided Adaptive Miner (TAM) module that dynamically suppresses quality-insensitive features to mine discriminative visual clues for PCQA. Extensive evaluations demonstrate that PMP-PCQA consistently outperforms state-of-the-art methods across multiple benchmarks.
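The abstract's core idea is the point-pixel correspondence obtained when a point cloud is projected onto an image plane. The paper does not give implementation details, so the following is only a minimal illustrative sketch of that mapping, assuming a pinhole camera model and a hypothetical 2D feature map; the function names and the zero-fill for occluded/out-of-view points are our own illustrative choices, not the authors' method.

```python
import numpy as np

def project_points(points, K, image_size):
    """Project 3D points (N, 3) in camera coordinates onto the image plane
    using a pinhole intrinsic matrix K (3, 3). Returns integer pixel
    coordinates (N, 2) as (u, v) and a mask of points that land inside
    the image and in front of the camera."""
    h, w = image_size
    in_front = points[:, 2] > 1e-6            # discard points behind the camera
    uvw = (K @ points.T).T                    # homogeneous image coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]             # perspective divide
    px = np.round(uv).astype(np.int64)
    inside = (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
    return px, in_front & inside

def gather_pixel_features(feat_map, px, valid):
    """Look up a 2D feature map (H, W, C) at each valid point's pixel,
    yielding per-point bridging features; invalid points get zeros."""
    n, c = px.shape[0], feat_map.shape[2]
    out = np.zeros((n, c), dtype=feat_map.dtype)
    p = px[valid]
    out[valid] = feat_map[p[:, 1], p[:, 0]]   # index as (row=v, col=u)
    return out
```

Such a per-point lookup is one straightforward way to pair each 3D point with the 2D visual feature at its projected pixel, which is the kind of correspondence the FCA module's "bridging features" rely on.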

Published

2026-03-14

How to Cite

Li, M., Huang, Z., Chu, X., Zhang, F., Fu, B., & Hu, R. (2026). Points Meet Pixels: Bridging 2D Vision-Language Model and 3D Perception Gaps for Point Cloud Quality Assessment. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6342–6350. https://doi.org/10.1609/aaai.v40i8.37561

Section

AAAI Technical Track on Computer Vision V