CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs

Authors

  • Siyu Wang, School of Electronics Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China; Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
  • Cailian Chen, School of Electronics Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China; Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China; SJTU-Paris Elite Institute of Technology, Shanghai Jiao Tong University, Shanghai, China
  • Xinyi Le, School of Electronics Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China; Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
  • Qimin Xu, School of Electronics Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China; Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
  • Lei Xu, Institute of Cyber Science and Technology, Shanghai Jiao Tong University, Shanghai, China; Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai, China
  • Yanzhou Zhang, School of Electronics Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China; Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
  • Jie Yang, University of Minnesota Twin Cities, Saint Paul, MN, USA

DOI:

https://doi.org/10.1609/aaai.v39i8.32849

Abstract

Computer-aided design (CAD) significantly enhances the efficiency, accuracy, and innovation of design processes by enabling precise 2D and 3D modeling, extensive analysis, and optimization. Existing methods for creating CAD models rely on latent vectors or point clouds, which are difficult to obtain and costly to store. Recent advances in Multimodal Large Language Models (MLLMs) have inspired researchers to use natural language instructions and images for CAD model construction. However, these models still struggle to infer accurate 3D spatial locations and orientations, which leads to errors in the 3D starting points and extrusion directions used to construct geometries. This work introduces CAD-GPT, a CAD synthesis method built on a spatial reasoning-enhanced MLLM that takes either a single image or a textual description as input. To achieve precise spatial inference, our approach introduces a 3D Modeling Spatial Mechanism. This mechanism maps 3D spatial positions and 3D sketch plane rotation angles into a 1D linguistic feature space using a specialized spatial unfolding mechanism, and discretizes 2D sketch coordinates into an appropriate planar space, enabling precise determination of the spatial starting position, sketch orientation, and 2D sketch coordinate translations. Extensive experiments demonstrate that CAD-GPT consistently outperforms existing state-of-the-art methods in CAD model synthesis, both quantitatively and qualitatively.
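
To make the idea of mapping 3D positions into a 1D token space concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a uniform grid with a hypothetical resolution (`GRID`), a hypothetical normalized coordinate range (`LO`, `HI`), row-major flattening, and a made-up token format (`<pos_N>`); the abstract does not specify any of these details.

```python
from typing import Tuple

GRID = 8            # hypothetical number of bins per axis (assumption)
LO, HI = -1.0, 1.0  # hypothetical normalized coordinate range (assumption)


def quantize(value: float, bins: int = GRID, lo: float = LO, hi: float = HI) -> int:
    """Map a continuous value in [lo, hi] to an integer bin in [0, bins - 1]."""
    clamped = min(max(value, lo), hi)
    idx = int((clamped - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)  # the upper bound hi falls into the last bin


def unfold_position(x: float, y: float, z: float, bins: int = GRID) -> int:
    """Flatten a quantized 3D position into a single 1D index (row-major)."""
    ix, iy, iz = quantize(x, bins), quantize(y, bins), quantize(z, bins)
    return (ix * bins + iy) * bins + iz


def fold_position(token_id: int, bins: int = GRID) -> Tuple[int, int, int]:
    """Invert unfold_position: recover the three bin indices from the 1D index."""
    ix, rem = divmod(token_id, bins * bins)
    iy, iz = divmod(rem, bins)
    return ix, iy, iz


def position_token(x: float, y: float, z: float) -> str:
    """Render the 1D index as a vocabulary-style token (hypothetical format)."""
    return f"<pos_{unfold_position(x, y, z)}>"


if __name__ == "__main__":
    token_id = unfold_position(0.0, 0.5, -0.5)
    print(token_id, fold_position(token_id), position_token(0.0, 0.5, -0.5))
```

Under these assumptions, each 3D grid cell corresponds to exactly one token, so a language model can emit a spatial starting position as a single vocabulary item. The same flattening could be applied to discretized rotation angles of the sketch plane.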

Published

2025-04-11

How to Cite

Wang, S., Chen, C., Le, X., Xu, Q., Xu, L., Zhang, Y., & Yang, J. (2025). CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs. Proceedings of the AAAI Conference on Artificial Intelligence, 39(8), 7880–7888. https://doi.org/10.1609/aaai.v39i8.32849

Section

AAAI Technical Track on Computer Vision VII