GuideGen: A Text-Guided Framework for Paired Full-torso Anatomy and CT Volume Generation

Authors

  • Linrui Dai, Shanghai Jiao Tong University; The University of Tokyo
  • Rongzhao Zhang, Shanghai Artificial Intelligence Laboratory
  • Yongrui Yu, Shanghai Jiao Tong University
  • Xiaofan Zhang, Shanghai Jiao Tong University

DOI:

https://doi.org/10.1609/aaai.v40i5.37344

Abstract

Recently emerging conditional diffusion models seem promising for reducing the labor and expense of building large 3D medical imaging datasets. However, previous studies on 3D CT generation primarily focus on specific organs characterized by a local structure and fixed contrast, and have yet to fully capitalize on the benefits of both semantic and textual conditions. In this paper, we present GuideGen, a controllable framework that uses easily acquired text prompts to generate anatomical masks and corresponding CT volumes for the entire torso, from chest to pelvis. Our approach includes three core components: a text-conditional semantic synthesizer for creating realistic full-torso anatomies; an anatomy-aware high-dynamic-range (HDR) autoencoder for high-fidelity feature extraction across varying intensity levels; and a latent feature generator that ensures alignment between CT images, anatomical semantics, and input prompts. Combined, these components enable data synthesis for segmentation tasks from textual instructions alone. To train and evaluate GuideGen, we compile a multi-modality cancer imaging dataset with paired CT and clinical descriptions from 12 public TCIA datasets and one private real-world dataset. Comprehensive evaluations across generation quality, cross-modality alignment, and data usability on multi-organ and tumor segmentation tasks demonstrate GuideGen's superiority over existing CT generation methods.
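
The data flow among the three components described above can be pictured with a toy sketch. This is not the authors' implementation: every function name, shape, and constant below is a hypothetical placeholder, and the learned models are replaced by random stand-ins that ignore the prompt text. It only illustrates the order in which a prompt might become a mask, then a mask-conditioned latent, then a decoded volume paired with that mask.

```python
import numpy as np

# Hypothetical stand-ins: all names, shapes, and label counts are assumptions
# made for illustration; none come from the GuideGen paper or its code.
NUM_CLASSES = 4               # assumed number of anatomical labels (0 = background)
VOLUME_SHAPE = (8, 16, 16)    # assumed toy (depth, height, width)
LATENT_CHANNELS = 2           # assumed number of latent feature channels


def synthesize_anatomy(prompt: str, rng: np.random.Generator) -> np.ndarray:
    """Stand-in for the semantic synthesizer: prompt -> integer label mask.

    A real model would condition on the prompt; this stub ignores it.
    """
    return rng.integers(0, NUM_CLASSES, size=VOLUME_SHAPE)


def generate_latent(prompt: str, mask: np.ndarray,
                    rng: np.random.Generator) -> np.ndarray:
    """Stand-in for the latent feature generator, conditioned on the mask."""
    noise = rng.standard_normal((LATENT_CHANNELS, *mask.shape))
    # Toy conditioning: shift the noise by the normalized label index.
    return noise + mask[None] / NUM_CLASSES


def decode_latent(latent: np.ndarray) -> np.ndarray:
    """Stand-in for the autoencoder's decoder: latent -> CT-like volume."""
    return latent.mean(axis=0)


def generate_pair(prompt: str, seed: int = 0):
    """Run the three stand-in stages and return a (mask, volume) pair."""
    rng = np.random.default_rng(seed)
    mask = synthesize_anatomy(prompt, rng)
    latent = generate_latent(prompt, mask, rng)
    volume = decode_latent(latent)
    return mask, volume
```

Calling `generate_pair("example prompt")` returns a label mask and a volume of the same shape, which is the property that makes such pairs usable as segmentation training data.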

Published

2026-03-14

How to Cite

Dai, L., Zhang, R., Yu, Y., & Zhang, X. (2026). GuideGen: A Text-Guided Framework for Paired Full-torso Anatomy and CT Volume Generation. Proceedings of the AAAI Conference on Artificial Intelligence, 40(5), 3470–3478. https://doi.org/10.1609/aaai.v40i5.37344

Issue

Section

AAAI Technical Track on Computer Vision II