Text-to-Scene with Large Reasoning Models

Authors

  • Frédéric Berdoz, ETH Zurich
  • Luca A. Lanzendörfer, ETH Zurich
  • Nick Tuninga, ETH Zurich
  • Roger Wattenhofer, ETH Zurich

DOI:

https://doi.org/10.1609/aaai.v40i4.37229

Abstract

Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries and object transformations, and tend to show weak adherence to complex instructions. We address these limitations by introducing Reason-3D, a text-to-scene model powered by large reasoning models (LRMs). Reason-3D retrieves objects using captions that cover physical, functional, and contextual attributes. It then places the selected objects according to implicit and explicit layout constraints, and refines their positions with collision-aware spatial reasoning. Evaluated on instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality. Beyond its contribution to the field of text-to-scene generation, our work showcases the advanced spatial reasoning abilities of modern LRMs. Additionally, we release our codebase to further research on object retrieval and placement with LRMs.

Published

2026-03-14

How to Cite

Berdoz, F., Lanzendörfer, L. A., Tuninga, N., & Wattenhofer, R. (2026). Text-to-Scene with Large Reasoning Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(4), 2435-2443. https://doi.org/10.1609/aaai.v40i4.37229

Issue

Section

AAAI Technical Track on Computer Vision I