Towards Synthesizing High-Dimensional Tabular Data with Limited Samples

Authors

  • Zuqing Li The University of Melbourne
  • Junhao Gan The University of Melbourne
  • Jianzhong Qi The University of Melbourne

DOI:

https://doi.org/10.1609/aaai.v40i18.38545

Abstract

Diffusion-based tabular data synthesis models have yielded promising results. However, we observe that as data dimensionality increases, existing models tend to degenerate and may perform even worse than simpler, non-diffusion-based models. This is because limited training samples in a high-dimensional space often hinder generative models from capturing the distribution accurately. To mitigate the insufficient learning signals and to stabilize training under such conditions, we propose CtrTab, a condition-controlled diffusion model that injects perturbed ground-truth samples as auxiliary inputs during training. This design introduces an implicit $L_2$ regularization on the model’s sensitivity to the control signal, improving robustness and stability in high-dimensional, low-data scenarios. Experimental results across multiple datasets show that CtrTab outperforms state-of-the-art models, with an average accuracy gap of more than 90%.
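The link between a perturbed control input and an implicit $L_2$ penalty can be illustrated with a toy example. The sketch below is not CtrTab's architecture or the paper's derivation; it assumes a plain linear predictor w · c in which c stands in for the control signal and w for the model's sensitivity to it. The function names, the Gaussian noise model, and all numbers are hypothetical choices made for this illustration. Under these assumptions, the expected squared error with noise added to c equals the clean squared error plus sigma^2 * ||w||^2, which is an $L_2$ penalty on the sensitivity.

```python
import random


def noisy_control_loss(w, c, y, sigma, n_draws=20000, seed=0):
    """Monte Carlo estimate of E[(w . (c + eps) - y)^2], eps ~ N(0, sigma^2 I).

    Toy stand-in (an assumption, not the paper's model): c is the control
    input and w is the model's sensitivity to it.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        # Perturb every coordinate of the control input with Gaussian noise.
        noisy_c = [ci + rng.gauss(0.0, sigma) for ci in c]
        pred = sum(wi * ci for wi, ci in zip(w, noisy_c))
        total += (pred - y) ** 2
    return total / n_draws


def clean_loss_plus_l2(w, c, y, sigma):
    """Closed form for the linear case: clean loss + sigma^2 * ||w||^2."""
    pred = sum(wi * ci for wi, ci in zip(w, c))
    return (pred - y) ** 2 + sigma ** 2 * sum(wi * wi for wi in w)


# Hypothetical values chosen only for illustration.
w = [0.5, -1.0, 2.0]
c = [1.0, 2.0, 3.0]
y = 4.0
sigma = 0.3

estimate = noisy_control_loss(w, c, y, sigma)
closed_form = clean_loss_plus_l2(w, c, y, sigma)
print(f"Monte Carlo estimate: {estimate:.4f}")
print(f"Clean loss + L2 term: {closed_form:.4f}")
```

The two printed values should agree up to sampling error. For nonlinear models the identity generally holds only approximately, for small noise, so this sketch should be read as intuition for the regularization effect described in the abstract rather than as the authors' analysis.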

Published

2026-03-14

How to Cite

Li, Z., Gan, J., & Qi, J. (2026). Towards Synthesizing High-Dimensional Tabular Data with Limited Samples. Proceedings of the AAAI Conference on Artificial Intelligence, 40(18), 15207-15215. https://doi.org/10.1609/aaai.v40i18.38545

Section

AAAI Technical Track on Data Mining & Knowledge Management II