A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis

Authors

  • Rongchao Zhang, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China
  • Yiwei Lou, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China
  • Dexuan Xu, School of Software & Microelectronics, Peking University, Beijing, China
  • Yongzhi Cao, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China
  • Hanpin Wang, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China
  • Yu Huang, Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, School of Computer Science, Peking University, Beijing, China; National Engineering Research Center for Software Engineering, Peking University, Beijing, China

DOI:

https://doi.org/10.1609/aaai.v38i15.29621

Keywords:

ML: Deep Learning Algorithms, APP: Security, ML: Applications, ML: Deep Neural Architectures and Foundation Models, ML: Privacy

Abstract

Real-world collection and sharing of tabular data is subject to confidentiality and privacy constraints, leaving the potential of machine learning for interventional data analysis largely untapped. Synthetic data has recently emerged as a privacy-preserving solution to this challenge. However, existing approaches treat discrete and continuous modal features as separate entities and thus fall short of capturing their inherent correlations. In this paper, we propose a novel contrastive-learning-guided Gaussian Transformer autoencoder, termed GTCoder, to synthesize photo-realistic multimodal tabular data for scientific research. Our approach introduces a transformer-based fusion module that seamlessly integrates multimodal features, permitting the mining of more informative latent representations. The attention within the fusion module directs the integrated output features to focus on the components most useful for generating latent embeddings. Moreover, we formulate a contrastive learning strategy that implicitly constrains the embeddings of discrete features in the latent feature space, pulling similar discrete feature distributions closer together while pushing dissimilar ones further apart, thereby strengthening the latent representation. Experimental results indicate that GTCoder is effective in generating photo-realistic synthetic data with interpretable latent embeddings, and that it performs favorably against several baselines on most real-world and simulated datasets.
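The fuse-then-contrast idea sketched in the abstract can be illustrated with a minimal NumPy example. This is not the paper's implementation: `attention_fusion` stands in for the transformer-based fusion module with a single scaled dot-product attention step over a discrete-feature token and a continuous-feature token, and `contrastive_loss` is a generic InfoNCE-style objective in which rows sharing a discrete label are treated as positives to be pulled together while all other rows are pushed apart. All function names, shapes, and the temperature value are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(disc_emb, cont_emb):
    """Toy stand-in for the transformer fusion module: treat the
    discrete and continuous embeddings as two tokens per row and
    mix them with one scaled dot-product attention step."""
    tokens = np.stack([disc_emb, cont_emb], axis=1)          # (batch, 2, d)
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)  # (batch, 2, 2)
    weights = softmax(scores, axis=-1)
    fused = weights @ tokens                                  # (batch, 2, d)
    return fused.mean(axis=1)                                 # pooled latent (batch, d)

def contrastive_loss(z, labels, temperature=0.5):
    """Generic InfoNCE-style loss: rows with the same discrete label
    are positives (pulled closer), all other rows are negatives
    (pushed apart) in the latent space."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity space
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = labels[:, None] == labels[None, :]
    np.fill_diagonal(pos, False)
    losses = [-logp[i, pos[i]].mean() for i in range(len(z)) if pos[i].any()]
    return float(np.mean(losses))
```

With this objective, a batch whose latent embeddings cluster by discrete label incurs a lower loss than one where same-label rows are scattered, which is the behavior the abstract's contrastive strategy is designed to encourage.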

Published

2024-03-24

How to Cite

Zhang, R., Lou, Y., Xu, D., Cao, Y., Wang, H., & Huang, Y. (2024). A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis. Proceedings of the AAAI Conference on Artificial Intelligence, 38(15), 16803-16811. https://doi.org/10.1609/aaai.v38i15.29621

Section

AAAI Technical Track on Machine Learning VI