Rethink Representation Learning for Questionnaire Data

Guanhua Ye; Jifeng He; Yan Li; Junping Du; Zhe Xue; Yingxia Shao; Meiyu Liang; Yawen Li

doi:10.1609/aaai.v40i33.40000

Authors

Guanhua Ye Beijing University of Posts and Telecommunications
Jifeng He Beijing University of Posts and Telecommunications
Yan Li Beijing University of Posts and Telecommunications
Junping Du Beijing University of Posts and Telecommunications
Zhe Xue Beijing University of Posts and Telecommunications
Yingxia Shao Beijing University of Posts and Telecommunications
Meiyu Liang Beijing University of Posts and Telecommunications
Yawen Li Beijing University of Posts and Telecommunications

DOI:

https://doi.org/10.1609/aaai.v40i33.40000

Abstract

Questionnaire data serve as a valuable resource across numerous scientific domains, offering insights into human behavior, health, and social trends. Traditional downsampling-based representation learning methods—such as standardization and one-hot encoding—reformat these data into tabular structures that inherently discard semantic richness and obscure inter-sample and inter-feature relationships. Consequently, advanced deep learning models often underperform compared to simpler approaches like gradient-boosted decision trees (GBDT), due to their limited capacity to extract meaningful representations from semantically sparse inputs. To address this limitation, we introduce SemantiQ, a novel upsampling-based representation learning framework that embeds questionnaire responses into a unified semantic space. Leveraging Retrieval-Augmented Generation (RAG) in conjunction with large language models (LLMs), SemantiQ transforms question text, option text, and external knowledge into semantically enriched natural language statements. These statements are then encoded into semantic embeddings, which are further refined through a three-stage training mechanism and test-time training (TTT), enabling the model to capture complex sample- and feature-wise dependencies. Extensive experiments on multiple real-world datasets demonstrate that SemantiQ consistently outperforms state-of-the-art baselines.

Rethink Representation Learning for Questionnaire Data

Authors

DOI:

Abstract

Downloads

Published

How to Cite

Issue

Section

Information