Scalable and Interpretable Data Representation for High-Dimensional, Complex Data

Authors

  • Been Kim, Massachusetts Institute of Technology
  • Kayur Patel, Google
  • Afshin Rostamizadeh, Google
  • Julie Shah, Massachusetts Institute of Technology

DOI:

https://doi.org/10.1609/aaai.v29i1.9474

Abstract

The majority of machine learning research has focused on building models and inference techniques with sound mathematical properties and cutting-edge performance. Little attention has been devoted to developing data representations that improve a user's ability to interpret the data and machine learning models when solving real-world problems. In this paper, we quantitatively and qualitatively evaluate an efficient, accurate, and scalable feature-compression method using latent Dirichlet allocation for discrete data. This representation can effectively communicate the characteristics of high-dimensional, complex data points. We show that, on a number of metrics, the topic modeling-based compression technique yields a statistically significant improvement in users' ability to interpret the data when compared with other representations. We also find that the representation is scalable: it remains aligned with human classification accuracy as an increasing number of data points are shown. In addition, the learned topic layer delivers semantically meaningful information to users, which could aid human reasoning about data characteristics in connection with the compressed topic space.
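
For illustration only (the paper's own implementation is not shown here), the sketch below indicates how a topic modeling-based compression of discrete count data might look, using scikit-learn's LatentDirichletAllocation on synthetic data; the dataset, feature dimensionality, and number of topics are all assumed for the example.

    # Illustrative sketch only (not the paper's code): compress high-dimensional
    # discrete count data into a low-dimensional topic representation with LDA.
    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    rng = np.random.default_rng(0)
    # Hypothetical discrete data: 500 data points, 2,000 count-valued features.
    X = rng.poisson(lam=0.1, size=(500, 2000))

    # Each data point is summarized by a 10-dimensional topic-proportion vector
    # instead of its raw 2,000 counts.
    lda = LatentDirichletAllocation(n_components=10, random_state=0)
    topic_proportions = lda.fit_transform(X)  # shape: (500, 10)

    # The topic-feature weights give each compressed dimension a human-readable
    # meaning: the top-weighted features per topic can be shown to users.
    top_features_per_topic = np.argsort(lda.components_, axis=1)[:, -5:]
    print(topic_proportions.shape, top_features_per_topic.shape)

Presenting the top-weighted features of each topic alongside a point's topic proportions is one way such a representation could convey semantically meaningful information to users, in the spirit of the abstract.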

Published

2015-02-18

How to Cite

Kim, B., Patel, K., Rostamizadeh, A., & Shah, J. (2015). Scalable and Interpretable Data Representation for High-Dimensional, Complex Data. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1). https://doi.org/10.1609/aaai.v29i1.9474

Issue

Vol. 29 No. 1 (2015)

Section

Main Track: Machine Learning Applications