A Robust and Extensible Tool for Data Integration Using Data Type Models

Authors

  • Andres Quiroz, Palo Alto Research Center
  • Eric Huang, Palo Alto Research Center
  • Luca Ceriani, Palo Alto Research Center

DOI:

https://doi.org/10.1609/aaai.v29i2.19060

Abstract

Integrating heterogeneous data sets has been a significant barrier to many analytics tasks: raw data sets vary widely in structure and cleanliness, so each integration typically requires one-off ETL code. We propose HiperFuse, which automates much of the data integration process by providing a declarative interface, robust type inference, extensible domain-specific data models, and a data integration planner that optimizes for plan completion time. The tool is designed for schema-less data querying, code reuse within specific domains, and robustness in the face of messy, unstructured data. To demonstrate the tool and its reference implementation, we show the requirements and execution steps for a use case in which IP addresses from a web clickstream log are joined with census data to obtain the average income for particular site visitors (IPs). We also offer preliminary performance results and qualitative comparisons to existing data integration and ETL tools.
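
The use case in the abstract can be illustrated with a minimal sketch in plain Python. This is not HiperFuse's interface or planner, which the abstract does not specify; it only shows the general idea of inferring which column of a messy, schema-less log holds IP addresses and then joining those IPs to income figures. The function names, the 0.75 threshold, the IP-to-region lookup, and all values below are hypothetical and made up for illustration.

```python
import ipaddress


def looks_like_ip(value):
    """Return True if the string parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value.strip())
        return True
    except ValueError:
        return False


def infer_ip_column(rows, threshold=0.75):
    """Guess which column holds IP addresses in a schema-less table.

    A column qualifies if at least `threshold` of its non-empty values
    parse as IPs, so a few malformed rows do not break inference.
    Returns the best column index, or None if no column qualifies.
    """
    width = max(len(row) for row in rows)
    best, best_score = None, 0.0
    for col in range(width):
        values = [row[col] for row in rows if col < len(row) and row[col].strip()]
        if not values:
            continue
        score = sum(looks_like_ip(v) for v in values) / len(values)
        if score >= threshold and score > best_score:
            best, best_score = col, score
    return best


def income_by_visitor(clicks, ip_to_region, region_income):
    """Join clickstream IPs to per-region average income.

    Rows with unparseable IPs, or IPs with no known region or income,
    are skipped rather than raising an error.
    """
    ip_col = infer_ip_column(clicks)
    if ip_col is None:
        return {}
    result = {}
    for row in clicks:
        if ip_col >= len(row) or not looks_like_ip(row[ip_col]):
            continue
        ip = row[ip_col].strip()
        income = region_income.get(ip_to_region.get(ip))
        if income is not None:
            result[ip] = income
    return result


# Toy inputs (all values are invented; the IPs are documentation-range addresses).
clicks = [
    ["2015-01-25T10:00:00", "192.0.2.10", "/home"],
    ["2015-01-25T10:00:05", "192.0.2.10", "/pricing"],
    ["2015-01-25T10:01:00", "198.51.100.7", "/home"],
    ["2015-01-25T10:02:00", "not-an-ip", "/home"],   # messy row
    ["2015-01-25T10:03:00", "203.0.113.5", "/docs"],  # no region known
]
ip_to_region = {"192.0.2.10": "region_A", "198.51.100.7": "region_B"}
region_income = {"region_A": 60000, "region_B": 45000}

visitor_income = income_by_visitor(clicks, ip_to_region, region_income)
print(visitor_income)
```

In this sketch the join is hard-coded as two dictionary lookups. A declarative tool of the kind the abstract describes would instead let the user state the desired output and leave the choice and ordering of such steps to a planner.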

Published

2015-01-25

How to Cite

Quiroz, A., Huang, E., & Ceriani, L. (2015). A Robust and Extensible Tool for Data Integration Using Data Type Models. Proceedings of the AAAI Conference on Artificial Intelligence, 29(2), 3993-3998. https://doi.org/10.1609/aaai.v29i2.19060