From Natural Language to Executable ETL Flows: The IBM DataStage Assistant

Authors

  • Nitin Gupta, IBM Research
  • Thomas Gschwind, IBM Research
  • Shramona Chakraborty, IBM Research
  • Sameep Mehta, IBM Research
  • Tristan Tyler, IBM Software
  • Shreya Sisodia, IBM Software
  • Ben Clermont, IBM Software

DOI:

https://doi.org/10.1609/aaai.v40i47.41429

Abstract

Modern ETL (Extract, Transform, Load) tools offer graphical, no-code interfaces for workflow creation but still require users to manually identify transformation functions and configure their properties, which is time-consuming and demands prior expertise. We present the research and engineering foundations of the IBM DataStage Assistant, a deployed capability that generates complete multi-stage ETL flows directly from natural language (NL) descriptions. Our framework infers transformation functions, their properties, and transformer expressions, enabling novices to discover relevant functions and allowing experts to bypass manual configuration. The framework achieves a prediction accuracy of 96.4% for flows, 87.0% for properties, and 83.6% for transformer expressions. We also present a document exploration module that uses retrieval-augmented generation (RAG) over product documentation to answer tool-specific questions in NL. Implemented in IBM DataStage, this approach supports iterative, in-environment workflow design and reduces context switching. In initial studies, it achieves up to 90% time savings for novices and 50% for experts.

Published

2026-03-14

How to Cite

Gupta, N., Gschwind, T., Chakraborty, S., Mehta, S., Tyler, T., Sisodia, S., & Clermont, B. (2026). From Natural Language to Executable ETL Flows: The IBM DataStage Assistant. Proceedings of the AAAI Conference on Artificial Intelligence, 40(47), 39949-39957. https://doi.org/10.1609/aaai.v40i47.41429

Issue

Vol. 40 No. 47 (2026)

Section

IAAI Technical Track on Deployed Highly Innovative Applications of AI