Fault Injection Based Interventional Causal Learning for Distributed Applications


  • Qing Wang IBM Global Chief Data Office
  • Jesus Rios IBM Research
  • Saurabh Jha IBM Research
  • Karthikeyan Shanmugam Google Research
  • Frank Bagehorn IBM Research
  • Xi Yang IBM Research
  • Robert Filepp IBM Research
  • Naoki Abe IBM Research
  • Larisa Shwartz IBM Research




Keywords: Interventional Causal Learning, Fault Injection, Distributed Applications, Hybrid Cloud


We apply the machinery of interventional causal learning with programmable interventions to the domain of applications management. Modern applications are modularized into interdependent components or services (e.g., microservices) for ease of development and management. The communication graph among such components is determined by the application code and is not always known to the platform provider. Our solution learns this unknown communication graph solely from application logs collected while faults are injected into the running application in a staging environment. Specifically, we have developed an active (or interventional) causal learning algorithm that uses the observations obtained during fault injections to learn a model of error propagation in the communication among the components. The “power of intervention” additionally allows us to address confounding that arises from unobserved user interactions. We demonstrate the effectiveness of our solution in learning the communication graph of well-known microservice application benchmarks. We also show the efficacy of the solution on a downstream task of fault localization, in which the learned graph helps to localize faults at runtime in a production environment, where the location of the fault is unknown. Additionally, we briefly discuss the implementation and deployment status of a fault injection framework that incorporates the developed technology.
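To make the idea of learning an error-propagation graph from fault injections concrete, the following is a minimal illustrative sketch and not the authors' algorithm. It assumes (hypothetically) that each experiment injects a fault into exactly one service, that log analysis has already produced the set of other services showing errors during that experiment, and that error propagation is deterministic and acyclic. Under those assumptions, the direct edges can be recovered by discarding every pair that is explained by an intermediate service. The function name, the input format, and the service names are all invented for illustration.

```python
from typing import Dict, Set, Tuple


def learn_error_propagation_graph(
    affected: Dict[str, Set[str]],
) -> Set[Tuple[str, str]]:
    """Recover direct error-propagation edges from fault-injection results.

    affected[s] is the (hypothetical) set of services whose logs showed
    errors while a fault was injected into service s.
    An edge (s, t) means "errors in s propagate directly to t".
    Assumes deterministic, acyclic propagation.
    """
    # Ignore the injected service itself if it appears in its own result set.
    reach = {s: set(targets) - {s} for s, targets in affected.items()}

    edges: Set[Tuple[str, str]] = set()
    for source, targets in reach.items():
        for target in targets:
            # The pair is indirect if another affected service also reaches
            # the target; only pairs with no such intermediary are kept.
            has_intermediary = any(
                target in reach.get(mid, set())
                for mid in targets
                if mid != target
            )
            if not has_intermediary:
                edges.add((source, target))
    return edges


# Hypothetical three-service application: frontend calls cart, cart calls db,
# so an error injected into db surfaces in cart and then in frontend.
observations = {
    "db": {"cart", "frontend"},
    "cart": {"frontend"},
    "frontend": set(),
}
edges = learn_error_propagation_graph(observations)
print(sorted(edges))
```

Because each set of affected services comes from a deliberate intervention rather than passive observation, correlated errors driven by user traffic alone would not create an edge in this sketch. Real logs are noisy, so a practical approach would need statistical tests in place of the exact set membership used here.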




How to Cite

Wang, Q., Rios, J., Jha, S., Shanmugam, K., Bagehorn, F., Yang, X., Filepp, R., Abe, N., & Shwartz, L. (2023). Fault Injection Based Interventional Causal Learning for Distributed Applications. Proceedings of the AAAI Conference on Artificial Intelligence, 37(13), 15738-15744. https://doi.org/10.1609/aaai.v37i13.26868



IAAI Technical Track on Emerging Applications of AI