Root Cause Analysis in Microservice Using Neural Granger Causal Discovery

Authors

  • Cheng-Ming Lin, National Yang Ming Chiao Tung University
  • Ching Chang, National Yang Ming Chiao Tung University
  • Wei-Yao Wang, National Yang Ming Chiao Tung University
  • Kuang-Da Wang, National Yang Ming Chiao Tung University
  • Wen-Chih Peng, National Yang Ming Chiao Tung University

DOI:

https://doi.org/10.1609/aaai.v38i1.27772

Keywords:

APP: Software Engineering, ML: Unsupervised & Self-Supervised Learning, RU: Causality

Abstract

In recent years, microservices have gained widespread adoption in IT operations due to their scalability, maintainability, and flexibility. However, when a system malfunctions, the complex relationships among microservices make it challenging for site reliability engineers (SREs) to pinpoint the root cause. Previous research employed structure learning methods (e.g., PC-algorithm) to establish causal relationships and derive root causes from causal graphs. Nevertheless, these methods ignored the temporal order of time series data and failed to leverage the rich information inherent in the temporal relationships. For instance, a sudden spike in CPU utilization can lead to an increase in latency for other microservices. In this scenario, the anomaly in CPU utilization occurs before the latency increases, rather than simultaneously, so the PC-algorithm fails to capture this lagged dependency. To address these challenges, we propose RUN, a novel approach for root cause analysis using neural Granger causal discovery with contrastive learning. RUN enhances the backbone encoder by integrating contextual information from time series and leverages a time series forecasting model to conduct neural Granger causal discovery. In addition, RUN incorporates PageRank with a personalization vector to efficiently recommend the top-k root causes. Extensive experiments on synthetic and real-world microservice-based datasets demonstrate that RUN noticeably outperforms state-of-the-art root cause analysis methods. Moreover, we provide an analysis scenario for the sock-shop case to showcase the practicality and efficacy of RUN in microservice-based applications. Our code is publicly available at https://github.com/zmlin1998/RUN.
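
The ranking step mentioned in the abstract, PageRank with a personalization vector, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes a causal graph has already been discovered, that the random walk moves from effects back toward causes, and that the personalization vector holds per-metric anomaly scores. The metric names, edges, scores, and the function name `personalized_pagerank` are all hypothetical.

```python
from collections import defaultdict


def personalized_pagerank(edges, personalization, damping=0.85, iters=200):
    """Power-iteration PageRank whose teleport step follows `personalization`.

    edges: iterable of (src, dst) pairs; the random walker moves src -> dst.
    personalization: dict node -> non-negative weight (normalised here).
    """
    nodes = sorted({n for edge in edges for n in edge} | set(personalization))
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)

    total = sum(personalization.values())
    teleport = {n: personalization.get(n, 0.0) / total for n in nodes}

    rank = dict(teleport)  # start from the teleport distribution
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            if out[n]:
                # Split this node's damped mass evenly over its out-edges.
                share = damping * rank[n] / len(out[n])
                for m in out[n]:
                    nxt[m] += share
            else:
                # Dangling node: return its mass via the teleport distribution.
                for m in nodes:
                    nxt[m] += damping * rank[n] * teleport[m]
        rank = nxt
    return rank


# Hypothetical causal graph (cause -> effect), loosely themed on sock-shop.
causal_edges = [
    ("carts_cpu", "carts_latency"),
    ("carts_latency", "frontend_latency"),
    ("orders_latency", "frontend_latency"),
]
# Assumption: walk from effects back to causes, so reverse every edge.
reversed_edges = [(effect, cause) for cause, effect in causal_edges]

# Assumption: made-up anomaly scores serve as the personalization vector.
anomaly_scores = {
    "carts_cpu": 0.9,
    "carts_latency": 0.7,
    "frontend_latency": 0.6,
    "orders_latency": 0.1,
}

scores = personalized_pagerank(reversed_edges, anomaly_scores)
top_k = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_k)
```

With these made-up inputs, `carts_cpu` ranks first: it both carries the highest anomaly score and sits upstream of the other anomalous metrics, so it collects rank mass from the reversed edges as well as from the teleport step.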

Published

2024-03-25

How to Cite

Lin, C.-M., Chang, C., Wang, W.-Y., Wang, K.-D., & Peng, W.-C. (2024). Root Cause Analysis in Microservice Using Neural Granger Causal Discovery. Proceedings of the AAAI Conference on Artificial Intelligence, 38(1), 206-213. https://doi.org/10.1609/aaai.v38i1.27772

Section

AAAI Technical Track on Application Domains