Implementing Causal Observability for Practical Site Reliability Engineering in Cloud-Native Distributed Systems
Автор: Oreoluwa Omoike
Журнал: International Journal of Wireless and Microwave Technologies @ijwmt
Статья в выпуске: 2 Vol.16, 2026 года.
Бесплатный доступ
This paper presents a Causal Observability Framework designed to enhance the reliability and performance of cloud-native distributed systems through structured integration with the DevOps pipeline. The framework unifies three interdependent components: real-time telemetry collection, dual-domain causal tracing, and probabilistic causal inference. The causal tracing layer combines a time-domain vector autoregressive Granger causality model with a discrete Fourier transform frequency-domain extension. The causal inference layer employs Bayesian network propagation, updated online via the Expectation-Maximisation algorithm, to compute posterior downstream failure probabilities from upstream anomaly observations. Validation was conducted through a controlled, three-replicate experimental study on a seven-service AI-powered recommendation application deployed across a dual-provider six-node Kubernetes cluster (AWS EKS and GCP GKE) under three traffic profiles ranging from 50 to 500 requests per second. Against a conventional threshold-based monitoring baseline, the proposed framework achieved: a 35% reduction in incident response time (70 minutes to 45 minutes), a 40% reduction in mean time to recovery (50 minutes to 30 minutes), a 1.5 percentage-point improvement in system availability (98.0% to 99.5%), a 61% reduction in false-positive alert rate (18% to 7%), and a 63% improvement in root-cause localisation accuracy (54% to 88%). All five improvements were statistically significant at p < 0.05 via paired t-test. A quantified nine-minute early-warning lead time over conventional detection was demonstrated in the fault-injection scenario. Seven formal equations underpin the methodology, spanning Granger vector autoregression, F-test inference, AIC-based lag selection, normalised causality scoring, frequency-domain spectral causality, Bayesian posterior propagation, and expected detection lead time.
Causal Observability, Cloud-Native Systems, Granger Causality, Bayesian Networks, Kubernetes, Site Reliability Engineering, Predictive Monitoring, DevOps, OpenTelemetry, Cybersecurity
Короткий адрес: https://sciup.org/15020258
IDR: 15020258 | DOI: 10.5815/ijwmt.2026.02.02