The AI SRE Benchmark.
How AI systems perform on causal root cause analysis, evaluated on RCAEval, a public fault-injection benchmark from ACM Web Conference 2025. Methodology, dataset, results.
89% Top-1: correct root cause on first attempt. Evaluated on 735 fault-injection scenarios across three Kubernetes environments.
42% Top-1: best published academic method on the same dataset. All baselines are peer-reviewed research, not commercial products.
Top-1 improvement over the best published academic baseline on RCAEval. Commercial AIOps platforms have not been independently evaluated.
Causal beats correlational.
Thirteen systems were evaluated on 735 fault-injection scenarios from RCAEval. Top‑1 accuracy is the share of scenarios in which the first-ranked answer matches the injected root cause. Baselines are published academic methods, not commercial products.
| System | Category | Top‑1 | Top‑3 | Top‑5 |
|---|---|---|---|---|
| NOFire (Full Multi-Modal) | NOFire | 89% | 96% | 99% |
| GALA BARO M2 | Academic SOTA | 42% | n/a | n/a |
| GALA BARO M1 | Academic | 38% | n/a | n/a |
| DiagFusion | Academic | 35% | n/a | n/a |
| RCD | Academic | 33% | n/a | n/a |
| CloudRanger | Academic | 31% | n/a | n/a |
| MicroCause | Academic | 28% | n/a | n/a |
| CausalRCA | Academic | 27% | n/a | n/a |
| Nezha | Academic | 9% | n/a | n/a |
Full results and baseline implementations in §03. Dataset: RCAEval (Figshare, CC-BY-4.0). Baselines are academic methods; commercial AIOps platforms were not evaluated.
LLM provider impact.
All models use the same Production Context Graph. Only the LLM varies, except in the final row, which runs Claude Haiku 3.5 without BARO. The graph (not the language model) drives accuracy.
| Model | Top‑1 | Top‑3 | Top‑5 | Time (s) |
|---|---|---|---|---|
| Claude Haiku 3.5 (with BARO) | 89% | 97% | 99% | 21 |
| Claude Sonnet 3.5 (with BARO) | 88% | 96% | 100% | 40 |
| GPT-4o (with BARO) | 88% | 96% | 99% | 25 |
| GPT-4.1-mini (with BARO) | 84% | 97% | 99% | 16 |
| Claude Haiku 3.5 (no BARO) | 70% | 79% | 86% | 38 |
Public dataset. Reproducible harness.
We evaluate on RCAEval, a publicly available fault-injection benchmark, not a proprietary or synthetic dataset. Every result is independently verifiable.
RCAEval: 735 fault-injection scenarios
Published at ACM Web Conference 2025. Fault injection across three Kubernetes environments with real traffic and full observability (metrics, logs, traces). Failure types include CPU throttling, memory leaks, network latency, container crashes, deployment errors, resource exhaustion, and database connection failures. Publicly available on Figshare under CC-BY-4.0. This is a controlled benchmark, not anonymized production incidents.
Identical evaluation conditions
Every system receives the same telemetry inputs. No system sees the injected fault label. Output is the system's ranked hypothesis list. Top‑1 accuracy is the share of scenarios in which the first hypothesis matches the injected root cause. Top‑3 and Top‑5 count a scenario as correct when the injected root cause appears anywhere in the first three or five hypotheses.
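The Top‑k scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark harness: the function name `top_k_accuracy`, the service names, and the four toy scenarios are all hypothetical, and it assumes a hypothesis counts as a match when its string equals the injected root cause exactly.

```python
from typing import Sequence


def top_k_accuracy(
    ranked: Sequence[Sequence[str]],
    truth: Sequence[str],
    k: int,
) -> float:
    """Fraction of scenarios whose injected root cause is in the first k hypotheses."""
    if len(ranked) != len(truth):
        raise ValueError("need one ranked list per scenario")
    if not truth:
        raise ValueError("need at least one scenario")
    # A scenario is a hit when the true cause appears among the top k entries.
    hits = sum(1 for hyps, cause in zip(ranked, truth) if cause in hyps[:k])
    return hits / len(truth)


# Hypothetical toy data: four scenarios, each with a ranked hypothesis list.
ranked = [
    ["cartservice", "checkout", "frontend"],  # correct at rank 1
    ["frontend", "payment", "cartservice"],   # correct at rank 2
    ["redis", "cartservice", "email"],        # correct at rank 3
    ["email", "redis", "shipping"],           # true cause never listed
]
truth = ["cartservice", "payment", "email", "currency"]

top1 = top_k_accuracy(ranked, truth, 1)
top3 = top_k_accuracy(ranked, truth, 3)
print(f"Top-1: {top1:.0%}, Top-3: {top3:.0%}")
```

On this toy data only the first scenario is a Top‑1 hit, while three of the four are Top‑3 hits, which is why Top‑3 and Top‑5 can never be lower than Top‑1 for the same system.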
12 published academic methods
GALA, DiagFusion, RCD, CloudRanger, MicroCause, CausalRCA, PC-PR, BARO, TraceRCA, Granger-PR, Nezha, and one combined GALA BARO variant. All are described in peer-reviewed papers, and implementations follow the published specifications. Commercial AIOps platforms and vendor copilots were not evaluated in this benchmark.
Metrics alone are insufficient
Metrics only: 29% Top-1. Adding logs: 77%. Adding traces: 87%. Full multi-modal with BARO agentic reasoning: 89%. The Production Context Graph supports causal reasoning that individual signal types alone cannot.
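The step-by-step gain from each added signal follows directly from those four figures. The short sketch below only does that subtraction; the variable names are illustrative, and because the source reports whole percentages, each difference is accurate to about one percentage point at best.

```python
# Top-1 accuracy (%) at each ablation stage, as reported in the text above.
ablation = [
    ("metrics only", 29),
    ("+ logs", 77),
    ("+ traces", 87),
    ("+ BARO agentic reasoning", 89),
]

# Gain in percentage points contributed by each stage over the previous one.
gains = [
    (name, current - previous)
    for (_, previous), (name, current) in zip(ablation, ablation[1:])
]

for name, gain in gains:
    print(f"{name}: +{gain} points")
```

Read this way, logs account for the largest single step (48 points), traces add 10, and the final stage adds 2.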
Scripts available on request
Evaluation scripts and baseline implementations are available on request to enable independent verification. The benchmark dataset is publicly available on Figshare. We are working toward open-sourcing the full harness.
What this benchmark does not cover
RCAEval uses controlled fault injection, not real production incidents. Results reflect performance in a reproducible benchmark setting. Causal accuracy does not measure runtime safety, blast-radius prediction, or policy enforcement. MTTR and time-to-resolution claims are not reported here; those require separate production measurement.
BARO method.
The paper “BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection” received the Best Artifact Award at ACM Web Conference 2025. NOFire uses BARO's agentic reasoning methodology as part of its Production Context Graph evaluation pipeline.
Benchmark dataset.
We used the RCAEval benchmark (publicly available on Figshare under CC-BY-4.0) to evaluate NOFire against 12 academic baselines. The dataset is controlled fault-injection, not anonymized production telemetry.
LLM training data.
RCAEval is a public dataset published in 2025. It is possible that frontier LLMs were exposed to it during pre-training. To account for this, we compare NOFire against academic baselines (not LLM-only systems) and vary the LLM provider to isolate the contribution of the Production Context Graph from the language model. Top-1 accuracy stays between 84% and 89% across the four models tested with BARO (89%, 88%, 88%, 84%), which suggests the graph, not any specific model, is the primary accuracy driver.
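The provider comparison can be checked with simple arithmetic on the Top‑1 figures from the LLM provider table. The sketch below is illustrative only (the variable names are not from the harness). It compares the spread caused by swapping the model with the drop caused by removing BARO, the one ablation that table reports. The table has no row that removes the graph itself.

```python
# Top-1 accuracy (%) per model with BARO enabled, from the LLM provider table.
with_baro = {
    "Claude Haiku 3.5": 89,
    "Claude Sonnet 3.5": 88,
    "GPT-4o": 88,
    "GPT-4.1-mini": 84,
}
# Same pipeline with Claude Haiku 3.5 and BARO disabled.
haiku_no_baro = 70

# Spread attributable to the choice of model, in percentage points.
provider_spread = max(with_baro.values()) - min(with_baro.values())
# Drop attributable to removing BARO while holding the model fixed.
baro_gap = with_baro["Claude Haiku 3.5"] - haiku_no_baro

print(f"model swap: {provider_spread} points, removing BARO: {baro_gap} points")
```

Swapping the model moves Top‑1 by at most 5 points across the four models listed, while removing BARO from the same model costs 19 points.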
What we have not tested.
Commercial AIOps platforms and vendor copilots were not evaluated. MTTR improvement is not reported; that requires longitudinal production measurement. Remediation quality, blast-radius prediction, and policy enforcement are evaluated separately.
Panagiotis Moustafellos · Spiros Economakis · Miles Pham · NOFire Research.
RCAEval · ACM Web Conference 2025 · Figshare · CC-BY-4.0
Best Artifact Award, ACM Web Conference 2025. NOFire incorporates BARO's Bayesian reasoning as one component of the Production Context Graph evaluation pipeline.
Full methodology, ablations, per-class results (deploys, dependencies, configs, capacity), dataset access protocol, and the harness reference implementation.