
The AI SRE Benchmark.

How AI systems perform on causal root cause analysis, evaluated on RCAEval, a public fault-injection benchmark from ACM Web Conference 2025. Methodology, dataset, results.

N · 735 fault-injection scenarios

Baselines · 12 academic methods

Dataset · RCAEval (Figshare, CC-BY-4.0)

Published · ACM Web Conference 2025

89%

NOFire · Top‑1 RCA

Correct root cause on first attempt. Evaluated on 735 fault-injection scenarios across three Kubernetes environments.

42%

Academic SOTA (GALA BARO M2)

Best published academic method on the same dataset. All baselines are peer-reviewed research, not commercial products.

2.1×

vs. academic state-of-the-art

Top-1 improvement over the best published academic baseline on RCAEval. Commercial AIOps platforms have not been independently evaluated.

02 · Results

Causal beats correlational.

Thirteen systems (NOFire plus 12 academic baselines) were evaluated on the 735 fault-injection scenarios in RCAEval. Top‑1 accuracy is the share of scenarios in which the first-ranked answer matches the injected root cause. Baselines are published academic methods, not commercial products.

System	Category	Top‑1	Top‑3	Top‑5
NOFire (Full Multi-Modal)	NOFire	89%	96%	99%
GALA BARO M2	Academic SOTA	42%	n/a	n/a
GALA BARO M1	Academic	38%	n/a	n/a
DiagFusion	Academic	35%	n/a	n/a
RCD	Academic	33%	n/a	n/a
CloudRanger	Academic	31%	n/a	n/a
MicroCause	Academic	28%	n/a	n/a
CausalRCA	Academic	27%	n/a	n/a
Nezha	Academic	9%	n/a	n/a

Full results and baseline implementations in §03. Dataset: RCAEval (Figshare, CC-BY-4.0). Baselines are academic methods; commercial AIOps platforms were not evaluated.
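The 2.1× headline figure can be reproduced from the Top‑1 column of the table above. The following is a minimal sketch that assumes only the percentages printed there; the variable and function names are illustrative and not part of any NOFire tooling.

```python
# Top-1 accuracy (percent), copied from the results table above.
top1 = {
    "NOFire (Full Multi-Modal)": 89,
    "GALA BARO M2": 42,
    "GALA BARO M1": 38,
    "DiagFusion": 35,
    "RCD": 33,
    "CloudRanger": 31,
    "MicroCause": 28,
    "CausalRCA": 27,
    "Nezha": 9,
}

NOFIRE = "NOFire (Full Multi-Modal)"


def improvement_factor(system: str, baseline: str) -> float:
    """Ratio of Top-1 accuracies, system over baseline."""
    return top1[system] / top1[baseline]


# Strongest listed baseline by Top-1 accuracy.
best_baseline = max((name for name in top1 if name != NOFIRE), key=top1.get)
factor = improvement_factor(NOFIRE, best_baseline)
gap_points = top1[NOFIRE] - top1[best_baseline]

print(f"{best_baseline}: {factor:.1f}x, +{gap_points} points")
# prints "GALA BARO M2: 2.1x, +47 points"
```

The ratio (89 / 42 ≈ 2.12) and the absolute gap (47 percentage points) describe the same comparison; the ratio is the figure quoted in the headline.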

LLM provider impact.

All models use the same Production Context Graph. Only the LLM varies between rows, except for the last row, which also removes BARO. The results suggest that the graph, not the language model, drives accuracy.

Model	Top‑1	Top‑3	Top‑5	Time (s)
Claude Haiku 3.5 (with BARO)	89%	97%	99%	21
Claude Sonnet 3.5 (with BARO)	88%	96%	100%	40
GPT-4o (with BARO)	88%	96%	99%	25
GPT-4.1-mini (with BARO)	84%	97%	99%	16
Claude Haiku 3.5 (no BARO)	70%	79%	86%	38
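Two quantities summarize the table above: how much Top‑1 accuracy varies across models when BARO is enabled, and how much it changes for the one model reported both with and without BARO. The following is a minimal sketch that uses only the figures in the table; the tuple layout and variable names are assumptions made for illustration.

```python
# (model, uses_baro, top1 %, top3 %, top5 %, time in seconds), from the table above.
rows = [
    ("Claude Haiku 3.5", True, 89, 97, 99, 21),
    ("Claude Sonnet 3.5", True, 88, 96, 100, 40),
    ("GPT-4o", True, 88, 96, 99, 25),
    ("GPT-4.1-mini", True, 84, 97, 99, 16),
    ("Claude Haiku 3.5", False, 70, 79, 86, 38),
]

# Spread of Top-1 across models when BARO is enabled.
with_baro = [top1 for _, uses_baro, top1, *_ in rows if uses_baro]
model_spread = max(with_baro) - min(with_baro)

# Top-1 change for the only model reported both with and without BARO.
haiku = {uses_baro: top1 for model, uses_baro, top1, *_ in rows
         if model == "Claude Haiku 3.5"}
baro_gap = haiku[True] - haiku[False]

print(f"Top-1 spread across models (with BARO): {model_spread} points")
print(f"Top-1 change from removing BARO (Haiku): {baro_gap} points")
```

On these numbers, the spread across models is 5 percentage points (84% to 89%), and removing BARO from Claude Haiku 3.5 lowers Top‑1 by 19 points.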

03 · Methodology

Public dataset. Reproducible harness.

We evaluate on RCAEval, a publicly available fault-injection benchmark, not a proprietary or synthetic dataset. Every result is independently verifiable.

3.1 · Dataset

RCAEval: 735 fault-injection scenarios

Published at ACM Web Conference 2025. Fault injection across three Kubernetes environments with real traffic and full observability (metrics, logs, traces). Failure types include CPU throttling, memory leaks, network latency, container crashes, deployment errors, resource exhaustion, and database connection failures. Publicly available on Figshare under CC-BY-4.0. This is a controlled benchmark, not anonymized production incidents.

3.2 · Harness

Identical evaluation conditions

Every system receives the same telemetry inputs. No system sees the injected fault label. Output is the system's ranked hypothesis list. Top‑1 accuracy is the share of scenarios in which the first hypothesis matches the injected root cause. Top‑3 and Top‑5 count a scenario as correct when the injected root cause appears among the first three or five hypotheses.
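Top‑k scoring over ranked hypothesis lists can be sketched as follows. This is an illustrative implementation, not the NOFire harness: the function name, the exact-string matching rule, and the toy service names are all assumptions, since the page does not specify how a hypothesis is matched to the injected fault.

```python
from typing import Sequence


def top_k_accuracy(
    ranked: Sequence[Sequence[str]],
    truth: Sequence[str],
    k: int,
) -> float:
    """Share of scenarios whose true root cause is among the first k hypotheses."""
    if len(ranked) != len(truth):
        raise ValueError("need one ranked list per scenario")
    if not truth:
        raise ValueError("need at least one scenario")
    hits = sum(1 for hyps, root in zip(ranked, truth) if root in hyps[:k])
    return hits / len(truth)


# Hypothetical toy data: four scenarios, three ranked hypotheses each.
ranked = [
    ["cartservice", "redis", "frontend"],
    ["frontend", "checkoutservice", "cartservice"],
    ["emailservice", "paymentservice", "checkoutservice"],
    ["adservice", "frontend", "redis"],
]
truth = ["cartservice", "checkoutservice", "checkoutservice", "currencyservice"]

print(top_k_accuracy(ranked, truth, k=1))  # 0.25
print(top_k_accuracy(ranked, truth, k=3))  # 0.75
```

Because each Top‑k set contains the Top‑(k−1) set, accuracy can only stay flat or rise as k grows, which is why Top‑3 and Top‑5 are never below Top‑1 in the tables.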

3.3 · Baselines

12 published academic methods

GALA, DiagFusion, RCD, CloudRanger, MicroCause, CausalRCA, PC-PR, BARO, TraceRCA, Granger-PR, Nezha, and one combined GALA BARO variant. All are peer-reviewed published papers. Implementations follow published specifications. Commercial AIOps platforms and vendor copilots were not evaluated in this benchmark.

3.4 · Observability context

Metrics alone are insufficient

Metrics only: 29% Top‑1. Metrics plus logs: 77%. Metrics, logs, and traces: 87%. Full multi-modal with BARO agentic reasoning: 89%. The Production Context Graph supports causal reasoning that individual signal types alone cannot.
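The contribution of each step in this ablation can be read off as the difference between consecutive Top‑1 figures. The following is a minimal sketch that uses only the four percentages above; the step labels are shorthand introduced here, and the steps are assumed to be cumulative.

```python
# Cumulative ablation steps and their Top-1 accuracy (percent), from the text above.
ablation = [
    ("metrics only", 29),
    ("+ logs", 77),
    ("+ traces", 87),
    ("+ BARO agentic reasoning", 89),
]

# Gain in percentage points contributed by each added step.
gains = [
    (name, current - previous)
    for (_, previous), (name, current) in zip(ablation, ablation[1:])
]

for name, gain in gains:
    print(f"{name}: +{gain} points")
```

On these figures, logs account for the largest single step (+48 points), followed by traces (+10) and BARO agentic reasoning (+2).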

3.5 · Reproducibility

Scripts available on request

Evaluation scripts and baseline implementations are available on request to enable independent verification. The benchmark dataset is publicly available on Figshare. We are working toward open-sourcing the full harness.

3.6 · Limitations

What this benchmark does not cover

RCAEval uses controlled fault injection, not real production incidents. Results reflect performance in a reproducible benchmark setting. Causal accuracy does not measure runtime safety, blast-radius prediction, or policy enforcement. MTTR and time-to-resolution claims are not reported here; those require separate production measurement.

04 · Disclosure

BARO method.

The paper “BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection” received the Best Artifact Award at ACM Web Conference 2025. NOFire uses BARO's agentic reasoning methodology as part of its Production Context Graph evaluation pipeline.

Benchmark dataset.

We used the RCAEval benchmark (publicly available on Figshare under CC-BY-4.0) to evaluate NOFire against 12 academic baselines. The dataset is controlled fault-injection, not anonymized production telemetry.

LLM training data.

RCAEval is a public dataset published in 2025. It is possible that frontier LLMs were exposed to it during pre-training. To account for this, we compare NOFire against academic baselines (not LLM-only systems) and vary the LLM to isolate the contribution of the Production Context Graph from the language model. The consistency of Top‑1 accuracy across models with BARO (89%, 88%, 88%, and 84%) suggests the graph, not any specific model, is the primary accuracy driver.

What we have not tested.

Commercial AIOps platforms and vendor copilots were not evaluated. MTTR improvement is not reported; that requires longitudinal production measurement. Remediation quality, blast-radius prediction, and policy enforcement are evaluated separately.

Authors

Panagiotis Moustafellos · Spiros Economakis · Miles Pham · NOFire Research.

Dataset

RCAEval · ACM Web Conference 2025 · Figshare · CC-BY-4.0

BARO

Best Artifact Award, ACM Web Conference 2025. NOFire incorporates BARO's Bayesian reasoning as one component of the Production Context Graph evaluation pipeline.

Inside the report

Full methodology, ablations, per-class results (deploys, dependencies, configs, capacity), dataset access protocol, and the harness reference implementation.

See the runtime model behind the numbers.

A 30‑minute call with a founder. Same Production Context Graph that scored 89% on RCAEval, mapped to your stack.
