Causation vs correlation in root cause analysis

Correlation-based root cause analysis (RCA) identifies signals that moved together during an incident. Causal RCA traces the dependency graph to find the origin event that triggered the failure chain. The difference matters because correlated signals are often downstream effects, not causes, and fixing the symptom leaves the root cause in place.

The problem with correlation

Correlation tools see: latency spiked on service A at 14:03. A deploy happened at 14:02. They surface the deploy as a suspect. That sounds useful until you learn that three services deployed at 14:02.

The one that caused the problem had a schema migration that triggered retry storms on service B, which then saturated service A's connection pool. Correlation gets you the timestamp. Causation gets you the migration.

This is the structural limitation of correlation-based approaches: they rank suspects by co-occurrence, not by causal mechanism. In a distributed system with concurrent changes and multiple propagation paths, co-occurrence is abundant. The actual cause is one specific path through the graph, and correlation cannot reliably distinguish it from noise.
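To make the co-occurrence problem concrete, here is a minimal sketch of time-window suspect ranking applied to the 14:02/14:03 example. The `Change` type, the `suspects_by_cooccurrence` function, the five-minute window, and the deploy labels are all hypothetical names chosen for illustration, not any particular tool's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    name: str      # hypothetical label for the change
    at: datetime   # when the change landed


def suspects_by_cooccurrence(changes, symptom_at, window=timedelta(minutes=5)):
    """Return changes that landed within `window` before the symptom,
    ordered by how close in time they were (closest first)."""
    in_window = [c for c in changes
                 if timedelta(0) <= symptom_at - c.at <= window]
    return sorted(in_window, key=lambda c: symptom_at - c.at)


# Three deploys at 14:02, latency spike on service A at 14:03.
# The calendar date is arbitrary.
day = datetime(2025, 1, 10)
changes = [
    Change("deploy-1", day.replace(hour=14, minute=2)),
    Change("deploy-2", day.replace(hour=14, minute=2)),
    Change("deploy-3", day.replace(hour=14, minute=2)),
]
symptom_at = day.replace(hour=14, minute=3)

suspects = suspects_by_cooccurrence(changes, symptom_at)
gaps = {symptom_at - c.at for c in suspects}
print(len(suspects), "suspects,", len(gaps), "distinct time gap(s)")
# prints: 3 suspects, 1 distinct time gap(s)
```

All three deploys sit at the same distance from the symptom, so the timestamp alone gives no basis for preferring one over the others.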

How causal modeling works

Causal RCA builds a typed dependency graph that includes services, queues, databases, configuration, and deploys. When an incident occurs, the system replays the graph state at that moment and traces the path from the failure symptom back through the causal chain.

This lets it answer two questions that correlation cannot:

  • Which change propagated to this failure?
  • Through what path did it propagate?

The graph does the work that timestamps alone cannot do. A schema migration is connected to the service that owns the database, which is connected to the services that call it, which are connected to the services that depend on them. Replaying that graph under incident conditions produces a ranked list of hypotheses ordered by causal proximity to the failure, not by temporal coincidence.
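The traversal described above can be sketched as a breadth-first walk upstream from the symptom. This is an illustration under stated assumptions, not NOFire AI's implementation: the node names (`service-a`, `orders-db`, and so on), the `GRAPH` and `CHANGES` structures, and the use of hop count as a stand-in for "causal proximity" are all hypothetical.

```python
from collections import deque

# Hypothetical typed graph: node -> (kind, upstream nodes it depends on).
GRAPH = {
    "service-a": ("service", ["service-b"]),
    "service-b": ("service", ["orders-db"]),
    "orders-db": ("database", []),
    "service-c": ("service", []),   # changed at the same time, not on A's path
    "service-d": ("service", []),   # changed at the same time, not on A's path
}

# Recent changes, keyed by the node they were applied to.
CHANGES = {
    "orders-db": "schema migration",
    "service-c": "deploy",
    "service-d": "deploy",
}


def trace_upstream(symptom, graph, changes):
    """Breadth-first walk upstream from the symptom node.

    Returns (change, path, hops) tuples, nearest first, where `path`
    runs from the changed node to the symptom node.
    """
    parent = {symptom: None}
    queue = deque([(symptom, 0)])
    hypotheses = []
    while queue:
        node, hops = queue.popleft()
        if node in changes:
            # Rebuild the propagation path by following parent pointers.
            path = []
            cursor = node
            while cursor is not None:
                path.append(cursor)
                cursor = parent[cursor]
            hypotheses.append((changes[node], path, hops))
        _kind, upstream = graph[node]
        for dep in upstream:
            if dep not in parent:
                parent[dep] = node
                queue.append((dep, hops + 1))
    return hypotheses


for change, path, hops in trace_upstream("service-a", GRAPH, CHANGES):
    origin = path[0]
    print(f"{change} on {GRAPH[origin][0]} {origin}: "
          f"{' -> '.join(path)} ({hops} hops)")
# prints: schema migration on database orders-db: orders-db -> service-b -> service-a (2 hops)
```

The two unrelated deploys never appear in the result because no dependency edge connects them to the symptom, which is the distinction that timestamps alone cannot draw. A production system would likely weight edges by type and observed behavior instead of counting hops.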

The dependency graph also handles indirect causation. A configuration change that is not itself a service may still be the origin event if it altered the behavior of a component that sits upstream in the failure path. Correlation tools typically cannot surface this because the config change has no latency metric of its own to correlate.

The accuracy gap

The RCAEval benchmark (N=735 fault-injection scenarios, ACM 2025) measures Top-1 accuracy across RCA approaches. Probabilistic and correlation-based approaches reach 17-42% Top-1 accuracy. Causal approaches reach 89%.
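For readers unfamiliar with the metric, Top-1 accuracy is the share of scenarios in which the first hypothesis returned is the true root cause. The sketch below computes the general Top-k form on invented toy data; the function name and the four scenarios are illustrative assumptions and are not drawn from RCAEval.

```python
def top_k_accuracy(ranked_lists, true_causes, k=1):
    """Fraction of scenarios whose true root cause appears
    among the first k hypotheses."""
    if len(ranked_lists) != len(true_causes):
        raise ValueError("one ranked list is needed per scenario")
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(ranked_lists, true_causes))
    return hits / len(true_causes)


# Invented toy data: four scenarios, not benchmark results.
ranked_lists = [
    ["orders-db", "service-c"],
    ["service-c", "orders-db"],
    ["service-b", "service-d"],
    ["service-d", "service-a"],
]
true_causes = ["orders-db", "orders-db", "service-b", "service-a"]

print(top_k_accuracy(ranked_lists, true_causes, k=1))  # 0.5
print(top_k_accuracy(ranked_lists, true_causes, k=2))  # 1.0
```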

That 47-72 percentage point gap is not an academic distinction. In an active incident, wrong hypotheses cost on-call hours. An on-call engineer who receives three wrong suspects before the correct one has spent time reproducing and dismissing each one. At 14:00 on a Friday, that difference is measurable in customer impact.

The benchmark uses fault injection across realistic microservice topologies, which means the scenarios include the multi-hop propagation and concurrent-change conditions where correlation struggles most. The gap is widest in those cases.

NOFire AI's approach to RCA is designed around causal graph traversal. For methodology detail and benchmark results, see the AI SRE Benchmark.

When correlation is good enough

Correlation is not always the wrong tool. For simple, isolated failures with a single causal chain, it often works. A single service goes down, a single deploy preceded it, no concurrent changes: correlation surfaces the right answer quickly.

The gap appears in two specific conditions:

Multiple concurrent changes. When several deploys, config pushes, or schema migrations happen within the same time window, correlation surfaces all of them as suspects. Without a graph to trace propagation, there is no principled way to rank them.

Multi-hop propagation. When a failure in component X causes a failure in component Y, which causes the observable symptom in component Z, correlation sees the symptom in Z and looks for changes near Z. The origin event in X may not correlate with Z's metrics at all, because the propagation path is indirect.

In practice, most production incidents in microservice architectures involve at least one of these conditions. That is why the benchmark gap is large: the benchmark is designed to reflect realistic incident conditions, not the simple single-cause cases where correlation already works.

A reasonable operational posture is to use correlation as a fast first pass for obvious failures, and causal graph traversal as the primary method for anything that does not resolve quickly. The cost of causal RCA is the graph itself: it requires maintained dependency data. The cost of relying only on correlation is a meaningful share of incidents that take longer to resolve than they should.
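One way to picture this posture is a small triage function that tries the time-window pass first and falls back to graph traversal when that pass is ambiguous. This is a sketch under assumptions: the escalation rule used here (exactly one candidate in the window) is a simplification of "does not resolve quickly", and every name, the five-minute window, and the minutes-since-midnight timestamps are hypothetical.

```python
from collections import deque


def changes_in_window(changes, symptom_minute, window=5):
    """Fast first pass: nodes changed up to `window` minutes before the symptom."""
    return [node for node, minute in changes.items()
            if 0 <= symptom_minute - minute <= window]


def upstream_hops(symptom, depends_on):
    """Hop count from the symptom to every node reachable upstream of it."""
    hops = {symptom: 0}
    queue = deque([symptom])
    while queue:
        node = queue.popleft()
        for dep in depends_on.get(node, []):
            if dep not in hops:
                hops[dep] = hops[node] + 1
                queue.append(dep)
    return hops


def triage(symptom, symptom_minute, changes, depends_on):
    """Return (method, suspects).

    Assumed rule: trust correlation only when it yields a single
    candidate; otherwise keep the candidates that lie upstream of
    the symptom, nearest first.
    """
    candidates = changes_in_window(changes, symptom_minute)
    if len(candidates) == 1:
        return "correlation", candidates
    hops = upstream_hops(symptom, depends_on)
    on_path = sorted((c for c in candidates if c in hops), key=hops.get)
    return "causal", on_path


# Hypothetical topology and changes; 842 and 843 are 14:02 and 14:03
# expressed as minutes since midnight.
depends_on = {"service-a": ["service-b"], "service-b": ["orders-db"]}
concurrent = {"orders-db": 842, "service-c": 842, "service-d": 842}

print(triage("service-a", 843, concurrent, depends_on))
# prints: ('causal', ['orders-db'])
print(triage("service-a", 843, {"service-a": 842}, depends_on))
# prints: ('correlation', ['service-a'])
```

The fallback only works if `depends_on` is kept current, which is the maintenance cost noted above.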

See the AI SRE Benchmark for the full methodology and accuracy data across RCA approaches.

Frequently asked questions

What is Top-1 accuracy in RCA?
Top-1 accuracy measures whether the correct root cause is the first hypothesis the tool returns. On RCAEval, random chance is near 1%; a useful tool needs to be reliably above 80%.

Can you do causal RCA without a dependency graph?
Not reliably. Causal inference requires knowing how components connect and how changes propagate. Without the graph, you are doing informed correlation.

Is correlation-based RCA useless?
No. It is fast and works well for simple failure modes. The issue is over-relying on it for complex distributed system failures.