How we built a production-ready GraphRAG for RCA using AWS RDS

Production-grade root cause analysis and change management intelligence with an AWS-native graph engine

Spiros E.

Founder & CEO


While the AI world was obsessing over large language models, we were solving a deeper infrastructure problem: how to build a Graph Retrieval-Augmented Generation (GraphRAG) system that actually works in production.

Six months before Google published its paper on Graph Foundation Models (GFMs), we had already deployed GraphRAG at scale on AWS. Running against more than 9,000 resources in production, we showed that relational data can be modeled as a graph and that PostgreSQL on AWS RDS (Aurora) can power intelligent root cause analysis in real time.

Here’s how we built the foundation of proactive observability using an AWS-native, graph-first architecture.

1. The $10 million problem: why traditional observability fails at scale

In modern cloud environments, metrics dashboards lie. Incidents aren't isolated — they're deeply relational.

Running thousands of services on AWS, we found that every incident was actually a graph traversal problem:

  • Service A depends on Database B
  • Database B serves Services C, D, E
  • Service C calls External API F
  • External API F affects Checkout Process G

Traditional tools show what changed, not how or why it spread. We traced $10M+ in lost revenue to false leads, slow incident resolution, and repeated regressions.

So we made a fundamental shift:

Operational data is a graph. Incidents are traversals. Root causes are paths.

This insight led us to adopt GraphRAG and take it a step further: we made it production-ready on PostgreSQL, an unconventional yet powerful graph substrate.

2. Why we chose PostgreSQL and AWS RDS over "graph databases"

We considered Neo4j, Neptune, and JanusGraph. But our requirements were production-grade:

  • Stable queries under incident pressure
  • SQL-native access for full-team fluency
  • No appetite for bespoke graph infrastructure

We chose PostgreSQL + pgvector on AWS RDS, and it turned out to be the right engine for the job.

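-- Fast vector similarity for embedding-based node retrieval.
-- <=> is pgvector's cosine-distance operator, so 1 - distance is cosine similarity.
-- :query_vector is a named bind parameter holding the query embedding;
-- substitute your driver's placeholder syntax.
SELECT entity_name, 1 - (embedding <=> :query_vector) AS similarity
FROM observability_entities
ORDER BY embedding <=> :query_vector
LIMIT 10;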

Using SQL recursion (WITH RECURSIVE), we built multi-hop dependency traversals natively in PostgreSQL. These recursive queries now form the heart of our graph-native incident analysis engine.
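To make the recursive traversal concrete, here is a minimal sketch under stated assumptions. The table `dependency_edges`, its columns, the service names, and the helper `dependencies` are all hypothetical. An in-memory SQLite database stands in for PostgreSQL so the example runs without a server. The WITH RECURSIVE query itself should carry over to PostgreSQL largely unchanged, apart from the placeholder syntax of the driver in use.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; schema and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dependency_edges (source TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO dependency_edges VALUES (?, ?)",
    [
        ("checkout", "payments-api"),
        ("payments-api", "orders-db"),
        ("orders-db", "storage-volume"),
        ("search", "search-index"),
    ],
)

# The first SELECT seeds the traversal with direct dependencies; the second
# extends every known node by one hop. The depth cap bounds the recursion,
# which also guards against cycles in the dependency data.
MULTI_HOP = """
WITH RECURSIVE reachable(node, depth) AS (
    SELECT target, 1 FROM dependency_edges WHERE source = ?
    UNION
    SELECT e.target, r.depth + 1
    FROM dependency_edges AS e
    JOIN reachable AS r ON e.source = r.node
    WHERE r.depth < ?
)
SELECT node, MIN(depth) AS hops
FROM reachable
GROUP BY node
ORDER BY hops, node
"""


def dependencies(connection, start, max_hops):
    """Return (node, hops) for everything `start` depends on within max_hops."""
    return connection.execute(MULTI_HOP, (start, max_hops)).fetchall()


rows = dependencies(conn, "checkout", 3)
print(rows)
```

Keeping the hop count in the result lets a caller rank nearer dependencies ahead of distant ones when assembling context for an incident.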

3. Building GraphRAG that actually works in production

Our system combines vector search with graph reasoning:

  • Local Search: Embedding + 1-hop neighbor retrieval
  • Global Patterns: Recursive SQL for multi-hop paths
  • Real-Time Graph Construction: Entities become nodes; logs and dependencies become edges
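The local search step can be sketched in a few lines. This is an illustration only, assuming toy three-dimensional embeddings and an in-memory adjacency list; the names `EMBEDDINGS`, `NEIGHBORS`, and `local_search` are hypothetical, and in a PostgreSQL deployment both lookups would be queries against the database rather than Python dictionaries.

```python
import math

# Hypothetical toy data: 3-dimensional embeddings and an adjacency list.
EMBEDDINGS = {
    "payments-api": [0.9, 0.1, 0.0],
    "orders-db": [0.1, 0.9, 0.0],
    "search": [0.0, 0.1, 0.9],
}
NEIGHBORS = {
    "payments-api": ["orders-db", "checkout"],
    "orders-db": ["storage-volume"],
    "search": ["search-index"],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def local_search(query_vector, top_k=1):
    """Pick the top_k most similar entities, then attach their 1-hop neighbors."""
    ranked = sorted(
        EMBEDDINGS,
        key=lambda name: cosine_similarity(EMBEDDINGS[name], query_vector),
        reverse=True,
    )
    return {seed: NEIGHBORS.get(seed, []) for seed in ranked[:top_k]}


result = local_search([1.0, 0.2, 0.0])
print(result)
```

The embedding match finds where to start, and the neighbor expansion supplies the surrounding structure that a similarity score alone cannot see.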

Performance tuning (vector indexing, path caching, join filtering) allows us to handle live queries across 9000+ resources, all on AWS RDS.

4. GraphRAG vs Traditional RAG: What's the difference?

While traditional RAG retrieves flat documents using vector similarity, GraphRAG combines embeddings with relational paths to retrieve graph-contextual knowledge.

  • RAG = document chunks + LLM
  • GraphRAG = semantic paths + causality chains

This distinction is crucial for incident response, where connections, not content alone, determine insight quality.
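One way to picture the difference is in what reaches the LLM prompt. The sketch below is hypothetical: the function names, the triple format, and the sample identifiers (including the deploy ID) are invented for illustration and are not a description of our prompt format. Chunk context is a bag of independent statements, while path context is a connected chain that must link end to end.

```python
def render_chunks(chunks):
    """Traditional RAG: independent text chunks with no links between them."""
    return "\n".join(f"[chunk {i}] {text}" for i, text in enumerate(chunks, start=1))


def render_path(triples):
    """GraphRAG: a connected chain of (source, relation, target) triples."""
    # Each triple must start where the previous one ended, otherwise the
    # "path" is just a set of unrelated facts.
    for (_, _, target), (next_source, _, _) in zip(triples, triples[1:]):
        if target != next_source:
            raise ValueError(f"path is broken between {target!r} and {next_source!r}")
    return "\n".join(f"{s} -[{rel}]-> {t}" for s, rel, t in triples)


# Illustrative inputs only.
chunks = [
    "orders-db CPU saturated at 14:02",
    "checkout error rate rose at 14:05",
]
path = [
    ("deploy-4821", "changed", "orders-db"),
    ("orders-db", "serves", "payments-api"),
    ("payments-api", "called_by", "checkout"),
]

chunk_prompt = render_chunks(chunks)
path_prompt = render_path(path)
print(chunk_prompt)
print(path_prompt)
```

With the chunk context, the model has to guess how the two observations relate; with the path context, the relationship from change to symptom is stated explicitly.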

5. How GraphRAG uses our PostgreSQL-based knowledge graph

At NOFire AI, our GraphRAG system is built on a continuously evolving knowledge graph, stored entirely in AWS RDS. This graph models the full causal structure of our infrastructure: services, Kubernetes workloads, databases, changes, alerts, investigations, incidents and postmortems.

We continuously ingest events and relationships to keep this graph current, forming the substrate for intelligent reasoning.

But the knowledge graph itself is not the end goal. GraphRAG is how we activate it. It’s the retrieval interface that extracts the most relevant subgraphs for a specific context, whether that’s an incident, a performance regression, change management intelligence or a risky deployment.

Aspect | Knowledge Graph | GraphRAG
Definition | Structured graph of entities, dependencies, and causal metadata | Retrieval pattern that extracts relevant graph slices for a query or task
Purpose | Serves as persistent causal model of infrastructure | Serves as runtime reasoning engine for GenAI, incident triage, change analysis
Storage | Tables + edges + embeddings in PostgreSQL | Queried live from PostgreSQL via vector + recursive SQL
Update Cycle | Continuously updated via pipelines, events, CI/CD data | Constructed per use case: alert, change, user query
Output | Full system graph (all entities and links) | Subgraph most relevant to the current context (e.g. root cause path)
Integration with AI | Supports statistical causality, historical analysis | Powers high-accuracy retrieval for LLM prompting + root cause reasoning

In short, the knowledge graph is the map and GraphRAG is the navigation system that finds the right path when it matters.

6. Google's Graph Foundation Models: The Validation

Google's GFM paper reported 3x–40x gains on ML tasks using graph-structured relational data. It validated what we’d already seen in production:

  • Relational data is best understood as a graph
  • Graph structures outperform flat schemas in ML

While GFMs learn representations, our GraphRAG retrieves insights in real time. Both reject the idea that relational data is "flat" and show that structure is the signal.

7. What Sets Our GraphRAG Architecture Apart

One of the most powerful aspects of our GraphRAG implementation is how it complements Causal AI and Generative AI in our platform. While most systems treat GenAI and causal reasoning as separate workflows, we've fused the two, powered by Causal AI models specifically designed for infrastructure graphs:

  • Causal AI identifies the likely causes and propagation patterns of incidents based on graph structure, topology, and prior behavior.
  • Generative AI turns those insights into actionable output: root cause summaries, recommended fixes, and postmortems.

This feedback loop improves over time and reinforces the principles we’ve explored in our previous writing, including "Why GenAI Alone Won’t Fix Incident Response" and related pieces on Causal AI and observability. In practice, it gives us:

  • High-accuracy retrieval of root causes with explainability
  • Contextual generation grounded in infrastructure relationships
  • Reduced hallucinations from LLMs due to causal anchors

By embedding Causal AI models into our GraphRAG system and grounding GenAI in factual infrastructure relationships, we deliver high-accuracy incident and change management intelligence that is explainable, actionable, and ready for real-time operations.

Three principles drive our edge:

  • Graph-Native Thinking: We model infrastructure as a living, evolving graph.
  • Production-First Architecture: Real-world scale, failover, ingestion — all handled on AWS.
  • GFM-Aligned Vision: Where foundation models generalize, GraphRAG operationalizes.

We didn’t build a foundation model; we built the retrieval substrate that makes such models useful.

8. The Foundation for Proactive Observability and Change Intelligence

Key Results We've Seen with GraphRAG:

  • Reduced false positives by over 65% through causal correlation
  • Cut mean incident response time by 40–60% for high-priority alerts
  • Increased first-touch resolution accuracy by 3x
  • Reduced SRE toil and manual triage across 9000+ resources
  • Accelerated change intelligence workflows by retrieving and tracing relevant changes across teams and data sources up to 80% faster
  • Improved deployment confidence through real-time correlation of code, infra, and config shifts within a unified graph

These metrics signal a shift from reactive troubleshooting to predictive, graph-driven infrastructure intelligence.

GraphRAG enables a step-change in infrastructure maturity:

  • Simulate rollout impact before deployment
  • Detect slow-propagating failures across services
  • Map CI/CD commits to blast radius
  • Align incidents with business KPIs

It acts as the graph memory layer for intent-aware agents, automated mitigation, and next-gen operational readiness. We made causality computable, explainable, and scalable.

9. The Road Ahead: Autonomous Reasoning Is Next

We’re evolving from retrieval to reasoning:

  • Change agents that propose safe deployments
  • Copilots that surface fixes in Slack and Grafana IRM
  • Postmortems written from graph context
  • Simulated root cause analysis using synthetic traces

All grounded in a SQL-powered, AWS-native graph stack.

Final Thought: The Relational Graph Awakening

We didn’t pick PostgreSQL because it was trendy. We picked it because it delivered:

  • High uptime and retrieval SLAs
  • Predictable cost at telemetry scale
  • Full alignment with our AWS-native architecture

The research world is catching up. GraphRAG is now real, and PostgreSQL has proven itself capable of powering complex graph-native workloads in production. It’s the backbone of intelligent infrastructure.

Want to see what graph-native observability looks like in action? Book a demo with NOFire AI
