How we built a production-ready GraphRAG for RCA using AWS RDS

While the AI world was obsessing over large language models, we were solving a deeper infrastructure problem: how to build a Graph Retrieval-Augmented Generation (GraphRAG) system that actually works in production.

Six months before Google published its paper on Graph Foundation Models (GFMs), we had already deployed GraphRAG at scale on AWS. With over 9000 resources in production, we proved that relational data is a graph, and that PostgreSQL on AWS RDS (Aurora) could power intelligent root cause analysis in real time.

Here’s how we built the foundation of proactive observability using an AWS-native, graph-first architecture.

1. The $10 million problem: why traditional observability fails at scale

In modern cloud environments, metrics dashboards lie. Incidents aren't isolated — they're deeply relational.

Running thousands of services on AWS, we found that every incident was actually a graph traversal problem:

Service A depends on Database B
Database B serves Services C, D, E
Service C calls External API F
External API F affects Checkout Process G

Traditional tools show what changed, not how or why it spread. We traced $10M+ in lost revenue to false leads, slow incident resolution, and repeated regressions.

So we made a fundamental shift:

Operational data is a graph. Incidents are traversals. Root causes are paths.

This insight led us to adopt GraphRAG and then take it further: we made it production-ready on PostgreSQL, an unconventional yet powerful graph substrate.
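To make "root causes are paths" concrete, here is a minimal Python sketch that encodes the example dependency chain from this section as typed edges and finds the chain of relationships linking a symptom to a suspect. The edge list comes from the example; the function names and the choice to ignore edge direction during traversal are simplifying assumptions for illustration, not a description of our production engine.

```python
from collections import deque

# Typed edges copied from the example in this section: (source, relation, target).
EDGES = [
    ("Service A", "depends on", "Database B"),
    ("Database B", "serves", "Service C"),
    ("Database B", "serves", "Service D"),
    ("Database B", "serves", "Service E"),
    ("Service C", "calls", "External API F"),
    ("External API F", "affects", "Checkout Process G"),
]


def build_adjacency(edges):
    """Index edges by both endpoints so traversal can ignore direction (an assumption)."""
    adjacency = {}
    for source, _relation, target in edges:
        adjacency.setdefault(source, []).append(target)
        adjacency.setdefault(target, []).append(source)
    return adjacency


def shortest_path(edges, start, goal):
    """Breadth-first search; returns the list of nodes from start to goal, or None."""
    adjacency = build_adjacency(edges)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


path = shortest_path(EDGES, "Checkout Process G", "Service A")
print(" -> ".join(path))
```

Starting from the symptom (Checkout Process G), the search walks back through External API F, Service C, and Database B to reach Service A. This is the kind of path a dashboard of isolated metrics does not show.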

2. Why we chose PostgreSQL and AWS RDS over "graph databases"

We considered Neo4j, Neptune, and JanusGraph. But our requirements were production-grade:

Stable queries under incident pressure
SQL-native access for full-team fluency
No appetite for bespoke graph infrastructure

We chose PostgreSQL + pgvector on AWS RDS and it turned out to be the perfect engine.

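-- Fast vector similarity for embedding-based node retrieval (pgvector).
-- <=> is pgvector's cosine distance operator, so 1 - distance is cosine similarity.
-- $1 is the query embedding, bound by the caller as a vector parameter.
SELECT entity_name, 1 - (embedding <=> $1::vector) AS similarity
FROM observability_entities
ORDER BY embedding <=> $1::vector
LIMIT 10;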

Using SQL recursion (WITH RECURSIVE), we built multi-hop dependency traversals natively in PostgreSQL. Today, it forms the heart of our graph-native incident analysis engine.
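As an illustration of the WITH RECURSIVE pattern, the sketch below runs a bounded multi-hop traversal against an in-memory SQLite database from Python, so it can be tried without a PostgreSQL server. SQLite is swapped in purely for runnability; the table name `dependency_edges`, its columns, and the sample rows are hypothetical and are not our production schema. The same query shape should carry over to PostgreSQL, although parameter placeholders and type casts would differ.

```python
import sqlite3

# Hypothetical edge table: a failure in source_entity can reach target_entity.
SCHEMA = """
CREATE TABLE dependency_edges (
    source_entity TEXT NOT NULL,
    target_entity TEXT NOT NULL
)
"""

# Recursive CTE: start at one entity and follow edges outward.
# The hop limit bounds the recursion, so cycles in the graph cannot loop forever.
TRAVERSAL_SQL = """
WITH RECURSIVE downstream(entity, hops) AS (
    SELECT :start, 0
    UNION
    SELECT e.target_entity, d.hops + 1
    FROM dependency_edges AS e
    JOIN downstream AS d ON e.source_entity = d.entity
    WHERE d.hops < :max_hops
)
SELECT entity, MIN(hops) AS min_hops
FROM downstream
WHERE entity <> :start
GROUP BY entity
ORDER BY min_hops, entity
"""


def downstream_entities(conn, start, max_hops):
    """Return (entity, min_hops) rows reachable from start within max_hops."""
    cursor = conn.execute(TRAVERSAL_SQL, {"start": start, "max_hops": max_hops})
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO dependency_edges VALUES (?, ?)",
    [
        ("db-b", "svc-a"),
        ("db-b", "svc-c"),
        ("svc-c", "api-f"),
        ("api-f", "checkout-g"),
        ("checkout-g", "svc-c"),  # deliberate cycle to exercise the hop limit
    ],
)

for entity, hops in downstream_entities(conn, "db-b", max_hops=2):
    print(entity, hops)
```

Keeping the hop count in the result makes it easy to rank candidates by distance from the starting entity, and the explicit limit keeps query cost predictable under incident pressure.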

3. Building GraphRAG that actually works in production

Our system combines vector search with graph reasoning:

Local Search: Embedding + 1-hop neighbor retrieval
Global Patterns: Recursive SQL for multi-hop paths
Real-Time Graph Construction: Entities become nodes; logs and dependencies become edges

Performance tuning (vector indexing, path caching, join filtering) allows us to handle live queries across 9000+ resources, all on AWS RDS.
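The local search step (embedding match plus 1-hop neighbor retrieval) can be sketched in a few lines of plain Python. Everything here is a toy assumption for illustration: the entity names, the 3-dimensional embeddings, and the in-memory edge list stand in for what would be pgvector columns and edge tables in PostgreSQL.

```python
import math

# Toy 3-dimensional embeddings; real ones would come from an embedding model.
EMBEDDINGS = {
    "payments-service": [0.9, 0.1, 0.0],
    "orders-db": [0.1, 0.9, 0.0],
    "checkout-frontend": [0.7, 0.3, 0.1],
    "email-worker": [0.0, 0.1, 0.9],
}

# Hypothetical dependency edges between the entities above.
EDGES = [
    ("payments-service", "orders-db"),
    ("checkout-frontend", "payments-service"),
    ("email-worker", "orders-db"),
]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def local_search(query_vector, top_k=1):
    """Pick the top_k most similar entities, then add their 1-hop neighbors."""
    ranked = sorted(
        EMBEDDINGS,
        key=lambda name: cosine_similarity(EMBEDDINGS[name], query_vector),
        reverse=True,
    )
    seeds = ranked[:top_k]
    neighbors = set()
    for source, target in EDGES:
        if source in seeds:
            neighbors.add(target)
        if target in seeds:
            neighbors.add(source)
    return seeds, sorted(neighbors - set(seeds))


seeds, neighbors = local_search([1.0, 0.0, 0.0], top_k=1)
print("seeds:", seeds)
print("1-hop neighbors:", neighbors)
```

The vector match finds where to start, and the neighbor expansion supplies the relational context that a pure similarity search would miss.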

4. GraphRAG vs Traditional RAG: What's the difference?

While traditional RAG retrieves flat documents using vector similarity, GraphRAG combines embeddings with relational paths to retrieve graph-contextual knowledge.

RAG = document chunks + LLM
GraphRAG = semantic paths + causality chains

This distinction is crucial for incident response, where connections, not content alone, determine insight quality.

5. How GraphRAG uses our PostgreSQL-based knowledge graph

At NOFire AI, our GraphRAG system is built on a continuously evolving knowledge graph, stored entirely in AWS RDS. This graph models the full causal structure of our infrastructure: services, Kubernetes workloads, databases, changes, alerts, investigations, incidents and postmortems.

We continuously ingest events and relationships to keep this graph current, forming the substrate for intelligent reasoning.

But the knowledge graph itself is not the end goal. GraphRAG is how we activate it. It’s the retrieval interface that extracts the most relevant subgraphs for a specific context, whether that’s an incident, a performance regression, change management intelligence or a risky deployment.

Aspect	Knowledge Graph	GraphRAG
Definition	Structured graph of entities, dependencies, and causal metadata	Retrieval pattern that extracts relevant graph slices for a query or task
Purpose	Serves as persistent causal model of infrastructure	Serves as runtime reasoning engine for GenAI, incident triage, change analysis
Storage	Tables + edges + embeddings in PostgreSQL	Queried live from PostgreSQL via vector + recursive SQL
Update Cycle	Continuously updated via pipelines, events, CI/CD data	Constructed per use case: alert, change, user query
Output	Full system graph (all entities and links)	Subgraph most relevant to the current context (e.g. root cause path)
Integration with AI	Supports statistical causality, historical analysis	Powers high-accuracy retrieval for LLM prompting + root cause reasoning

In short, the knowledge graph is the map and GraphRAG is the navigation system that finds the right path when it matters.
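A small Python sketch of the map-versus-navigation split: the full edge list plays the role of the knowledge graph, and a retrieval step keeps only the slice relevant to the current context before rendering it as text for an LLM prompt. The entity names (including the deployment ID `deploy-1234`), the relation labels, and both function names are hypothetical, and the prompt layout is just one plausible choice.

```python
# The "map": a hypothetical full graph as (source, relation, target) triples.
FULL_GRAPH = [
    ("checkout-frontend", "calls", "payments-service"),
    ("payments-service", "reads_from", "orders-db"),
    ("deploy-1234", "changed", "payments-service"),
    ("email-worker", "reads_from", "orders-db"),
    ("search-service", "calls", "catalog-db"),
]


def induced_subgraph(edges, relevant_entities):
    """Keep only edges whose two endpoints are both in the relevant set."""
    return [
        edge
        for edge in edges
        if edge[0] in relevant_entities and edge[2] in relevant_entities
    ]


def render_context(question, subgraph):
    """Serialize a subgraph as plain-text context followed by the question."""
    lines = [f"- {source} {relation} {target}" for source, relation, target in subgraph]
    return (
        "Known infrastructure relationships:\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )


# The "navigation": entities judged relevant to one alert (assumed already retrieved).
relevant = {"checkout-frontend", "payments-service", "deploy-1234"}
subgraph = induced_subgraph(FULL_GRAPH, relevant)
print(render_context("Why is checkout latency elevated?", subgraph))
```

Only the edges touching the alert's context reach the prompt, which keeps the model's input small and grounded in relationships that actually exist in the graph.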

6. Google's Graph Foundation Models: The Validation

Google released their GFM paper, showing 3x–40x gains in ML tasks using graph-structured relational data. It validated what we’d already proven:

Relational data is best understood as a graph
Graph structures outperform flat schemas in ML

While GFMs learn representations, our GraphRAG retrieves insights in real time. Both reject the assumption that relational data is "flat" and show that structure is the signal.

7. What Sets Our GraphRAG Architecture Apart

One of the most powerful aspects of our GraphRAG implementation is how it complements Causal AI and Generative AI in our platform. While most systems treat GenAI and causal reasoning as separate workflows, we've fused the two, powered by Causal AI models specifically designed for infrastructure graphs:

Causal AI identifies the likely causes and propagation patterns of incidents based on graph structure, topology, and prior behavior.
Generative AI turns those insights into actionable output: root cause summaries, recommended fixes, and postmortems.

This feedback loop improves over time and reinforces the principles we've explored in our previous writing, including "Why GenAI Alone Won't Fix Incident Response" and related pieces on Causal AI and observability:

High-accuracy retrieval of root causes with explainability
Contextual generation grounded in infrastructure relationships
Reduced hallucinations from LLMs due to causal anchors

By embedding Causal AI models into our GraphRAG system and grounding GenAI in factual infrastructure relationships, we deliver high-accuracy incident and change management intelligence that is explainable, actionable, and ready for real-time operations.

Three principles drive our edge:

Graph-Native Thinking: We model infrastructure as a living, evolving graph.
Production-First Architecture: Real-world scale, failover, ingestion — all handled on AWS.
GFM-Aligned Vision: Where foundation models generalize, GraphRAG operationalizes.

We didn't build a foundation model; we built the retrieval substrate that makes such models useful.

8. The Foundation for Proactive Observability and Change Intelligence

Key Results We've Seen with GraphRAG:

Reduced false positives by over 65% through causal correlation
Cut mean incident response time by 40–60% for high-priority alerts
Increased first-touch resolution accuracy by 3x
Reduced SRE toil and manual triage across 9000+ resources
Accelerated change intelligence workflows by retrieving and tracing relevant changes across teams and data sources up to 80% faster
Improved deployment confidence through real-time correlation of code, infra, and config shifts within a unified graph

These metrics signal a shift from reactive troubleshooting to predictive, graph-driven infrastructure intelligence.

GraphRAG enables a step-change in infrastructure maturity:

Simulate rollout impact before deployment
Detect slow-propagating failures across services
Map CI/CD commits to blast radius
Align incidents with business KPIs

It acts as the graph memory layer for intent-aware agents, automated mitigation, and next-gen operational readiness. We made causality computable, explainable, and scalable.

9. The Road Ahead: Autonomous Reasoning Is Next

We’re evolving from retrieval to reasoning:

Change agents that propose safe deployments
Copilots that surface fixes in Slack and Grafana IRM
Postmortems written from graph context
Simulated root cause using synthetic traces

All grounded in a SQL-powered, AWS-native graph stack.

Final Thought: The Relational Graph Awakening

We didn’t pick PostgreSQL because it was trendy. We picked it because it delivered:

High uptime and retrieval SLAs
Predictable cost at telemetry scale
Full alignment with our AWS-native architecture

The research world is catching up. GraphRAG is now real, and PostgreSQL has proven itself capable of powering complex graph-native workloads in production. It's the backbone of intelligent infrastructure.

Want to see what graph-native observability looks like in action? Book a demo with NOFire AI