How we built a production-ready GraphRAG for RCA using AWS RDS
Production-grade root cause analysis and change management intelligence with an AWS-native graph engine
Spiros E.
Founder & CEO

While the AI world was obsessing over large language models, we were solving a deeper infrastructure problem: how to build a Graph Retrieval-Augmented Generation (GraphRAG) system that actually works in production.
Six months before Google published its paper on Graph Foundation Models (GFMs), we had already deployed GraphRAG at scale on AWS. With over 9000 resources in production, we showed that relational data can be treated as a graph and that PostgreSQL on AWS RDS (Aurora) can power intelligent root cause analysis in real time.
Here’s how we built the foundation of proactive observability using an AWS-native, graph-first architecture.
In modern cloud environments, metrics dashboards can mislead. Incidents aren't isolated; they're deeply relational.
Running thousands of services on AWS, we found that every incident was actually a graph traversal problem:
Traditional tools show what changed, not how or why it spread. We traced $10M+ in lost revenue to false leads, slow incident resolution, and repeated regressions.
So we made a fundamental shift:
Operational data is a graph. Incidents are traversals. Root causes are paths.
This insight led us to adopt GraphRAG and take it further, making it production-ready on PostgreSQL, an unconventional yet powerful graph substrate.
We considered Neo4j, Neptune, and JanusGraph. But our requirements were production-grade:
We chose PostgreSQL + pgvector on AWS RDS and it turned out to be the perfect engine.
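```sql
-- Fast vector similarity for embedding-based node retrieval.
-- <=> is pgvector's cosine distance operator, so 1 - distance is cosine similarity.
-- query_vector stands for the query embedding, supplied as a bound parameter.
SELECT entity_name, 1 - (embedding <=> query_vector) AS similarity
FROM observability_entities
ORDER BY embedding <=> query_vector
LIMIT 10;
```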
Using SQL recursion (WITH RECURSIVE), we built multi-hop dependency traversals natively in PostgreSQL. Today, these traversals form the heart of our graph-native incident analysis engine.
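To make the multi-hop traversal concrete, here is a minimal sketch of a recursive dependency walk. It is not our production query: the `dependency_edges` table, its columns, the service names, and the `downstream_dependencies` helper are all hypothetical, and the SQL runs against an in-memory SQLite database so the example is self-contained. The `WITH RECURSIVE` structure carries over to PostgreSQL, where an array column would be a more natural cycle guard than the delimited path string used here.

```python
import sqlite3

# Hypothetical schema: one row per directed edge, "source depends on target".
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dependency_edges (source TEXT, target TEXT);
INSERT INTO dependency_edges VALUES
  ('checkout-api', 'payments-svc'),
  ('payments-svc', 'orders-db'),
  ('checkout-api', 'cart-svc'),
  ('cart-svc', 'redis-cache'),
  ('search-api', 'search-index');
""")

TRAVERSAL_SQL = """
WITH RECURSIVE downstream(entity, depth, path) AS (
    -- Anchor row: the entity under investigation
    SELECT :start, 0, '/' || :start || '/'
    UNION ALL
    -- Recursive step: follow one more dependency edge
    SELECT e.target, d.depth + 1, d.path || e.target || '/'
    FROM dependency_edges AS e
    JOIN downstream AS d ON e.source = d.entity
    WHERE d.depth < :max_depth
      -- Cycle guard: skip entities already on this path
      -- (assumes entity names never contain '/')
      AND instr(d.path, '/' || e.target || '/') = 0
)
SELECT entity, depth, path
FROM downstream
WHERE depth > 0
ORDER BY depth, entity;
"""

def downstream_dependencies(conn, start, max_depth=3):
    """Return (entity, depth, path) rows reachable from `start` within `max_depth` hops."""
    params = {"start": start, "max_depth": max_depth}
    return conn.execute(TRAVERSAL_SQL, params).fetchall()

for entity, depth, path in downstream_dependencies(conn, "checkout-api"):
    print(depth, entity, path)
```

Each returned row carries the path that led to it, which is what turns a flat list of affected services into a candidate root cause path.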
Our system combines vector search with graph reasoning:
Performance tuning (vector indexing, path caching, join filtering) allows us to handle live queries across 9000+ resources, all on AWS RDS.
While traditional RAG retrieves flat documents using vector similarity, GraphRAG combines embeddings with relational paths to retrieve graph-contextual knowledge.
This distinction is crucial for incident response, where connections, not content alone, determine insight quality.
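The difference can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our implementation: the entity names, the three-dimensional embeddings, the edge map, and the `graph_rag_retrieve` function are all made up for the example. Vector similarity picks the seed entities, then a breadth-first expansion along dependency edges pulls in connected entities regardless of how similar their embeddings are to the query.

```python
import math
from collections import deque

# Hypothetical toy data: tiny embeddings and a directed dependency map.
EMBEDDINGS = {
    "checkout-api": [0.9, 0.1, 0.0],
    "payments-svc": [0.2, 0.9, 0.1],
    "orders-db": [0.0, 0.2, 0.9],
    "search-api": [0.7, 0.0, 0.7],
}
EDGES = {
    "checkout-api": ["payments-svc"],
    "payments-svc": ["orders-db"],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def flat_retrieve(query_vec, embeddings, top_k):
    """Plain RAG-style retrieval: rank entities by similarity only."""
    ranked = sorted(
        embeddings,
        key=lambda name: cosine_similarity(query_vec, embeddings[name]),
        reverse=True,
    )
    return ranked[:top_k]

def graph_rag_retrieve(query_vec, embeddings, edges, top_k=1, max_hops=2):
    """Seed with vector search, then expand along edges.

    Returns {entity: hops from the nearest seed}.
    """
    seeds = flat_retrieve(query_vec, embeddings, top_k)
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbor in edges.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops

query = [1.0, 0.0, 0.0]
print(flat_retrieve(query, EMBEDDINGS, top_k=2))
print(graph_rag_retrieve(query, EMBEDDINGS, EDGES, top_k=1, max_hops=2))
```

With this toy data, similarity alone ranks `search-api` second even though it has no dependency link to `checkout-api`, while the graph expansion reaches `orders-db`, whose embedding is orthogonal to the query but which sits two hops downstream.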
At NOFire AI, our GraphRAG system is built on a continuously evolving knowledge graph, stored entirely in AWS RDS. This graph models the full causal structure of our infrastructure: services, Kubernetes workloads, databases, changes, alerts, investigations, incidents and postmortems.
We continuously ingest events and relationships to keep this graph up to date, forming the substrate for intelligent reasoning.
But the knowledge graph itself is not the end goal. GraphRAG is how we activate it. It’s the retrieval interface that extracts the most relevant subgraphs for a specific context, whether that’s an incident, a performance regression, change management intelligence or a risky deployment.
| | Knowledge Graph | GraphRAG |
|---|---|---|
| Definition | Structured graph of entities, dependencies, and causal metadata | Retrieval pattern that extracts relevant graph slices for a query or task |
| Purpose | Serves as persistent causal model of infrastructure | Serves as runtime reasoning engine for GenAI, incident triage, change analysis |
| Storage | Tables + edges + embeddings in PostgreSQL | Queried live from PostgreSQL via vector + recursive SQL |
| Update Cycle | Continuously updated via pipelines, events, CI/CD data | Constructed per use case: alert, change, user query |
| Output | Full system graph (all entities and links) | Subgraph most relevant to the current context (e.g. root cause path) |
| Integration with AI | Supports statistical causality, historical analysis | Powers high-accuracy retrieval for LLM prompting + root cause reasoning |
In short, the knowledge graph is the map and GraphRAG is the navigation system that finds the right path when it matters.
Google released their GFM paper, showing 3x–40x gains in ML tasks using graph-structured relational data. It validated what we’d already proven:
While GFMs learn representations, our GraphRAG retrieves insights in real time. Both reject the idea that relational data is "flat" and show that structure is the signal.
One of the most powerful aspects of our GraphRAG implementation is how it complements Causal AI and Generative AI in our platform. While most systems treat GenAI and causal reasoning as separate workflows, we've fused the two, powered by Causal AI models specifically designed for infrastructure graphs:
This feedback loop improves over time and reinforces the principles we’ve explored in our previous writing, including Why GenAI Alone Won’t Fix Incident Response and related pieces on Causal AI and observability:
By embedding Causal AI models into our GraphRAG system and grounding GenAI in factual infrastructure relationships, we deliver high-accuracy incident and change management intelligence that is explainable, actionable, and ready for real-time operations.
Three principles drive our edge:
We didn’t build a foundation model; we built the retrieval substrate that makes such models useful.
Key Results We've Seen with GraphRAG:
These metrics signal a shift from reactive troubleshooting to predictive, graph-driven infrastructure intelligence.
GraphRAG enables a step-change in infrastructure maturity:
It acts as the graph memory layer for intent-aware agents, automated mitigation, and next-gen operational readiness. We made causality computable, explainable, and scalable.
We’re evolving from retrieval to reasoning:
All grounded in a SQL-powered, AWS-native graph stack.
We didn’t pick PostgreSQL because it was trendy. We picked it because it delivered:
The research world is catching up. GraphRAG is now real and PostgreSQL has proven itself capable of powering complex graph-native workloads in production. It’s the backbone of intelligent infrastructure.
Want to see what graph-native observability looks like in action? Book a demo with NOFire AI