The SRE and AI glossary.
Definitions, guides, and debugging playbooks for site reliability engineering, AI agents in production, and the NOFire operating model.
Core concepts in site reliability engineering and the emerging AI SRE discipline.
Catching failures before they reach production: change analysis, blast radius, DORA stability metrics.
Root cause analysis, MTTR reduction, and the tools and techniques that shorten incident response.
Causation vs correlation in root cause analysis
Debugging 502 and 503 errors in Kubernetes
How to fix OOMKilled (exit code 137) in Kubernetes
How to reduce MTTR in production
Kubernetes pod stuck in Pending state
Kubernetes readiness and liveness probe failures
PostgreSQL deadlock detection and prevention
What are root cause analysis tools?
What is alert fatigue?
What is causal AI?
Post-incident learning, postmortems, and turning the 2am fix into the fix for next time.
AI agent governance, policy as code, microVM isolation, and runtime enforcement for autonomous systems.
NOFire-specific concepts: the Production Context Graph, the Context and Control Model, and the four jobs.