The SRE and AI glossary.

Definitions, guides, and debugging playbooks for site reliability engineering, AI agents in production, and the NOFire operating model.

SRE & AI SRE

Core concepts in site reliability engineering and the emerging AI SRE discipline.

Observability vs monitoring: what is the difference?

Monitoring checks predefined conditions and alerts when something known breaks. Observability lets you interrogate a system's internal state from its outputs so you can answer questions you did not think to ask in advance.

What are the DORA metrics?

DORA metrics are four measures of software delivery performance: deployment frequency, lead time for changes, change failure rate, and time to restore service. Developed by the DevOps Research and Assessment (DORA) program, which is now part of Google Cloud, they are a widely used benchmark for engineering team health.

What is an AI SRE?

An AI SRE is an AI system that performs site reliability engineering tasks, including incident detection, root-cause investigation, and remediation. This page explains how AI SREs work, where accuracy matters, and how they differ from AIOps.

What is Kubernetes?

Kubernetes (K8s) is an open-source platform for automating the deployment, scaling, and management of containerized applications. Originally designed at Google and open-sourced in 2014, it is the dominant orchestration platform for production workloads.

What is site reliability engineering (SRE)?

Site reliability engineering applies software engineering principles to infrastructure and operations to build scalable, reliable systems. It was created at Google in 2003 and uses SLOs, error budgets, and automation to treat reliability as a measurable product attribute.

Prevent

Catching failures before they reach production: change analysis, blast radius, DORA stability metrics.

Change failure rate, explained

Change failure rate (CFR) is one of the four DORA metrics, measuring the percentage of deployments that cause a production failure requiring remediation. A lower CFR indicates safer deployments and usually reflects stronger pre-deploy validation.
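
As a rough illustration of the definition above, CFR is the number of deployments that needed remediation divided by the total number of deployments in the same window. The sketch below assumes a hypothetical `Deployment` record with a `caused_failure` flag; how a team decides that a deployment "caused a failure" is not specified here and will vary.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    service: str
    caused_failure: bool  # True if the deploy needed a rollback, hotfix, or patch


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Return the share of deployments that caused a failure, as a percentage."""
    if not deployments:
        return 0.0  # no deployments in the window, so nothing failed
    failed = sum(1 for d in deployments if d.caused_failure)
    return 100.0 * failed / len(deployments)


deploys = [
    Deployment("checkout", False),
    Deployment("checkout", True),
    Deployment("search", False),
    Deployment("search", False),
]
print(change_failure_rate(deploys))  # 1 failure out of 4 deployments -> 25.0
```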

What is blast radius analysis?

Blast radius analysis predicts the scope of impact if a proposed change or agent action fails, identifying which services, users, or transactions are affected. In AI agent governance, it is a runtime enforcement primitive that bounds what actions an agent is allowed to take.
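
One simple way to approximate a blast radius, assuming a service dependency map is available, is to walk the graph from the changed service to everything that depends on it, directly or transitively. The sketch below uses a made-up `DEPENDS_ON` map and a hypothetical `blast_radius` helper. It only covers service-level impact, not users or transactions, and is not a description of any particular product's algorithm.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": [],
    "inventory": [],
}


def blast_radius(changed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively depends on `changed`."""
    # Invert the edges so we can walk from a service to its callers.
    callers: dict[str, list[str]] = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, []).append(service)

    affected: set[str] = set()
    queue = deque([changed])
    while queue:  # breadth-first walk over the inverted graph
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


print(sorted(blast_radius("inventory", DEPENDS_ON)))  # ['checkout', 'search', 'web']
print(sorted(blast_radius("payments", DEPENDS_ON)))   # ['checkout', 'web']
```

A governance layer could compare the size or contents of this set against a limit before allowing an action, which is the sense in which blast radius bounds what an agent may do.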

What is proactive reliability?

Proactive reliability is an approach to production operations that focuses on preventing failures before they occur rather than responding after incidents fire. It requires pre-deploy risk assessment, blast-radius analysis, and a live model of production state.

Resolve

Root cause analysis, MTTR reduction, and the tools and techniques that shorten incident response.

Causal vs correlation-based root cause analysis

Correlation-based RCA flags signals that moved together during an incident. Causal RCA traces the dependency graph to find the origin event that triggered the failure chain.
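
The difference can be sketched with a toy example. Assume a hypothetical dependency map (`CALLS`) and a set of services that looked anomalous during an incident. A correlation-style view returns every anomalous service. A deliberately simplified causal view keeps only the anomalous services whose own dependencies are healthy, since those are the candidates for where the failure chain started. Real causal RCA also has to weigh timing, deployments, and configuration changes, which this sketch ignores.

```python
# Hypothetical dependency map: each service lists the services it calls.
CALLS = {
    "web": ["checkout"],
    "checkout": ["payments"],
    "payments": ["db"],
    "db": [],
    "search": [],
}

# Services whose metrics moved during the incident (made-up data).
ANOMALOUS = {"web", "checkout", "payments", "db"}


def correlated_signals(anomalous: set[str]) -> set[str]:
    """Correlation-style answer: everything that moved together."""
    return set(anomalous)


def origin_candidates(anomalous: set[str], calls: dict[str, list[str]]) -> set[str]:
    """Keep anomalous services that have no anomalous dependency beneath them."""
    return {
        service
        for service in anomalous
        if not any(dep in anomalous for dep in calls.get(service, []))
    }


print(sorted(correlated_signals(ANOMALOUS)))        # ['checkout', 'db', 'payments', 'web']
print(sorted(origin_candidates(ANOMALOUS, CALLS)))  # ['db']
```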

Debugging 502 and 503 errors in Kubernetes

HTTP 502 and 503 errors in Kubernetes signal that a proxy or load balancer could not reach a healthy upstream. This guide covers the most common causes and the exact kubectl commands to diagnose and fix them.

How to fix OOMKilled (exit code 137) in Kubernetes

OOMKilled (exit code 137) means the Linux kernel terminated your container for exceeding its memory limit. Learn how to diagnose the root cause and fix it permanently.
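
The number 137 follows the common shell convention of reporting 128 plus the signal number when a process is killed by a signal: 137 - 128 = 9, which is SIGKILL on Linux. The helper below (`describe_exit_code` is a hypothetical name) decodes that arithmetic. Note that SIGKILL alone does not prove an out-of-memory kill. In Kubernetes, the `OOMKilled` reason in the container's last terminated state is what confirms it.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name  # e.g. 9 -> SIGKILL on Linux
        except ValueError:
            name = f"signal {signum}"  # number not defined on this platform
        return f"killed by {name}"
    return f"exited with application error {code}"


print(describe_exit_code(137))  # on Linux: killed by SIGKILL
print(describe_exit_code(143))  # on Linux: killed by SIGTERM
print(describe_exit_code(1))    # exited with application error 1
```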

How to reduce MTTR in production

MTTR (mean time to restore) measures how quickly a team recovers from incidents. Reducing it requires faster detection, accurate root cause identification, pre-built remediations, and incident memory.
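
As a minimal sketch of the measurement itself, MTTR is the average of (restored time - start time) across incidents in a window. The incident timestamps below are made up, and the choice of when an incident "starts" (first alert, first customer impact, and so on) is an assumption each team has to make for itself.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average the (started, restored) durations of a list of incidents."""
    if not incidents:
        return timedelta(0)  # no incidents in the window
    total = sum(
        (restored - started for started, restored in incidents),
        timedelta(0),
    )
    return total / len(incidents)


incidents = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 30)),  # 30 minutes
    (datetime(2024, 5, 3, 14, 0), datetime(2024, 5, 3, 15, 30)),  # 90 minutes
]
print(mean_time_to_restore(incidents))  # 1:00:00
```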

Kubernetes pod stuck in Pending state

A Kubernetes pod stuck in Pending means the scheduler cannot place it on any node. The most common causes are insufficient resources, taint/toleration mismatches, unbound PVCs, and unsatisfied node selector or affinity rules.

Kubernetes readiness and liveness probe failures

Kubernetes probe failures are a leading cause of CrashLoopBackOff, 502 errors, and traffic drops. Learn how to diagnose the exact failure type and fix it in minutes.

PostgreSQL deadlock detection and prevention

A PostgreSQL deadlock occurs when two or more transactions are each waiting for a lock held by another, creating a cycle that PostgreSQL resolves by aborting one transaction with error code 40P01 (deadlock_detected). This guide covers detection, root causes, and prevention patterns.
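
The "cycle" in that definition can be pictured as a loop in a wait-for graph, where an edge from A to B means transaction A is waiting on a lock that B holds. The sketch below is an illustration of that idea, not PostgreSQL's actual detector. It assumes, as a simplification, that each transaction waits on at most one other, and the transaction ids are made up.

```python
from typing import Optional


def find_deadlock(waits_for: dict[int, int]) -> Optional[list[int]]:
    """Return the transaction ids that form a wait cycle, or None if there is none.

    `waits_for[a] == b` means transaction a is blocked on a lock held by b.
    """
    for start in waits_for:
        path: list[int] = []
        position: dict[int, int] = {}  # transaction id -> index in path
        current = start
        # Follow the chain of waiters until it ends or revisits a transaction.
        while current in waits_for and current not in position:
            position[current] = len(path)
            path.append(current)
            current = waits_for[current]
        if current in position:
            return path[position[current]:]  # the part of the path that loops
    return None


# 101 waits on 102 and 102 waits on 101: a two-transaction deadlock.
# 103 is merely queued behind 101 and is not part of the cycle.
print(find_deadlock({101: 102, 102: 101, 103: 101}))  # [101, 102]
print(find_deadlock({1: 2, 2: 3}))                    # None
```

A common prevention pattern follows directly from this picture: if every transaction acquires locks in the same order, the wait-for graph cannot contain a loop.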

What are root cause analysis tools?

Root cause analysis tools help engineering teams identify the underlying cause of a production incident, not just the symptoms. Learn how causal approaches outperform correlation-based tools and what to look for when evaluating RCA tooling.

What is alert fatigue?

Alert fatigue is the condition where on-call engineers become desensitized to monitoring alerts due to high volume, frequent false positives, and poor signal-to-noise ratio. It leads to slower response to real incidents, missed critical alerts, and engineer burnout.

What is causal AI?

Causal AI is a class of artificial intelligence that infers cause-and-effect relationships rather than identifying statistical patterns or correlations. For production operations, it means tracing a symptom back through a dependency graph to its origin event rather than surfacing related signals.

Remember

Post-incident learning, postmortems, and turning the fix you found at 2 a.m. into the fix you reuse next time.

How to run a post-incident review

A post-incident review (PIR) is a structured meeting where the responding team reviews the timeline, confirms root cause, and assigns action items. This guide covers who should attend, how to run the meeting, and how to turn findings into durable memory.

Incident postmortem template

A structured incident postmortem template with sections for timeline, root cause, impact, and action items. Includes blameless framing guidance and tips for turning postmortems into durable organizational memory.

Govern

AI agent governance, policy as code, microVM isolation, and runtime enforcement for autonomous systems.

What is a microVM?

A microVM is a lightweight virtual machine that boots in milliseconds and provides hardware-level isolation for untrusted workloads, including AI agents. It combines the security boundary of a traditional VM with the speed and density of a container.

What is agentic AI?

Agentic AI systems act autonomously toward goals, planning multi-step sequences and taking actions rather than only generating text. In production operations this means agents that can deploy code, change configuration, and remediate incidents without human approval of each step.

What is AI agent governance?

AI agent governance is the set of controls, policies, and audit mechanisms that constrain what autonomous AI agents are allowed to do in production. It enforces policy at runtime, not only after the fact.

What is policy as code for production?

Policy as code means expressing operational, security, and compliance rules as version-controlled, machine-executable code rather than documents or manual review. In production, a policy engine evaluates every action against a defined rule set before it executes.
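
To make the idea concrete, the sketch below expresses two rules as data plus predicates and checks a proposed action against them before it would run. Everything here (`Action`, `Rule`, `evaluate`, the rule names) is hypothetical and written in Python purely for illustration. Production policy engines such as Open Policy Agent use their own policy language and evaluation model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    """A proposed operation (hypothetical shape)."""
    actor: str
    verb: str
    target: str
    environment: str


@dataclass(frozen=True)
class Rule:
    """A named rule whose predicate returns True when the action must be denied."""
    name: str
    denies: Callable[[Action], bool]


RULES = [
    Rule(
        "no-production-deletes",
        lambda a: a.environment == "production" and a.verb == "delete",
    ),
    Rule(
        "agents-cannot-touch-payments",
        lambda a: a.actor.startswith("agent:") and a.target == "payments",
    ),
]


def evaluate(action: Action, rules: list[Rule]) -> tuple[bool, list[str]]:
    """Return (allowed, names of violated rules) for a proposed action."""
    violated = [rule.name for rule in rules if rule.denies(action)]
    return (not violated, violated)


restart = Action("agent:remediator", "restart", "checkout", "production")
delete = Action("agent:remediator", "delete", "payments", "production")
print(evaluate(restart, RULES))  # (True, [])
print(evaluate(delete, RULES))   # (False, ['no-production-deletes', 'agents-cannot-touch-payments'])
```

Because the rules live in a file, they can be reviewed, versioned, and tested like any other code, which is the main point of the approach.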

NOFire Concepts

NOFire-specific concepts: the Production Context Graph, the Context and Control Model, and the four jobs.

The production reliability mental model

The production reliability mental model separates what you know about your system from the system itself, applying three lenses: Topology, Knowledge, and State. Understanding this distinction is the foundation of effective incident response and AI-assisted root-cause analysis.

What is a Production Context Graph?

A Production Context Graph is a live, time-versioned graph of how your production services, deployments, configurations, dependencies, and incidents connect. It is the data structure that makes causal root cause analysis and runtime agent governance possible.

What is the Context and Control Model?

The Context and Control Model is a production operating framework that pairs a live, time-versioned model of how production behaves with a runtime enforcement layer that evaluates every autonomous action before it executes. Together they give AI agents the speed of automation and the safety of human judgment.