What is site reliability engineering (SRE)?

Site reliability engineering (SRE) applies software engineering discipline to infrastructure and operations, with the goal of building systems that are scalable and whose reliability can be measured. Ben Treynor Sloss coined the term at Google in 2003, when he was tasked with running a production service and treated it as a software problem, not an operations one. The result was a practice built on service level objectives, error budgets, and systematic automation rather than heroics and tribal knowledge.

Where SRE came from

Google needed a way to scale operations without scaling headcount at the same rate. Treynor Sloss hired software engineers into the operations role and gave them a mandate: make the system reliable enough that it does not need constant human intervention. This reframing, operations as an engineering discipline, is the foundational insight of SRE.

The implications were significant. Reliability became a product attribute with a number attached to it, not a vague aspiration. Engineers owned it with the same accountability they brought to feature delivery. And when something broke, the question was not who to blame but what to fix in the system so it would not break the same way again.

Core SRE concepts

Service level indicators, objectives, and agreements

A service level indicator (SLI) is a quantitative measure of a service's behavior: request latency, error rate, or availability, for example. A service level objective (SLO) is the target you set for that indicator: for example, 99.9% of requests return in under 200 ms, measured over a 30-day window. A service level agreement (SLA) is the external commitment, typically with financial consequences, built on top of the SLO. SRE practice focuses on SLOs as the internal operating target; SLAs are the floor, not the goal.
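The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names `latency_sli` and `meets_slo`, the sample latencies, and the choice to treat an empty window as compliant are all assumptions made for the example.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float = 200.0) -> float:
    """Return the fraction of requests that completed in under threshold_ms."""
    if not latencies_ms:
        # Assumption: a window with no traffic counts as fully compliant.
        return 1.0
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(sli: float, slo_target: float = 0.999) -> bool:
    """Compare the measured indicator against the objective."""
    return sli >= slo_target


# Hypothetical window: 9,995 fast requests and 5 slow ones.
window = [120.0] * 9995 + [450.0] * 5
sli = latency_sli(window)
print(f"SLI = {sli:.4f}, SLO met: {meets_slo(sli)}")
```

In practice the latencies would come from a metrics system over the full SLO window rather than an in-memory list, but the comparison itself stays this simple: the SLI is what you measured, the SLO is the line it must stay above.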

Error budgets

The error budget is the complement of the SLO. If your SLO is 99.9% availability, your error budget is the remaining 0.1%: roughly 43 minutes of downtime in a 30-day window. Error budgets do two things. First, they give engineering teams explicit permission to take risk in pursuit of feature velocity, up to the limit. Second, when the budget is spent, they create a clear, non-negotiable signal: freeze risky deployments and invest in stability until reliability is restored. This removes the endless negotiation between product and engineering over whether it is safe to ship.
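The arithmetic behind an error budget is short enough to write out. The sketch below assumes an availability SLO measured in minutes of downtime over a fixed window; the function names and the 21.6-minute downtime figure are hypothetical, chosen only to illustrate the calculation.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


budget = error_budget_minutes(0.999)  # 0.1% of 43,200 minutes
remaining = budget_remaining(0.999, downtime_minutes=21.6)
print(f"budget: {budget:.1f} min, remaining: {remaining:.0%}")

# The freeze rule described above: stop risky deploys once the budget is spent.
freeze_deploys = remaining <= 0
```

With a 99.9% SLO and 21.6 minutes of downtime so far, half the budget is left and deployments continue. Once `remaining` reaches zero, the same number becomes the signal to freeze.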

Toil reduction

Toil is manual, repetitive operational work that scales linearly with traffic and provides no lasting value: restarting services, running manual health checks, rotating credentials by hand. SRE practice sets a target, typically capping toil at 50% of an engineer's time, and treats toil elimination as a first-class engineering project. Automation that eliminates toil is as valuable as a product feature.
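A toil cap only works if someone measures against it. As a rough sketch, assuming engineers log hours by category and the team has agreed which categories count as toil, the check could look like this. The category names, hours, and the `toil_fraction` helper are all hypothetical.

```python
TOIL_CAP = 0.50  # the 50% ceiling described above


def toil_fraction(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Share of logged hours that fall into categories classified as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(hours for category, hours in hours_by_category.items()
               if category in toil_categories)
    return toil / total


# Hypothetical week for one engineer.
week = {
    "manual restarts": 6.0,
    "credential rotation": 4.0,
    "manual health checks": 12.0,
    "automation projects": 18.0,
}
toil = {"manual restarts", "credential rotation", "manual health checks"}

fraction = toil_fraction(week, toil)
print(f"toil: {fraction:.0%}, over cap: {fraction > TOIL_CAP}")
```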

Blameless postmortems

When incidents happen, SRE practice calls for a written postmortem that describes what occurred, the timeline, contributing factors, and follow-up action items. The emphasis is on systemic causes, not individual fault. A blameless culture is not about avoiding accountability; it is about creating an environment where engineers report incidents honestly and organizations learn from them reliably. Postmortems that assign blame produce cover stories, not learning.

SRE vs DevOps

DevOps is a philosophy and a set of cultural practices: break down silos between development and operations, ship faster, and share responsibility for production. It does not prescribe specific tools or metrics.

SRE is one rigorous implementation of that philosophy. It takes the DevOps principle of shared ownership and operationalizes it with concrete primitives: SLOs define what reliable looks like, error budgets govern how much risk is acceptable, and postmortems close the feedback loop after failures. You can practice DevOps without SRE. SRE, as defined by Google, is a specific version of DevOps done with engineering precision.

The two are not in competition. Many organizations run DevOps culture broadly while embedding SRE practices, particularly SLOs and error budgets, into their platform and reliability teams.

SRE in the age of AI agents

The classical SRE problem is a human engineer deploying code or making configuration changes that affect a running system. The reliability mechanisms (SLOs, change review, rollback procedures) were designed around that actor.

AI agents change the actor. When a machine is autonomously deploying code, scaling resources, or remediating incidents, it can act faster than any human review process and make thousands of changes in the time it takes a postmortem to be written. The reliability question shifts: how do you enforce reliability constraints on an actor that does not need to ask permission?

The answer requires a live causal model of production that understands which components are interdependent, what normal behavior looks like, and what the downstream blast radius of any change will be. It also requires runtime enforcement: policy that evaluates every agent action before it executes, not after the fact. Availability targets and error budgets do not disappear in an AI-native stack; they become harder to uphold without the right instrumentation.
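To make the idea of runtime enforcement concrete, here is a deliberately small sketch of a policy gate that evaluates an agent's proposed action before it runs. Everything in it is an assumption made for illustration: the static `DEPENDENTS` map stands in for a live causal model, the `AgentAction` type and `evaluate` function are hypothetical, and the two rules (no exhausted error budgets in the blast radius, and a cap on blast radius size) are examples rather than a recommended policy set. It does not represent any particular product's implementation.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical dependency map: service -> services that depend on it.
DEPENDENTS = {
    "db": ["api"],
    "api": ["web", "mobile-gateway"],
    "web": [],
    "mobile-gateway": [],
}


@dataclass(frozen=True)
class AgentAction:
    target: str  # service the agent wants to change
    kind: str    # e.g. "deploy", "scale", "restart"


def blast_radius(target: str, dependents: dict[str, list[str]]) -> set[str]:
    """Breadth-first walk: the target plus everything downstream of it."""
    seen = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for dependent in dependents.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def evaluate(action: AgentAction, budget_remaining: dict[str, float],
             max_radius: int = 3) -> tuple[bool, str]:
    """Decide, before execution, whether the action may proceed."""
    affected = blast_radius(action.target, DEPENDENTS)
    # Fail closed: a service with no budget data is treated as exhausted.
    exhausted = sorted(s for s in affected if budget_remaining.get(s, 0.0) <= 0.0)
    if exhausted:
        return False, f"error budget exhausted for: {', '.join(exhausted)}"
    if len(affected) > max_radius:
        return False, f"blast radius {len(affected)} exceeds limit {max_radius}"
    return True, "allowed"


budgets = {"db": 0.8, "api": 0.5, "web": 0.3, "mobile-gateway": 0.6}
print(evaluate(AgentAction("web", "deploy"), budgets))
print(evaluate(AgentAction("db", "restart"), budgets))
```

The point of the sketch is the ordering: the check runs before the action, and the agent receives a reason it can act on. A change to `web` touches only itself and is allowed. A restart of `db` reaches four services and is refused under the assumed limit.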

For a detailed treatment of how reliability engineering adapts when AI agents enter the production loop, see the AI Reliability Guide. NOFire AI is built around this problem: root-cause analysis and runtime policy for systems where the change agent is increasingly autonomous.

Frequently asked questions

Is SRE the same as DevOps?
No. DevOps is a culture and set of collaboration practices. SRE is a specific engineering implementation of those principles, with concrete primitives like error budgets and SLOs that make reliability measurable and enforceable.

What does an SRE do day to day?
An SRE defines SLOs, automates repetitive operational toil, runs incident response, writes blameless postmortems, and engineers reliability directly into systems rather than reacting to failures after the fact.

What metrics do SRE teams track?
SRE teams track service level indicators (SLIs) against their service level objectives (SLOs), how much error budget remains, and the four DORA metrics: deployment frequency, lead time for changes, change failure rate, and mean time to recovery.