A live model of production. Enforcement for every action.

The Context & Control Model captures how your services, deploys, dependencies, and incidents connect across time. Every action by a human or agent is evaluated against it before it touches production.

Book a demo · See how it works

NOFire.ai · Investigations

why is checkout failing? payment errors spiking

Ran 19 queries across 4 tools

PR #4471 raised the Kafka consumer batch_size from 500 to 2000, pushing memory to 597 MB (99.5% of the 600 MB limit). Checkout RPC p99 spiked 6x exactly 18 min after deploy.

Approve the rollback · Add memory limit

Ask anything to steer…

Root cause · Evidence · Causal chain

Hypotheses

Confirmed · hypothesis #1

PR #4471 (Kafka consumer batch_size 500 to 2000) pushed memory to the limit, degrading checkout RPC

Deploy raised batch_size 4x. Kafka process reached 597 MB steady-state 18 min before RPC p99 spiked to 485 ms.

92% confidence

Supporting evidence

Kafka memory at 597 MB, 99.5% of the 600 MB limit, sustained for 38 min

RPC p99 spiked to 485 ms (6x baseline) starting 04:17:42

4 errors: rpc error: Payment request failed. Invalid token

Contradicting evidence

CPU stable, rules out compute saturation

No pod crashes. OOM threshold not reached

Consistent with memory pressure without full OOM

REJECTED · 6%

Checkout service CPU saturation under traffic spike

REJECTED · 2%

Upstream payment provider latency regression

Recommended actions · 1 ranked

Revert PR #4471: restore Kafka consumer batch_size to 500

Rollback · Low risk · one-click revert · checkout · 1 file

checkout-svc · config/kafka.yaml · -2 +2

  consumer:
    group_id: checkout-orders
-   batch_size: 2000
-   fetch_max_bytes: 8388608
+   batch_size: 500
+   fetch_max_bytes: 2097152
    session_timeout_ms: 30000

kustomize/overlays/prod/checkout-svc/kafka-config.yaml

How it works.

Agents that investigate and prevent. Runbooks that fire on schedule, on event, or on demand. One execution path for every actor: human, CI, or agent.

NOFire Agents

Investigation

Tests multiple hypotheses in parallel. Verifies each with real evidence from your infra, code, telemetry, and change history.

Prevention

Pre-deploy blast-radius analysis and policy gates on every PR

Build your own Agents

Bring your own agents via MCP. They inherit the same production context, policy gates, and audit trail.
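
As a rough illustration of what testing hypotheses in parallel can look like, the sketch below runs each hypothesis check concurrently and normalizes the scores into a ranking. It is not NOFire's implementation: `Hypothesis`, `rank_hypotheses`, the evidence keys, and the scoring rule are all hypothetical, and the evidence values are borrowed from the investigation panel earlier on this page.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Hypothesis:
    name: str
    # Hypothetical check: returns a raw, non-negative support score.
    check: Callable[[dict], float]


def rank_hypotheses(hypotheses: list[Hypothesis], evidence: dict) -> list[tuple[str, float]]:
    """Run every check concurrently, then normalize scores so they sum to 1."""
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(lambda h: h.check(evidence), hypotheses))
    total = sum(scores)
    if total == 0:
        return [(h.name, 0.0) for h in hypotheses]
    ranked = [(h.name, s / total) for h, s in zip(hypotheses, scores)]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Illustrative evidence; the key names are assumptions.
evidence = {"memory_mb": 597, "memory_limit_mb": 600, "cpu_stable": True}

hypotheses = [
    Hypothesis("memory pressure after batch_size change",
               lambda e: e["memory_mb"] / e["memory_limit_mb"]),
    Hypothesis("CPU saturation under traffic spike",
               lambda e: 0.0 if e["cpu_stable"] else 1.0),
]

ranking = rank_hypotheses(hypotheses, evidence)
```

A real system would weigh many more signals per hypothesis; the point here is only that each check is independent, so the checks can run side by side and be compared on one scale.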

Runbooks

Scheduled

Weekly drift check · Monday health scan · post-deploy validation

Event-triggered

Grafana alert fired · PR merged · Slack mention

On-demand

/slash-command in Chat

Full audit trail · scoped per role
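
One way to picture the three trigger types sharing a single audited path is a small registry keyed by trigger. This is a minimal sketch under assumptions: `RunbookRegistry`, the `schedule:` / `event:` / `command:` key scheme, and the runbook bodies are all hypothetical and are not taken from the product.

```python
from dataclasses import dataclass, field
from typing import Callable

Runbook = Callable[[dict], str]


@dataclass
class RunbookRegistry:
    # Maps a trigger key (hypothetical naming scheme) to its runbooks.
    handlers: dict[str, list[Runbook]] = field(default_factory=dict)
    # Every execution is recorded as (trigger, actor, result).
    audit_log: list[tuple[str, str, str]] = field(default_factory=list)

    def register(self, trigger: str, runbook: Runbook) -> None:
        self.handlers.setdefault(trigger, []).append(runbook)

    def fire(self, trigger: str, actor: str, payload: dict) -> list[str]:
        """Run every runbook bound to the trigger and log each result."""
        results = []
        for runbook in self.handlers.get(trigger, []):
            result = runbook(payload)
            self.audit_log.append((trigger, actor, result))
            results.append(result)
        return results


registry = RunbookRegistry()
registry.register("schedule:weekly", lambda p: "drift check complete")
registry.register("event:pr_merged", lambda p: f"validated deploy for PR #{p['pr']}")
registry.register("command:/health", lambda p: "health scan complete")

results = registry.fire("event:pr_merged", actor="ci", payload={"pr": 4471})
```

Whether a cron tick, a webhook, or a chat command calls `fire`, the execution lands in the same `audit_log`.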

Trust Boundary

01 · Durable Execution

stateful · orchestrator · audit trail

02 · Runtime

stateless · policy enforcement at action-time

03 · Execution Sandbox

micro-VM isolation · one workload per instance

Who triggers

CI / CD

Engineer

Custom Agents

NOFire Agents

Every actor goes through the same gate. No privileged paths.

The platform.

Use our agents or bring your own.

Models

Works with the frontier models and cloud providers you already run. No new contracts, no new vendor risk. Your cloud, your keys, your compliance posture.

Cloud providers

AWS Bedrock · your contract

Azure OpenAI · your contract

Google Vertex · your contract

Frontier models

OpenAI

Causal

Your causal production graph maps every service, dependency, deploy, configuration change, and failure pattern by how they actually cause one another. Time-versioned and continuously updated.

Live · updated 14m ago

checkout · kube_deployment

api-gateway · app_service

payment-svc · kube_deployment

postgres · rds_instance

otel-collector · app_service

4 incidents

5 engineer notes
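
To make "time-versioned" concrete, the sketch below stores each causal edge with a validity window and answers "what could have caused this, as of time t?". It is an illustration only: the `Edge` and `CausalGraph` types, the integer timestamps, and the specific edges between the services listed above are assumptions, not the product's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    cause: str
    effect: str
    valid_from: int           # illustrative integer timestamps
    valid_to: Optional[int]   # None means the edge is still current


class CausalGraph:
    def __init__(self) -> None:
        self.edges: list[Edge] = []

    def add(self, cause: str, effect: str, valid_from: int,
            valid_to: Optional[int] = None) -> None:
        self.edges.append(Edge(cause, effect, valid_from, valid_to))

    def upstream(self, node: str, at: int) -> set[str]:
        """All transitive causes of `node`, using only edges valid at time `at`."""
        seen: set[str] = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for e in self.edges:
                live = e.valid_from <= at and (e.valid_to is None or at < e.valid_to)
                if live and e.effect == current and e.cause not in seen:
                    seen.add(e.cause)
                    stack.append(e.cause)
        return seen


graph = CausalGraph()
graph.add("postgres", "payment-svc", valid_from=0)
graph.add("payment-svc", "checkout", valid_from=0)
# Assumed for illustration: this dependency was removed at t=10.
graph.add("otel-collector", "checkout", valid_from=0, valid_to=10)

before = graph.upstream("checkout", at=5)
after = graph.upstream("checkout", at=20)
```

Querying at two points in time returns different upstream sets, which is what lets an investigation reason about the graph as it was when an incident began rather than as it is now.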

Your Stack

Integrations across code, domain knowledge, infrastructure, observability, incident management, and CI/CD, plus custom MCPs and custom tools.

Code

GitHub

GitLab

Bitbucket

Infrastructure

AWS

GCP

Azure

Kubernetes

Telemetry

Datadog

Grafana

Prometheus

Elasticsearch

OpenSearch

Honeycomb

Loki

Tempo

Databases

PostgreSQL

MongoDB

Providers

AWS

GCP

Azure

Collab

Slack

PagerDuty

Atlassian

Linear

Live reads · INV-1380

Queried Prometheus · 1.2s
rate(http_5xx{service="checkout"}[5m])
+540% error rate · 1,243 errors/min

Inspected pods · 0.4s
kubectl get pods -l app=checkout -n payments-prod
4/12 CrashLoopBackOff · OOMKilled

Searched commits · 0.8s
path:services/cart --since=24h → abc1234
Memory limit 512M → 256M in last deploy

Security & control

Define exactly what agents can do autonomously vs. what needs human approval. SOC 2 Type II, GDPR, and HIPAA aligned.

Read-only · always autonomous

Read logs & query metrics · Auto
rate(http_5xx{service="checkout"}[5m]) → +540% error rate

Search code & docs · Auto
path:services/cart --since=24h → memory limit 512M → 256M

Analyse change events · Auto
deploy abc1234 payments-prod → memory limit 512M → 256M

Query traces & spans · Auto
spans service=checkout error=true last_5m → 847 spans

Write · pending approval

Revert a commit · Pending approval
Restart a deployment · Pending approval
Silence an alert · Pending approval
Trigger a workflow · Pending approval
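
The split above between autonomous reads and approval-gated writes can be sketched as a single decision function that every actor passes through. This is a simplified illustration, not the product's policy engine: the `Action` type, the action identifiers, and the fail-closed default for unknown actions are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO = "auto"
    PENDING_APPROVAL = "pending approval"


# Names mirror the read-only actions listed above; the identifiers are hypothetical.
READ_ONLY = {
    "read_logs",
    "query_metrics",
    "search_code",
    "analyse_change_events",
    "query_traces",
}


@dataclass(frozen=True)
class Action:
    actor: str   # for example "engineer", "ci", or "agent"
    name: str


def evaluate(action: Action) -> Decision:
    """Same rule for every actor: reads run, everything else waits."""
    if action.name in READ_ONLY:
        return Decision.AUTO
    # Writes and unrecognized actions both require approval (fail closed).
    return Decision.PENDING_APPROVAL


read_decision = evaluate(Action(actor="agent", name="query_metrics"))
write_decision = evaluate(Action(actor="engineer", name="revert_commit"))
```

Note that `evaluate` never branches on `actor`, so an engineer and an agent requesting the same write get the same answer.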

See your production through a causal lens.

A 30-minute call with a founder. We map your stack to the Context & Control Model, live.

Book a demo