Reliability Memory

Every action builds knowledge

Queryable knowledge of how your production behaves

Faster Onboarding

80% Less RCA Time


Building Knowledge Graph: Code, Telemetry, Infrastructure, Incidents

Every action builds knowledge

Incidents, changes, and patterns connect automatically. Your reliability memory grows with every event.

Total connections: 1,247 links

Pattern clusters: 24 found

Searchable Knowledge

Ask anything about your systems

"Similar incidents to cache-miss-2024-03"

"What changed before payments-service failed?"

847 incidents indexed, growing daily

How organizational knowledge works

Four capabilities that build a reliability memory, one that compounds over time

Searchable incident history

Search by service, symptom, or cause. Reuse patterns from similar past incidents instead of starting from scratch.

How can NOFire AI help you?

Find root causes faster, prevent incidents before they happen, and learn from every investigation.

What have we learned from past checkout-service incidents?

Show me fixes for similar issues in payment-service

Generate alert rules based on orders-service failures

Recommend SLOs for checkout-service based on production data

You can @-mention services or infrastructure by name
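Searching by service, symptom, or cause can be pictured as a filter over structured incident records. The sketch below is only an illustration of that idea, not how NOFire AI stores or queries incidents: the `Incident` fields, the `search` helper, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    # Hypothetical record shape; a real incident store will differ.
    id: str
    service: str
    symptom: str
    cause: str
    fix: str


def search(incidents, service=None, symptom=None, cause=None):
    """Return incidents matching every filter that was supplied."""
    def matches(incident):
        return (
            (service is None or incident.service == service)
            and (symptom is None or symptom.lower() in incident.symptom.lower())
            and (cause is None or cause.lower() in incident.cause.lower())
        )
    return [incident for incident in incidents if matches(incident)]


# Made-up sample history, for illustration only.
history = [
    Incident("INC-1", "checkout-service", "elevated 5xx rate",
             "cache miss storm", "warm cache on deploy"),
    Incident("INC-2", "payment-service", "timeouts",
             "connection pool exhaustion", "raise pool size"),
    Incident("INC-3", "checkout-service", "timeouts",
             "slow downstream dependency", "add circuit breaker"),
]

for incident in search(history, service="checkout-service"):
    print(incident.id, incident.cause, "->", incident.fix)
```

Combining filters narrows the result, for example `search(history, service="checkout-service", symptom="timeouts")` returns only the third record.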

Observability audit

Audit your alerts and instrumentation. Get actionable fixes for overlapping alerts, missing labels, and threshold inconsistencies.

can you review my alert rules?

Critical Issues Detected

Analyzed 11 alerting rules and found 3 problems: overlapping alerts, missing labels, inconsistent thresholds

Critical Issues

1. ServiceHighErrorRate / PaymentServiceCritical — Overlapping

• Both fire for payment-service errors at 5% and 15% thresholds

• Different time windows (30s vs 90s) create redundant alerts

• Currently: Both firing at 100% error rate

2. Missing Required Labels

• 3 alerts missing customer_facing

• All alerts missing runbook_url

• Prevents proper SLO tracking and incident response

3. FrontendHighErrorCount — Inconsistent Threshold

• Uses absolute count (50,000) instead of error rate percentage

Status: RED — Critical overlapping alerts for payment service and missing required labels prevent proper SLO tracking
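Two of the checks in this audit, missing labels and overlapping rules, are simple enough to sketch. This is a minimal illustration under stated assumptions, not NOFire AI's implementation: the rule dictionaries, the `REQUIRED_LABELS` policy, and the `audit_rules` helper are hypothetical, and "overlap" is approximated as two rules watching the same service and signal.

```python
from collections import defaultdict

# Hypothetical policy: labels every alert rule is expected to carry.
REQUIRED_LABELS = {"customer_facing", "runbook_url"}


def audit_rules(rules):
    """Return (missing_labels, overlaps) for a list of rule dicts."""
    missing = {}
    by_target = defaultdict(list)
    for rule in rules:
        absent = REQUIRED_LABELS - set(rule.get("labels", {}))
        if absent:
            missing[rule["name"]] = sorted(absent)
        # Rules that watch the same service and signal are overlap candidates.
        by_target[(rule["service"], rule["signal"])].append(rule["name"])
    overlaps = [sorted(names) for names in by_target.values() if len(names) > 1]
    return missing, overlaps


# Simplified stand-ins for the two payment-service rules in the audit above.
rules = [
    {"name": "ServiceHighErrorRate", "service": "payment-service",
     "signal": "error_rate", "threshold": 0.05,
     "labels": {"customer_facing": "true"}},
    {"name": "PaymentServiceCritical", "service": "payment-service",
     "signal": "error_rate", "threshold": 0.15, "labels": {}},
]

missing, overlaps = audit_rules(rules)
print("Missing labels:", missing)
print("Overlapping rules:", overlaps)
```

A real audit would also compare time windows and thresholds before calling two rules redundant; this sketch only groups candidates for review.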

Instrumentation quality

Check OpenTelemetry trace quality. Find missing resource attributes, coverage gaps, and semantic convention issues.

check my OpenTelemetry instrumentation for orders-service

Critical instrumentation issues detected

Analyzed 20 traces across 2 services. Found missing resource attributes and incomplete trace coverage

Critical Issues

1. Missing service.version in CLI traces

• Present in HTTP traces but missing in CLI

• Inconsistent resource attribute configuration

• Fix: Add to CLI instrumentation config

2. Limited CLI instrumentation coverage

• CLI: 3 spans per trace

• HTTP: 7+ spans per trace

• Missing database, external API, and middleware spans

Working Well

W3C TraceContext propagation across services

Proper error recording and status codes

All traces captured with proper relationships
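The two findings above (a resource attribute present on one entry point but not another, and uneven span counts) can be checked with a small script over exported trace summaries. The sketch below assumes a hypothetical summary format with `entry_point`, `span_count`, and `resource` keys; only the attribute names `service.name` and `service.version` come from the OpenTelemetry semantic conventions, and the sample version string is made up.

```python
# Hypothetical policy: resource attributes every trace source should set.
REQUIRED_RESOURCE_ATTRS = ("service.name", "service.version")


def missing_resource_attrs(traces):
    """Map each entry point (e.g. "cli", "http") to the attributes it lacks."""
    gaps = {}
    for trace in traces:
        absent = [a for a in REQUIRED_RESOURCE_ATTRS if a not in trace["resource"]]
        if absent:
            gaps.setdefault(trace["entry_point"], set()).update(absent)
    return {source: sorted(attrs) for source, attrs in gaps.items()}


def mean_spans(traces):
    """Average span count per entry point, a rough proxy for coverage."""
    totals = {}
    for trace in traces:
        count, total = totals.get(trace["entry_point"], (0, 0))
        totals[trace["entry_point"]] = (count + 1, total + trace["span_count"])
    return {source: total / count for source, (count, total) in totals.items()}


# Illustrative summaries mirroring the CLI vs HTTP gap described above.
traces = [
    {"entry_point": "http", "span_count": 7,
     "resource": {"service.name": "orders-service", "service.version": "1.4.2"}},
    {"entry_point": "cli", "span_count": 3,
     "resource": {"service.name": "orders-service"}},
]

print(missing_resource_attrs(traces))
print(mean_spans(traces))
```

An entry point whose mean span count is well below its siblings is a hint, not proof, that database, external API, or middleware spans are missing.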

Cost optimization (FinOps)

Analyze metric and trace usage patterns. Find cardinality issues and optimize retention policies without compromising reliability.

what about metrics cost improvements?

Analyzed 8,500 time series

53% storage reduction possible

Top Opportunities

1. traces_span_metrics_duration_milliseconds_bucket

70% reduction

• Current: 3,145 time series (highest cardinality)

• Savings: 2,200 fewer time series

• Action: Reduce histogram buckets from 26 to 8

Safe: Alerts use rate/sum calculations, not all buckets

2. container_blkio_device_usage_total

64% reduction

• Current: 628 time series (excessive granularity)

• Savings: 400 fewer time series

• Action: Aggregate to node-level only

Safe: Not in any alert rules, node-level sufficient

3. traces_service_graph retention

25% storage reduction

• Action: Reduce retention from 90 to 30 days

Safe: Topology data rarely needed beyond 30 days

All RED/USE/SLO metrics preserved
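The per-metric percentages above follow directly from the series counts, which a short calculation can confirm. The helper names below are hypothetical, and `series_after_bucket_trim` rests on a simplifying assumption: that a histogram's series count scales linearly with its bucket count.

```python
def reduction_pct(current_series, series_saved):
    """Percentage of time series removed, rounded to the nearest whole number."""
    return round(100 * series_saved / current_series)


def series_after_bucket_trim(current_series, old_buckets, new_buckets):
    """Estimate series left after trimming buckets (assumes linear scaling)."""
    return round(current_series * new_buckets / old_buckets)


# (current series, series saved), taken from the figures quoted above.
opportunities = {
    "traces_span_metrics_duration_milliseconds_bucket": (3145, 2200),
    "container_blkio_device_usage_total": (628, 400),
}

for name, (current, saved) in opportunities.items():
    print(f"{name}: {reduction_pct(current, saved)}% fewer series")

remaining = series_after_bucket_trim(3145, old_buckets=26, new_buckets=8)
print(f"Estimated series after bucket trim: {remaining}")
```

Under the linear assumption, going from 26 to 8 buckets leaves about 968 series, a saving of roughly 2,177, which is in line with the 2,200 figure quoted for the first opportunity.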

Organizational knowledge in practice.

For developers

Faster resolution by reusing prior incidents and fixes
Less dependency on a few "incident historians"
Clear context on how a service has failed before

For SRE & platform

Single graph linking changes, incidents, and services
Ability to ask: "Where have we seen this pattern before?"
Stronger preventive guardrails informed by real history

For leadership

Reduced repeat incidents across teams and quarters
Reliability capability that compounds instead of resetting
Confidence that lessons from major incidents persist

Ready to build a reliability memory that never forgets?
