Reliability Memory

Every action builds knowledge

Queryable knowledge of how your production behaves

2x faster onboarding
80% less RCA time

How organizational knowledge works

Four capabilities that build a reliability memory which compounds over time

Searchable incident history

Search by service, symptom, or cause. Reuse patterns from similar past incidents instead of starting from scratch.
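
As a rough illustration of what searching by service, symptom, or cause could look like, here is a minimal in-memory sketch. The `Incident` record, the `search` helper, and the sample incidents are all hypothetical and invented for this example; they do not describe NOFire AI's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    # Hypothetical record shape, assumed for illustration only.
    service: str
    symptom: str
    cause: str
    fix: str


def search(incidents, service=None, symptom=None, cause=None):
    """Return incidents matching every filter that was provided.

    Matching is a case-insensitive substring test; filters left as None are ignored.
    """
    def matches(value, query):
        return query is None or query.lower() in value.lower()

    return [
        incident for incident in incidents
        if matches(incident.service, service)
        and matches(incident.symptom, symptom)
        and matches(incident.cause, cause)
    ]


# Made-up sample history, not real incident data.
history = [
    Incident("checkout-service", "elevated 5xx rate", "connection pool exhaustion", "raise pool size"),
    Incident("payment-service", "high latency", "slow downstream dependency", "add timeout"),
    Incident("checkout-service", "high latency", "connection pool exhaustion", "raise pool size"),
]

pool_incidents = search(history, service="checkout", cause="pool")
```

Here `pool_incidents` holds both checkout-service records, which is the "reuse patterns from similar past incidents" idea in its simplest form.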

How can NOFire AI help you?

Find root causes faster, prevent incidents before they happen, and learn from every investigation.

What have we learned from past checkout-service incidents?

Show me fixes for similar issues in payment-service

Generate alert rules based on orders-service failures

Recommend SLOs for checkout-service based on production data

You can @ services or infrastructure by name

Observability audit

Audit your alerts and instrumentation. Get actionable fixes for overlapping alerts, missing labels, and threshold inconsistencies.

can you review my alert rules?
Critical Issues Detected
Analyzed 11 alerting rules and found 3 problems: overlapping alerts, missing labels, inconsistent thresholds

Critical Issues

1. ServiceHighErrorRate / PaymentServiceCritical — Overlapping
• Both fire for payment-service errors at 5% and 15% thresholds
• Different time windows (30s vs 90s) create redundant alerts
• Currently: Both firing at 100% error rate
2. Missing Required Labels
• 3 alerts missing customer_facing
• All alerts missing runbook_url
• Prevents proper SLO tracking and incident response
3. FrontendHighErrorCount — Inconsistent Threshold
• Uses absolute count (50,000) instead of error rate percentage
Status: RED — Critical overlapping alerts for payment service and missing required labels prevent proper SLO tracking
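
The three kinds of finding in the audit above (overlapping rules, missing required labels, absolute-count thresholds) can be sketched as simple checks over a list of rule records. This is a minimal sketch under stated assumptions: the record fields (`service`, `signal`, `threshold`, `labels`), the helper names, and the label values are hypothetical, and only the rule names and the two required label names come from the example output.

```python
from collections import defaultdict

# Required label names taken from the audit output above.
REQUIRED_LABELS = {"customer_facing", "runbook_url"}

# Hypothetical rule records; the field names and label values are assumptions.
rules = [
    {"name": "ServiceHighErrorRate", "service": "payment-service",
     "signal": "error_rate", "threshold": 0.05, "labels": {"customer_facing": "true"}},
    {"name": "PaymentServiceCritical", "service": "payment-service",
     "signal": "error_rate", "threshold": 0.15, "labels": {}},
    {"name": "FrontendHighErrorCount", "service": "frontend",
     "signal": "error_count", "threshold": 50000, "labels": {"customer_facing": "true"}},
]


def missing_labels(rules, required=REQUIRED_LABELS):
    """Map each rule name to the sorted list of required labels it lacks."""
    gaps = {}
    for rule in rules:
        missing = sorted(required - set(rule["labels"]))
        if missing:
            gaps[rule["name"]] = missing
    return gaps


def overlapping_rules(rules):
    """Group rule names that watch the same signal on the same service."""
    groups = defaultdict(list)
    for rule in rules:
        groups[(rule["service"], rule["signal"])].append(rule["name"])
    return {key: names for key, names in groups.items() if len(names) > 1}


def absolute_count_rules(rules):
    """List rules that threshold on a raw error count instead of a rate."""
    return [rule["name"] for rule in rules if rule["signal"] == "error_count"]
```

With the sample data, `overlapping_rules(rules)` groups the two payment-service rules together, `missing_labels(rules)` reports `runbook_url` for every rule, and `absolute_count_rules(rules)` flags `FrontendHighErrorCount`.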

Instrumentation quality

Check OpenTelemetry trace quality. Find missing resource attributes, coverage gaps, and semantic convention issues.

check my OpenTelemetry instrumentation for orders-service
Critical instrumentation issues detected
Analyzed 20 traces across 2 services. Found missing resource attributes and incomplete trace coverage

Critical Issues

1. Missing service.version in CLI traces
• Present in HTTP traces but missing in CLI
• Inconsistent resource attribute configuration
Fix: Add to CLI instrumentation config
2. Limited CLI instrumentation coverage
• CLI: 3 spans per trace
• HTTP: 7+ spans per trace
• Missing database, external API, and middleware spans

Working Well

W3C TraceContext propagation across services
Proper error recording and status codes
All traces captured with proper relationships
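
The two critical findings above amount to comparing resource attributes and span counts across trace sources. A minimal sketch of that comparison follows, assuming traces have already been exported into plain dictionaries. The dictionary shape, the `source` field, the helper names, and the version string are hypothetical; `service.name` and `service.version` are standard OpenTelemetry resource attribute keys.

```python
# Standard OpenTelemetry resource attribute keys to require on every trace.
REQUIRED_RESOURCE_ATTRS = ("service.name", "service.version")

# Hypothetical exported traces; shape and values are assumptions for illustration.
traces = [
    {"source": "http", "span_count": 7,
     "resource": {"service.name": "orders-service", "service.version": "1.0.0"}},
    {"source": "cli", "span_count": 3,
     "resource": {"service.name": "orders-service"}},
]


def missing_resource_attrs(traces, required=REQUIRED_RESOURCE_ATTRS):
    """Map each trace source to the set of required attributes it lacks."""
    gaps = {}
    for trace in traces:
        missing = [attr for attr in required if attr not in trace["resource"]]
        if missing:
            gaps.setdefault(trace["source"], set()).update(missing)
    return gaps


def average_spans(traces):
    """Average number of spans per trace, grouped by trace source."""
    totals = {}
    for trace in traces:
        count, total = totals.get(trace["source"], (0, 0))
        totals[trace["source"]] = (count + 1, total + trace["span_count"])
    return {source: total / count for source, (count, total) in totals.items()}
```

On the sample data, `missing_resource_attrs(traces)` reports `service.version` missing for the `cli` source only, and `average_spans(traces)` shows the CLI path producing fewer spans per trace than the HTTP path.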

Cost optimization (FinOps)

Analyze metric and trace usage patterns. Find cardinality issues and optimize retention policies without compromising reliability.

what about metrics cost improvements?
Analyzed 8,500 time series
53% storage reduction possible

Top Opportunities

1. traces_span_metrics_duration_milliseconds_bucket (70% reduction)
Current: 3,145 time series (highest cardinality)
Savings: 2,200 time series reduction
Action: Reduce histogram buckets from 26 to 8
Safe: Alerts use rate/sum calculations, not all buckets
2. container_blkio_device_usage_total (64% reduction)
Current: 628 time series (excessive granularity)
Savings: 400 time series reduction
Action: Aggregate to node-level only
Safe: Not in any alert rules, node-level sufficient
3. traces_service_graph retention (25% storage)
Action: Reduce retention from 90 to 30 days
Safe: Topology data rarely needed beyond 30 days
All RED/USE/SLO metrics preserved
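
The per-metric percentages above follow directly from the series counts, and the arithmetic can be checked with a short sketch. The helper names are hypothetical. The bucket estimate assumes the series count scales linearly with the number of histogram buckets, which is an assumption made here and not something the output states. The overall 53% storage figure depends on retention and sample volume that are not shown, so the sketch does not try to reproduce it.

```python
def reduction_pct(current_series, saved_series):
    """Share of time series removed, rounded to the nearest whole percent."""
    return round(100 * saved_series / current_series)


def series_after_bucket_change(current_series, old_buckets, new_buckets):
    """Estimate the series count after changing the histogram bucket count.

    Assumes one series per bucket per label set, so the count scales linearly.
    """
    return round(current_series * new_buckets / old_buckets)


# Figures taken from the example output above.
span_metrics_pct = reduction_pct(3145, 2200)   # 70
blkio_pct = reduction_pct(628, 400)            # 64

# Under the linear-scaling assumption, going from 26 to 8 buckets
# removes roughly the 2,200 series quoted above.
estimated_savings = 3145 - series_after_bucket_change(3145, 26, 8)  # 2177
```

The estimate of 2,177 series is close to the rounded 2,200 in the output, which suggests the quoted savings come mostly from the bucket reduction itself.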

Organizational knowledge in practice.

For developers

  • Faster resolution by reusing prior incidents and fixes
  • Less dependency on a few "incident historians"
  • Clear context on how a service has failed before

For SRE & platform

  • Single graph linking changes, incidents, and services
  • Ability to ask: "Where have we seen this pattern before?"
  • Stronger preventive guardrails informed by real history

For leadership

  • Reduced repeat incidents across teams and quarters
  • Reliability capability that compounds instead of resetting
  • Confidence that lessons from major incidents persist

Ready to build a reliability memory that never forgets?