Kubernetes readiness and liveness probe failures
Kubernetes readiness and liveness probe failures cause two distinct problems: a failing liveness probe triggers a container restart, while a failing readiness probe removes the pod from the Service endpoints so it stops receiving traffic. Both are common root causes for CrashLoopBackOff restarts, 502/503 errors, and traffic drops during rolling deploys.
Liveness vs readiness vs startup probes
These three probe types answer different questions and have different consequences when they fail.
Liveness probe: "Is this container alive?" If it fails, Kubernetes kills and restarts the container. Use it to detect deadlocks or unrecoverable application states where the process is still running but cannot make progress.
Readiness probe: "Is this container ready to serve traffic?" If it fails, the pod is removed from the Service's endpoint list and traffic stops routing to it. Use it for startup completion checks, dependency health checks, and graceful overload handling.
Startup probe: "Has this container started yet?" It delays liveness and readiness checks until startup completes. Use it for slow-starting applications where a liveness probe would kill the container before it finishes initializing.
The startup probe is the right tool when your application has a variable or long startup time. Setting a high failureThreshold on the startup probe gives the container time to start without inflating initialDelaySeconds on the liveness probe.
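To make the distinction concrete, here is a minimal sketch of an application that exposes separate liveness and readiness endpoints. The paths `/healthz` and `/ready`, the `state` flag, and the `serve` helper are illustrative assumptions, not part of any particular framework; a real service would set the flag once its own initialization is complete.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state; a real service would flip this flag
# once caches are warm, migrations are done, and so on.
state = {"ready": False}


def status_for(path: str, ready: bool) -> int:
    """Map a probe path to the HTTP status the kubelet would see."""
    if path == "/healthz":
        # Liveness: answering at all shows the process is not deadlocked.
        return 200
    if path == "/ready":
        # Readiness: 503 keeps the pod out of Service endpoints without a restart.
        return 200 if ready else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(status_for(self.path, state["ready"]))
        self.end_headers()

    def log_message(self, *args) -> None:
        pass  # keep probe traffic out of the application logs


def serve(port: int = 8080) -> None:
    """Blocking helper: serve the probe endpoints on the given port."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Keeping the liveness handler free of dependency checks means a slow database can make the pod unready without also getting it restarted.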
How to diagnose in 60 seconds
Start with kubectl describe pod:
kubectl describe pod <pod-name> -n <namespace>
Look at two sections of the output:
- Events at the bottom: look for Unhealthy, Killing, and BackOff reasons.
- Container state: check the restart count and the Last State termination reason.
Then pull recent cluster events sorted by time:
kubectl get events --sort-by='.lastTimestamp' -n <namespace>
Common event messages and what they mean:
- "Liveness probe failed: HTTP probe failed with statuscode: 404": the probe path does not exist on this container. The application may have changed its health endpoint.
- "Readiness probe failed: connection refused": the container is not yet listening on the probe port. initialDelaySeconds may be too short.
- Unhealthy with an increasing restart count: the liveness probe is failing after startup, likely due to a timeout, misconfigured path, or CPU throttling.
Root causes
1. Wrong probe path or port
The most common cause. The application moved its health endpoint from /healthz to /health but the probe spec was not updated. To verify what the container actually exposes:
kubectl exec -it <pod-name> -n <namespace> -- curl localhost:8080/health
Replace 8080 and /health with the port and path your application uses. If this returns a non-200 response or connection refused, the probe path or port in the spec is wrong.
2. Probe timeout too aggressive
timeoutSeconds defaults to 1 second. A health endpoint that queries a database connection to verify readiness may take 2-3 seconds under normal load. The probe fails even though the application is healthy. You will see Readiness probe failed: context deadline exceeded or similar.
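One way to check whether a health endpoint fits inside the probe timeout is to time a request yourself. The sketch below is an approximation under stated assumptions: `check_endpoint` is a hypothetical helper, not kubelet code, and it judges the response the way an httpGet probe does (status 200 to 399, answered before the deadline).

```python
import time
import urllib.error
import urllib.request


def check_endpoint(url: str, timeout_seconds: float = 1.0):
    """Issue one GET and return (ok, elapsed_seconds, detail).

    ok is True only for a status from 200 to 399 received before the
    timeout. urllib applies the timeout to each blocking socket
    operation, so this approximates the probe's overall deadline.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx; a probe would count these as failures too.
        return False, time.monotonic() - start, f"HTTP {err.code}"
    except OSError as err:
        # Covers timeouts, connection refused, and DNS errors.
        return False, time.monotonic() - start, f"error: {err}"
    return 200 <= status < 400, time.monotonic() - start, f"HTTP {status}"
```

Calling `check_endpoint("http://localhost:8080/health", 1.0)` several times under realistic load, for example through `kubectl port-forward`, shows how close the endpoint's latency comes to the 1-second default.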
3. initialDelaySeconds too short
The container is probed before it finishes initializing. The liveness probe fails, Kubernetes kills the container, and the cycle repeats. This is one of the primary causes of CrashLoopBackOff on first deploy or after a pod reschedule.
This is especially common with:
- Java/JVM applications (30-90 second startup times are normal)
- Applications that run database migrations on startup
- Services that wait for a sidecar to be ready before accepting connections
4. CPU throttling causing probe timeouts
If a container is at or near its CPU limit, the health endpoint may not respond within timeoutSeconds. The probe fails even though the application is functioning. The failure appears intermittent and correlates with high-traffic periods.
Check CPU usage:
kubectl top pod <pod-name> --containers -n <namespace>
If the container is consistently at its CPU limit, the probe timeouts are a symptom. The underlying cause is a CPU limit that is too low for the workload.
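kubectl top reports usage, not throttling. On nodes that use cgroup v2, the kernel's throttling counters are usually readable from inside the container at /sys/fs/cgroup/cpu.stat (for example with kubectl exec and cat, assuming the image ships cat). The sketch below parses that file's format; the helper names and the sample values are made up for illustration.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse 'key value' lines from a cgroup cpu.stat file into ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats: dict) -> float:
    """Share of CPU enforcement periods in which the container was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Made-up sample, shaped like cgroup v2 cpu.stat output.
SAMPLE = """usage_usec 5000000
user_usec 4000000
system_usec 1000000
nr_periods 1000
nr_throttled 250
throttled_usec 750000
"""
```

With the sample above, `throttled_fraction(parse_cpu_stat(SAMPLE))` is 0.25, meaning the container hit its limit in a quarter of the measured periods. A fraction that rises during the windows in which probes fail supports the throttling explanation.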
5. Dependency failure surfaced through readiness probe
The readiness probe checks a downstream dependency (database, cache, message broker) that is temporarily unavailable. The pod becomes unready and is removed from load balancing. This is the correct behavior when the dependency is genuinely down, but it can cause cascading traffic removal if many pods share the same failing dependency.
Fixes
Fix a wrong path or port
Update the probe spec in your Deployment or StatefulSet manifest:
livenessProbe:
  httpGet:
    path: /health # update to match what your app actually exposes
    port: 8080
  initialDelaySeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
Apply the change:
kubectl apply -f <your-deployment.yaml>
Fix timeout issues
Increase timeoutSeconds to give the health endpoint time to respond:
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  timeoutSeconds: 5
  failureThreshold: 3
  periodSeconds: 10
With failureThreshold: 3 and periodSeconds: 10, the pod has 30 seconds of grace for transient slow responses before being marked unready. This is appropriate for health endpoints that check downstream dependencies.
Fix initialDelaySeconds
Base initialDelaySeconds on your observed worst-case cold start time, not a guess. Measure it:
kubectl describe pod <pod-name> | grep -A2 "Started"
For Java applications, set initialDelaySeconds to at least 30-60 seconds. For a more robust approach, use a startup probe with a high failureThreshold instead of inflating initialDelaySeconds:
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30 # 30 * 10s = 5 minutes maximum startup window
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  timeoutSeconds: 5
  failureThreshold: 3
  periodSeconds: 10
The liveness probe only activates after the startup probe succeeds. The container has up to 5 minutes to start before Kubernetes considers it failed.
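The window arithmetic is simple enough to check in code before committing a value. This is a rough sketch: both function names and the 1.5x headroom factor are assumptions chosen for illustration, and the result ignores timeoutSeconds and scheduling jitter.

```python
def probe_window_seconds(failure_threshold: int, period_seconds: int,
                         initial_delay_seconds: int = 0) -> int:
    """Approximate time before a probe exhausts its failure budget."""
    return initial_delay_seconds + failure_threshold * period_seconds


def fits_startup_window(cold_start_seconds: float, failure_threshold: int,
                        period_seconds: int, headroom: float = 1.5) -> bool:
    """True if the measured cold start, padded by headroom, fits the window."""
    window = probe_window_seconds(failure_threshold, period_seconds)
    return cold_start_seconds * headroom <= window
```

For the values above, `probe_window_seconds(30, 10)` gives 300 seconds, so a measured 90-second cold start fits comfortably, while a 45-second cold start would not fit a window built from failureThreshold: 3 and periodSeconds: 10.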
Fix CPU throttling
If kubectl top pod --containers shows a container consistently at its CPU limit, you have two options:
- Increase the CPU limit in the container's resource spec.
- Reduce probe frequency: increase periodSeconds so probes fire less often, reducing the probe's CPU contribution.
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 30 # probe every 30s instead of the default 10s
  timeoutSeconds: 5
Prevention
Base initialDelaySeconds on measured startup time. The single most common mistake is setting initialDelaySeconds: 10 on an application that takes 45 seconds to start. Measure the actual startup time during development and set the value accordingly.
Separate liveness and readiness probes. They answer different questions and should have different thresholds. A readiness probe can be more sensitive (fail on slow dependency response) because it only stops traffic, not restarts. A liveness probe should be conservative (fail only on true deadlocks) because failure triggers a restart.
Test probes before you deploy. Run the container locally and curl the probe endpoint:
docker run --rm -p 8080:8080 <your-image> &
curl -s -o /dev/null -w "%{http_code}" localhost:8080/health
If it does not return a status code from 200 to 399, the probe will fail in the cluster.
Alert on restart count, not just CrashLoopBackOff. CrashLoopBackOff means Kubernetes has already backed off after repeated failures. A restart count above 3 in a rolling window is an early signal worth alerting on:
- alert: PodRestartingFrequently
  expr: |
    increase(kube_pod_container_status_restarts_total[1h]) > 3
  for: 5m
  labels:
    severity: warning
For automated root-cause analysis that correlates probe failures with upstream signals, deployment changes, and resource constraints, see the AI SRE Benchmark to understand how NOFire AI approaches signal-to-root-cause accuracy across complex failure chains.
Related debugging guides
Probe failures connect to several other common failure modes:
- Debugging 502 and 503 errors — readiness probe failures are the most common cause of 502 errors in Kubernetes
- How to fix OOMKilled — CPU throttling and OOM share the same resource-constraint root cause
- CrashLoopBackOff: more than just a bad deployment — probe failures are the leading cause of CrashLoopBackOff
Frequently asked questions
- What is the difference between a liveness probe and a readiness probe?
- Liveness: if it fails, Kubernetes kills and restarts the container. Readiness: if it fails, Kubernetes stops sending traffic to the pod. A pod can be alive but not ready.
- Why does my pod keep restarting with a liveness probe failure?
- The most likely causes are a probe path that changed after a deploy, an initialDelaySeconds value that is too short for the container's startup time, or the container being CPU-throttled so the health endpoint cannot respond in time. Check kubectl describe pod for the exact failure message.
- Can a readiness probe failure cause 502 errors?
- Yes. When all pods in a Service fail their readiness probe, the Service has zero endpoints. Requests reach the load balancer but are not forwarded, returning 502 or 503.