Liveness probes sharing the CPU-bound thread pool keep killing your pods
Here's a fictional story:
An SRE at a B2B SaaS company watched the same payment service get killed every ninety seconds for two days. Liveness probes timing out. CPU pinned at 100 percent processing a backlog. Kubernetes did exactly what it was told: restart the pod, which dropped the in-flight work, which made the backlog worse. Classic restart loop dressed up as a health problem.
The team tuned timeoutSeconds up. Then failureThreshold. Then initialDelaySeconds. Probe still failed. Because the probe was riding the same CPU-bound worker pool as the real traffic. Under load, it always loses.
Here's the pattern I see people miss.
- Probes are not free. A liveness probe is just an HTTP call your app has to answer. If the handler shares the thread pool, goroutine pool, or event loop with the work that's saturating CPU, the probe gets queued behind real requests. Bumping the timeout doesn't fix queueing; it just delays the inevitable kill.
- Split the probe onto a dedicated lightweight handler. Different port, different listener, different executor. In Go that means a second `http.Server` on its own goroutine (a minimal sketch follows the probe config below). In Java that means a separate Jetty connector with its own thread pool. In Node, a worker thread or at minimum a handler that does zero awaitable work. The probe should answer in single-digit milliseconds even when the main app is buried.
```yaml
# main API on 8080, liveness on 8081 with its own server
livenessProbe:
  httpGet:
    path: /healthz
    port: 8081
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
```
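Here's roughly what the Go side of that looks like: a minimal sketch, assuming the main API lives on 8080 and the probe gets its own server on 8081. Handler names and ports are illustrative, not prescriptive.

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Main API mux: shares CPU with the real work and gets buried under load.
	api := http.NewServeMux()
	// api.HandleFunc("/charge", chargeHandler) // real handlers registered here

	// Dedicated liveness mux: one trivial handler, nothing else.
	health := http.NewServeMux()
	health.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // no dependencies, no shared queue
	})

	// Second http.Server on its own port, started on its own goroutine, so the
	// probe never waits behind the main listener.
	healthSrv := &http.Server{Addr: ":8081", Handler: health}
	go func() {
		log.Fatal(healthSrv.ListenAndServe())
	}()

	// Main server on 8080.
	mainSrv := &http.Server{Addr: ":8080", Handler: api}
	log.Fatal(mainSrv.ListenAndServe())
}
```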
- Liveness and readiness do different jobs. Liveness answers "is this process wedged, restart me." Readiness answers "should I receive traffic right now." A pod chewing through a backlog is alive but not ready. Mark it `NotReady` so the Service stops sending new requests, but keep liveness green so Kubernetes doesn't murder it mid-batch. Two probes, two endpoints, two questions.
- Liveness should test the process, not its dependencies. If `/healthz` calls Redis, Postgres, and three downstream APIs, you've built a network outage detector that uses pod restarts as its alert mechanism. When the database hiccups, every replica fails liveness at once and the cluster restarts your entire fleet. The liveness handler should return 200 if the process can serve a trivial request. That's it.
```go
// liveness: dumb and fast
mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// readiness: actual dependency checks
mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
	// db.Ping returns an error for database/sql; any error means not ready
	if db.Ping() != nil || queueDepth() > maxBacklog {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
})
```
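The readiness side then gets its own probe stanza. The path, port, and thresholds below are illustrative, assuming `/ready` is served by the main app listener on 8080 as in the handler above:

```yaml
# readiness: hits the main listener, where the dependency checks live
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3
```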
- Watch for restart loops in your metrics, not your logs. `kube_pod_container_status_restarts_total` going up while `container_cpu_cfs_throttled_seconds_total` is also climbing is the signature of this exact bug. If you see both, your probe is competing with your work for CPU and losing. Don't tune timeouts. Move the probe.
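If you'd rather page on that signature than eyeball dashboards, here's a sketch of a Prometheus alerting rule. The alert name, thresholds, and windows are placeholders, not recommendations:

```yaml
groups:
  - name: probe-contention
    rules:
      - alert: ProbeCompetingWithWorkload
        # restarts climbing while the same pod is also being CPU-throttled
        expr: |
          increase(kube_pod_container_status_restarts_total[30m]) > 2
          and on (namespace, pod)
          sum by (namespace, pod) (rate(container_cpu_cfs_throttled_seconds_total[30m])) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Probe likely competing with the workload for CPU; move it to a dedicated listener"
```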
Done right: liveness is a separate listener on a separate port doing nothing but returning 200. Readiness is where the dependency logic lives. Restarts become rare events that mean something is actually broken, not a Tuesday afternoon.
Curious whether anyone has war stories from the "we made liveness check the database" era.