Kubernetes Probes in Depth: Liveness, Readiness, and Startup
A container can be running while your service is broken, still booting, or wedged in a way that only a restart fixes. Kubernetes does not guess—it asks on a schedule via probes. Liveness decides whether to restart a container. Readiness decides whether the Pod should receive traffic. Startup protects slow-boot apps from premature liveness kills. Misconfigure any of the three and you get mysterious restarts, black holes in load balancers, or deployments that never finish.
In short
Use readiness for “can this instance serve requests?” (dependencies up, migrations done). Use liveness only for “is this process irrecoverably stuck?” (deadlock, infinite loop)—not for dependency outages. Use startup when boot takes longer than your liveness budget. Point all probes at cheap, app-specific health endpoints; tune periodSeconds, timeoutSeconds, and failureThreshold from real startup and failure data, not chart defaults.
Why probes exist
Before orchestrators, operators SSH’d in and ran curl localhost/health. At scale that does not work. Kubernetes embeds health checking into the kubelet on every node: on a timer it runs an HTTP request, TCP dial, exec command, or gRPC health check against each container and records success or failure.
Those results drive control-plane behavior:
- The kubelet restarts containers when liveness fails (subject to restart policy).
- The kubelet sets Pod.status.conditions[Ready] from readiness (and startup, while active).
- Endpoints / EndpointSlice controllers only include Ready Pods in Service backends, so Ingress and ClusterIP traffic skip broken instances.
- Deployment rollouts wait for new Pods to become Ready before scaling down old ones (with maxUnavailable / maxSurge and optional minReadySeconds).
Probes are not monitoring—they are control signals. For dashboards and paging, use Prometheus, OpenTelemetry, or your APM. Probes should be fast, local, and conservative enough that a blip does not drain a node.
For cluster anatomy, see Kubernetes architecture in simple terms. For when probes fail in the wild, see K8s troubleshooting playbook.
The three probe types compared
| Probe | Question it answers | On failure | Affects Service endpoints? |
|---|---|---|---|
| Readiness | Should this Pod receive traffic right now? | Pod marked Not Ready; removed from Endpoints | Yes |
| Liveness | Is this container so broken that restart is the fix? | Container killed and restarted (per restartPolicy) | Indirectly (restart may drop Ready briefly) |
| Startup | Has the app finished booting yet? | Liveness and readiness are held off until it succeeds; container killed and restarted once failureThreshold is exhausted | Readiness stays false until startup succeeds |
Readiness is about routing. Liveness is about process recovery. Startup is about boot time. Treating readiness like liveness—restarting because Postgres is down—is one of the most expensive mistakes teams make.
How the kubelet runs a probe
For each configured probe, the kubelet runs a loop:
- Wait initialDelaySeconds after container start (first probe only).
- Execute the probe handler (HTTP GET, TCP socket open, exec in container, or gRPC).
- If the handler completes within timeoutSeconds and reports success, increment the success streak; else increment the failure streak.
- After successThreshold consecutive successes, mark the probe Success. After failureThreshold consecutive failures, mark Failure and apply the probe-specific action.
- Sleep periodSeconds and repeat.
Defaults (if omitted): initialDelaySeconds: 0, periodSeconds: 10, timeoutSeconds: 1, successThreshold: 1, failureThreshold: 3. That means three consecutive failures, spread over roughly 20–30 seconds, can restart a container or drop it from load balancing, which is faster than many people expect.
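Writing the defaults out in full makes the timing budget easier to reason about. The fragment below is only a sketch: the /healthz path and port 8080 are placeholders borrowed from the other examples in this article, and the arithmetic in the comments is approximate.

```yaml
# Sketch: the default timing fields written out explicitly.
# Path and port are placeholders, not requirements.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 0   # default: start probing right after container start
  periodSeconds: 10        # default: one probe every 10s
  timeoutSeconds: 1        # default: a probe slower than 1s counts as a failure
  successThreshold: 1      # default: a single success marks the probe Success
  failureThreshold: 3      # default: third consecutive failure triggers the action
# Rough time from "process wedges" to the third failed probe:
#   wedge just before a probe: about 2 * periodSeconds = 20s
#   wedge just after a probe:  about 3 * periodSeconds = 30s
```

If that window is too tight or too loose for your service, change periodSeconds and failureThreshold together so the product still reflects how long a real failure should be tolerated.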
Probe handlers: HTTP, TCP, exec, gRPC
HTTP GET (most common)
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
    scheme: HTTP
    httpHeaders:
      - name: X-Health-Token
        # Literal value in the manifest; probe headers have no valueFrom.
        value: from-secret-via-env
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
- Success: HTTP status 200–399 (inclusive). 4xx/5xx or connection errors count as failure.
- Use a dedicated path (/healthz, /readyz), not your homepage with auth redirects.
- host defaults to Pod IP; rarely set to localhost unless you know why.
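One httpGet detail that is easy to miss: port accepts the name of a container port as well as a number, which keeps the number in a single place. A minimal sketch of a container-level fragment, assuming a hypothetical port named health on 8081:

```yaml
# Sketch: probe a named port instead of repeating the number.
# The port name "health" and number 8081 are hypothetical.
ports:
  - name: health
    containerPort: 8081
readinessProbe:
  httpGet:
    path: /readyz
    port: health   # resolved against ports[].name on this container
    scheme: HTTP
  periodSeconds: 10
  timeoutSeconds: 2
```

Both keys sit at the same level inside one entry of spec.containers.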
TCP socket
livenessProbe:
  tcpSocket:
    port: 5432
Opens a TCP connection to the port. Success means “something is listening,” not “Postgres accepts queries.” Fine for brokers with no HTTP admin port; weak for apps where the port is open but the process is wedged.
Exec
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - test -f /tmp/ready
Runs a command inside the container's namespaces. Exit code 0 = success. Powerful but easy to make slow or flaky; every probe run spawns a new process in the container through the container runtime.
gRPC (Kubernetes 1.24+)
livenessProbe:
  grpc:
    port: 50051
    service: my.ServiceName
Uses gRPC health checking protocol. Prefer this over TCP when your server already exposes grpc.health.v1.
Readiness probe in depth
Purpose: Signal that this replica can safely take traffic now. When readiness fails, the Pod stays running but Ready=False and is removed from Service Endpoints—classic load-shedding without restart.
Good readiness checks:
- HTTP handler that verifies critical dependencies (DB connection pool warm, cache reachable) with short timeouts.
- Separate from liveness: readiness may fail when the DB is down; liveness should not restart the app for that.
- Return 503 when not ready so load balancers and humans agree on semantics (kubelet still only cares about status code range).
Effects downstream:
- kubectl get pods shows 0/1 Ready.
- kubectl get endpoints omits the Pod IP.
- Deployment rollout stalls until enough replicas pass readiness (or hits progress deadline).
- HPA counts Pods; Not Ready replicas still consume resources—readiness does not scale you down.
# Readiness: traffic gate
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
  successThreshold: 1
Liveness probe in depth
Purpose: Detect when the main process is alive but not progressing: a deadlock, a goroutine leak that never exits, a JVM stuck in GC beyond recovery. The kubelet stops the container (SIGTERM, then SIGKILL once the grace period expires) and restarts it per restartPolicy.
Good liveness checks:
- Minimal: “Is the event loop responding?” e.g. /livez that does not call the database.
- Slightly more than TCP open: prove the HTTP server thread pool answers.
Bad liveness checks (restart storms):
- Same endpoint as readiness, including DB ping—outage restarts all replicas instead of draining them.
- Heavy work on each probe (full table scan)—probe latency triggers kills under load.
- Checking downstream SaaS—external blip reboots your entire fleet.
# Liveness: restart only when process is stuck
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 15
  timeoutSeconds: 2
  failureThreshold: 3
With default thresholds, three failed liveness probes over ~30 seconds trigger restart. Tune from production metrics, not from “it worked on minikube.”
Startup probe in depth
Introduced in Kubernetes 1.16 (stable since 1.20) for slow-start JVMs, large model loads, and migration-on-boot apps. While startup has not succeeded:
- Liveness and readiness probes are disabled.
- Only the startup probe runs.
- After startup succeeds once, startup is disabled for the life of that container; liveness and readiness take over.
If startup fails failureThreshold times, the container is killed and restarted—same as liveness failure.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30 # 30 * 10s = up to ~5 min boot window
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
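Pattern: Use startup with a generous failureThreshold × periodSeconds budget instead of inflating livenessProbe.initialDelaySeconds to 300. A large initial delay applies after every container start, so a process that wedges early goes unprotected for the full delay even when the app normally boots in seconds; a startup probe hands over to liveness as soon as it succeeds.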
Timing fields reference
| Field | Meaning | Default |
|---|---|---|
| initialDelaySeconds | Wait after container start before first probe | 0 |
| periodSeconds | Interval between probes | 10 |
| timeoutSeconds | Probe must complete within this time | 1 |
| successThreshold | Consecutive successes to flip to Success | 1 |
| failureThreshold | Consecutive failures to flip to Failure | 3 |
Readiness-only nuance: successThreshold can be > 1 on readiness (e.g. require 2 successes before Ready) to avoid flapping endpoints on bursty apps; this is rare but valid. For liveness and startup it must be 1.
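As an illustration of the readiness nuance, the fragment below asks for two consecutive successes before the Pod is marked Ready again. It is a sketch; the path, port, and values are placeholders rather than recommendations.

```yaml
# Sketch: require two consecutive successes before the Pod returns to Ready.
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  successThreshold: 2   # values above 1 are only valid on readiness
  failureThreshold: 2
# Once the handler recovers, the Pod rejoins the Service after two
# passing probes, so roughly 5 to 10 seconds later with these values.
```

The cost is slower recovery after every failure, so this is worth it only when endpoint flapping is an observed problem.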
Termination: Failed liveness respects terminationGracePeriodSeconds on the Pod spec before SIGKILL. Align probe timeouts with graceful shutdown hooks so in-flight requests drain.
Lifecycle: from Pod scheduled to receiving traffic
Container created
→ (optional) startupProbe runs until Success or kill
→ readinessProbe runs → Pod Ready=True → Endpoints include IP
→ livenessProbe runs in parallel with readiness
→ liveness Failure → container restart → readiness drops until boot completes again
During rolling updates, new Pods must pass readiness before old Pods terminate (default RollingUpdate strategy). Combine with minReadySeconds on the Deployment so a Pod must stay Ready for N seconds before counting as available—catches “ready for one probe then crash” races.
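The Deployment fields that interact with readiness during a rollout can be sketched as follows. The values are illustrative assumptions, not recommendations.

```yaml
# Sketch: Deployment fields that depend on readiness during a rollout.
spec:
  minReadySeconds: 10            # Pod must stay Ready for 10s before it counts as available
  progressDeadlineSeconds: 600   # rollout is reported as stalled after 10 minutes without progress
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # at most one extra Pod above the desired replica count
      maxUnavailable: 0   # keep every old Pod until a replacement is available
```

With maxUnavailable: 0, a readiness probe that never passes stalls the rollout instead of taking capacity away, which is usually the safer failure mode.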
Separate /livez, /readyz, and /startup semantics in your app
Google’s health check pattern maps cleanly to Kubernetes:
- /livez — process up; cheap; no external deps (or optional “degraded but alive”).
- /readyz — OK to serve user traffic; check DB, queue, feature flags.
- /healthz — often used for startup or aggregate health in docs; pick one convention per team and document it.
Spring Boot Actuator: /actuator/health/liveness and /actuator/health/readiness. ASP.NET Core: MapHealthChecks with separate predicates. Go: wrap net/http mux with distinct handlers.
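For the Spring Boot case, the two Actuator paths map directly onto separate probes. A minimal sketch, assuming Actuator is served on port 8080 and the liveness and readiness health groups are enabled in the application:

```yaml
# Sketch: Spring Boot Actuator health groups wired to separate probes.
# Port 8080 is an assumption; use the port that serves Actuator in your app.
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 15
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
```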
Multi-container Pods and initContainers
- Each container in a Pod may define its own probes. Sidecars (Envoy, log shippers) need their own liveness—do not only probe the main app container.
- initContainers run to completion before app containers start; they do not support probes in the same way. Run migrations in init, then readiness verifies “schema at expected version.”
- Pod-level Ready requires all containers’ readiness (and no startup in progress) to succeed.
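A two-container Pod with per-container probes might look like the sketch below. The image names, the sidecar's port 9901, and its /ready path are hypothetical; substitute whatever your sidecar actually exposes.

```yaml
# Sketch: each container carries its own probes.
# Images, ports, and the sidecar's /ready path are hypothetical.
spec:
  containers:
    - name: api
      image: your-registry/api:1.2.0
      readinessProbe:
        httpGet:
          path: /readyz
          port: 8080
      livenessProbe:
        httpGet:
          path: /livez
          port: 8080
    - name: proxy
      image: your-registry/proxy:1.0.0
      readinessProbe:
        httpGet:
          path: /ready
          port: 9901
      livenessProbe:
        tcpSocket:
          port: 9901
# The Pod is Ready only when both containers pass readiness.
```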
Jobs, CronJobs, and probes
Short-lived Job Pods often omit probes—the Pod exits when work finishes. Long-running workers in Deployments need readiness so Services do not send traffic to starting replicas. For CronJobs, probes matter only if the job Pod exposes a Service (unusual).
Docker HEALTHCHECK vs Kubernetes probes
HEALTHCHECK in a Dockerfile is interpreted by Docker Engine on single-host runs. Kubernetes ignores Dockerfile HEALTHCHECK for workload health—it uses manifest probes. Keep Dockerfile HEALTHCHECK for local docker run if useful, but always duplicate intent in Pod spec for cluster deploys. See Docker — the hidden side.
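Duplicating a HEALTHCHECK in the Pod spec is mostly a matter of mapping its flags onto probe fields. The sketch below translates a hypothetical Dockerfile line, shown in the comments; whether the result belongs under livenessProbe or readinessProbe depends on what the check actually verifies.

```yaml
# Sketch: translating a Dockerfile HEALTHCHECK into a probe.
# Hypothetical Dockerfile line being translated:
#   HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
#     CMD curl -f http://localhost:8080/healthz || exit 1
livenessProbe:
  httpGet:                  # replaces the curl command; no curl needed in the image
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # roughly --start-period
  periodSeconds: 30         # --interval
  timeoutSeconds: 3         # --timeout
  failureThreshold: 3       # --retries
```

The mapping is approximate: Docker only marks a container unhealthy, while a failed liveness probe restarts it.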
Production checklist
- Readiness checks dependencies; liveness checks only process health.
- Startup probe for any container that boots in > ~30s.
- Probe endpoints are unauthenticated or use a cheap token—never redirect to OAuth login.
- timeoutSeconds < periodSeconds; probe handler completes in milliseconds under normal load.
- Document expected boot time; set failureThreshold × periodSeconds above p99 startup.
- Test failure modes in staging: stop DB, verify readiness fails and liveness does not restart.
- Align with preStop hook and terminationGracePeriodSeconds for zero-downtime deploys.
- Review Helm chart defaults: many charts ship HTTP probes on / that always return 200 from nginx while your app is down.
Anti-patterns
| Anti-pattern | Symptom | Fix |
|---|---|---|
| One probe for everything | DB outage → CrashLoopBackOff fleet-wide | Split livez vs readyz; readiness only on deps |
| Liveness on initialDelaySeconds: 300 only | Stuck process not restarted for 5 minutes | startupProbe for boot; short liveness period after |
| Readiness always passing | 502s from Service during deploy | Implement real readyz; verify Endpoints during rollout |
| Probe hits admin port requiring mTLS | Random Not Ready | Dedicated plain HTTP health port on loopback or separate listener |
| Exec probe calling curl to internet | Slow probes; flaky under DNS issues | In-process HTTP handler or local TCP |
Debugging probe failures
# Events and probe config
kubectl describe pod <name> -n <ns>
# Look for: Liveness probe failed, Readiness probe failed, Unhealthy

# Test the same path from inside the Pod
kubectl exec -it <name> -n <ns> -c <container> -- \
  wget -qO- http://127.0.0.1:8080/readyz

# Previous crash logs after liveness restart
kubectl logs <name> -n <ns> -c <container> --previous

# Is the Pod in Endpoints?
kubectl get endpoints <service> -n <ns> -o yaml
Common event strings:
- “Liveness probe failed: HTTP probe failed with statuscode: 500”: fix the app or point liveness to a lighter path.
- “Readiness probe failed: dial tcp ... connection refused”: app not listening yet; add startup or increase budget.
- “context deadline exceeded”: probe slower than timeoutSeconds; optimize the handler or raise the timeout slightly.
More scenarios: K8s troubleshooting playbook and hands-on Part 5 — debug.
Complete Deployment example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: learn-dev
spec:
  replicas: 3
  minReadySeconds: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: api
        app.kubernetes.io/version: "1.2.0"
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: api
          image: your-registry/api:1.2.0
          ports:
            - containerPort: 8080
          startupProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
            failureThreshold: 24
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            httpGet:
              path: /livez
              port: 8080
            periodSeconds: 15
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
Hands-on lab (local cluster)
- Deploy nginx without probes; use kubectl exec to kill a worker process. The Service still sends traffic until you notice manually.
- Add readiness on /; scale to 2; use kubectl exec to break one Pod, then watch endpoints drop an IP.
- Add liveness with initialDelaySeconds: 0 and the wrong port; observe CrashLoopBackOff.
- Replace with startupProbe (long failureThreshold) + liveness on the correct port; simulate a slow start with a sleep 60 entrypoint and compare outcomes.
Lab prerequisites: Part 1 — local lab, Part 3 — first workloads, Part 4 — day-one practices.
Further reading on this site
- Kubernetes architecture — kubelet and Pod lifecycle
- Day-one best practices — when to add probes before calling a Deployment production-ready
- Troubleshooting playbook — CrashLoopBackOff and Not Ready playbooks
- CRI and CSI — probes run in the kubelet/CRI path, not storage
- GitOps principles — probe changes reviewed like any manifest
- Incident response — restart storms during outages