Platform & Kubernetes · 22 May 2026 · Guide · By Babulal Tamang

Kubernetes Probes in Depth: Liveness, Readiness, and Startup

A container can be running while your service is broken, still booting, or wedged in a way that only a restart fixes. Kubernetes does not guess—it asks on a schedule via probes. Liveness decides whether to restart a container. Readiness decides whether the Pod should receive traffic. Startup protects slow-boot apps from premature liveness kills. Misconfigure any of the three and you get mysterious restarts, black holes in load balancers, or deployments that never finish.

In short

Use readiness for “can this instance serve requests?” (dependencies up, migrations done). Use liveness only for “is this process irrecoverably stuck?” (deadlock, infinite loop)—not for dependency outages. Use startup when boot takes longer than your liveness budget. Point all probes at cheap, app-specific health endpoints; tune periodSeconds, timeoutSeconds, and failureThreshold from real startup and failure data, not chart defaults.

Why probes exist

Before orchestrators, operators SSH’d in and ran curl localhost/health. At scale that does not work. Kubernetes embeds health checking into the kubelet on every node: on a timer it runs an HTTP request, TCP dial, exec command, or gRPC health check against each container and records success or failure.

Those results drive both node-level and control-plane behavior:

The kubelet restarts containers when liveness fails (subject to restart policy).
The kubelet sets Pod.status.conditions[Ready] from readiness (and startup, while active).
Endpoints / EndpointSlice controllers only include Ready Pods in Service backends—so Ingress and ClusterIP traffic skip broken instances.
Deployment rollouts wait for new Pods to become Ready before scaling down old ones (with maxUnavailable / maxSurge and optional minReadySeconds).

Probes are not monitoring—they are control signals. For dashboards and paging, use Prometheus, OpenTelemetry, or your APM. Probes should be fast, local, and conservative enough that a blip does not drain a node.

For cluster anatomy, see Kubernetes architecture in simple terms. For when probes fail in the wild, see K8s troubleshooting playbook.

The three probe types compared

Probe	Question it answers	On failure	Affects Service endpoints?
Readiness	Should this Pod receive traffic right now?	Pod marked Not Ready; removed from Endpoints	Yes
Liveness	Is this container so broken that restart is the fix?	Container killed and restarted (per restartPolicy)	Indirectly (restart may drop Ready briefly)
Startup	Has the app finished booting yet?	Blocks liveness/readiness checks until success or cap	Readiness stays false until startup succeeds

Readiness is about routing. Liveness is about process recovery. Startup is about boot time. Handling a readiness problem with liveness (restarting because Postgres is down) is one of the most expensive mistakes teams make.

How the kubelet runs a probe

For each configured probe, the kubelet runs a loop:

Wait initialDelaySeconds after container start (first probe only).
Execute the probe handler (HTTP GET, TCP socket open, exec in container, or gRPC).
If the handler completes within timeoutSeconds and reports success, increment success streak; else increment failure streak.
After successThreshold consecutive successes, mark probe Success. After failureThreshold consecutive failures, mark Failure and apply the probe-specific action.
Sleep periodSeconds and repeat.
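
The threshold bookkeeping in that loop can be modelled in a few lines. This is a simplified illustration, not the kubelet's implementation: the ProbeState class and its observe method are hypothetical names, and timing (initialDelaySeconds, periodSeconds, timeoutSeconds) is left out so only the consecutive-count logic shows.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    """Simplified model of one probe's result tracking (hypothetical, not kubelet code)."""
    success_threshold: int = 1
    failure_threshold: int = 3
    successes: int = 0   # current run of consecutive successes
    failures: int = 0    # current run of consecutive failures
    status: str = "Unknown"

    def observe(self, ok: bool) -> str:
        """Record one probe result and return the resulting status."""
        if ok:
            self.successes += 1
            self.failures = 0
            if self.successes >= self.success_threshold:
                self.status = "Success"
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= self.failure_threshold:
                self.status = "Failure"
        return self.status


if __name__ == "__main__":
    probe = ProbeState()
    for result in [True, False, False, True, False, False, False]:
        print(result, "->", probe.observe(result))
```

In this model a single success between failures resets the failure streak, which is why a handler that fails only intermittently can stay below failureThreshold indefinitely.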

Defaults (if omitted): initialDelaySeconds: 0, periodSeconds: 10, timeoutSeconds: 1, successThreshold: 1, failureThreshold: 3. With those values, three consecutive failures spanning roughly 20–30 seconds are enough to restart a container or drop it from load balancing, which is faster than many people expect.

Probe handlers: HTTP, TCP, exec, gRPC

HTTP GET (most common)

readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
    scheme: HTTP
    httpHeaders:
      - name: X-Health-Token
        value: example-token   # literal string: probe fields do not expand env vars or Secret references
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3

Success: HTTP status 200–399 (inclusive). 4xx/5xx or connection errors count as failure.
Use a dedicated path—/healthz, /readyz—not your homepage with auth redirects.
host defaults to the Pod IP; leave it unset unless you have a specific reason to probe localhost or another address.

TCP socket

livenessProbe:
  tcpSocket:
    port: 5432

Opens a TCP connection to the port. Success means “something is listening,” not “Postgres accepts queries.” Fine for brokers with no HTTP admin port; weak for apps where the port is open but the process is wedged.

Exec

readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - test -f /tmp/ready

Runs a command inside the container. Exit code 0 means success; any other exit code is a failure. Powerful but easy to make slow or flaky: every probe run spawns a new process in the container through the container runtime, which adds overhead at short periods.

gRPC (Kubernetes 1.24+)

livenessProbe:
  grpc:
    port: 50051
    service: my.ServiceName

Uses the gRPC Health Checking Protocol. Prefer this over TCP when your server already implements the grpc.health.v1.Health service.

Readiness probe in depth

Purpose: Signal that this replica can safely take traffic now. When readiness fails, the Pod stays running but Ready=False and is removed from Service Endpoints—classic load-shedding without restart.

Good readiness checks:

HTTP handler that verifies critical dependencies (DB connection pool warm, cache reachable) with short timeouts.
Separate from liveness: readiness may fail when the DB is down; liveness should not restart the app for that.
Return 503 when not ready so load balancers and humans agree on semantics (kubelet still only cares about status code range).
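
As a rough sketch of such a handler's core logic, the function below aggregates dependency checks into a 200 or 503. The name evaluate_readiness, the checks dictionary, and the 0.5 second budget are all assumptions made for illustration; each check is expected to enforce its own short timeout.

```python
import time
from typing import Callable, Dict, Tuple


def evaluate_readiness(
    checks: Dict[str, Callable[[], bool]],
    budget_seconds: float = 0.5,
) -> Tuple[int, Dict[str, str]]:
    """Run dependency checks and map the outcome to an HTTP status.

    Each check must enforce its own short timeout; this function only
    stops starting new checks once the overall budget is spent.
    """
    deadline = time.monotonic() + budget_seconds
    results: Dict[str, str] = {}
    for name, check in checks.items():
        if time.monotonic() > deadline:
            results[name] = "skipped: budget exhausted"
            continue
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check means "not ready"
            results[name] = f"error: {exc}"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

A /readyz handler could return the status as the response code and the per-check detail as the body, so a human running curl sees which dependency is holding the Pod out of rotation.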

Effects downstream:

kubectl get pods shows 0/1 Ready.
kubectl get endpoints omits the Pod IP.
Deployment rollout stalls until enough replicas pass readiness (or hits progress deadline).
Readiness does not scale you down: Not Ready replicas keep running and consuming resources, and neither the Deployment nor the HPA removes them for failing readiness.

# Readiness: traffic gate
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
  successThreshold: 1

Liveness probe in depth

Purpose: Detect when the main process is alive but not making progress: a deadlock, a goroutine leak that exhausts resources without crashing the process, a JVM stuck in garbage collection beyond recovery. When liveness fails, the kubelet stops the container (SIGTERM first, then SIGKILL once the termination grace period expires) and restarts it per restartPolicy.

Good liveness checks:

Minimal: “Is the event loop responding?” e.g. /livez that does not call the database.
Slightly more than TCP open—prove the HTTP server thread pool answers.
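
One way to make /livez reflect "is the process still making progress" without calling any dependency is a heartbeat that the main loop refreshes. The Heartbeat class below is a hypothetical helper, and the 30 second maximum age is an arbitrary assumption; this is a sketch of the idea rather than a recommended implementation.

```python
import time
from typing import Callable


class Heartbeat:
    """Tracks when the main loop last made progress (hypothetical helper)."""

    def __init__(self, max_age_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._max_age = max_age_seconds
        self._last_beat = clock()

    def beat(self) -> None:
        """Call from the main loop after each unit of work (or idle tick)."""
        self._last_beat = self._clock()

    def livez_status(self) -> int:
        """HTTP status a /livez handler could return: 200 if fresh, 500 if stale."""
        age = self._clock() - self._last_beat
        return 200 if age <= self._max_age else 500
```

The loop has to call beat() while idle as well, otherwise a quiet but healthy worker looks stuck and gets restarted.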

Bad liveness checks (restart storms):

Same endpoint as readiness, including DB ping—outage restarts all replicas instead of draining them.
Heavy work on each probe (full table scan)—probe latency triggers kills under load.
Checking downstream SaaS—external blip reboots your entire fleet.

# Liveness: restart only when process is stuck
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 15
  timeoutSeconds: 2
  failureThreshold: 3

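With the default settings (periodSeconds: 10, failureThreshold: 3), three consecutive failed liveness probes over roughly 20–30 seconds trigger a restart; the example above, with periodSeconds: 15, stretches that to roughly 30–45 seconds. Tune from production metrics, not from “it worked on minikube.”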
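
The arithmetic behind such figures can be sketched as follows. The function name detection_window is made up for this example, and the model is deliberately simple: probes fire exactly every periodSeconds, the fault is permanent, and a failing probe takes between zero and timeoutSeconds.

```python
def detection_window(period_seconds: int, timeout_seconds: int,
                     failure_threshold: int) -> tuple:
    """Return (best_case, worst_case) seconds from fault to probe Failure."""
    # Best case: the fault lands just before a probe and every failure is instant.
    best = period_seconds * (failure_threshold - 1)
    # Worst case: the fault lands just after a probe succeeded, and the last
    # failing probe only reports once its timeout expires.
    worst = period_seconds * failure_threshold + timeout_seconds
    return best, worst


if __name__ == "__main__":
    print(detection_window(10, 1, 3))  # Kubernetes defaults
    print(detection_window(15, 2, 3))  # the /livez example above
```

Under these assumptions the defaults give a window of 20 to 31 seconds, and the /livez example with periodSeconds: 15 and timeoutSeconds: 2 gives 30 to 47 seconds.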

Startup probe in depth

Added in Kubernetes 1.16 (stable since 1.20) to solve slow-start JVMs, large model loads, and migration-on-boot apps. While startup has not succeeded:

Liveness and readiness probes are disabled.
Only the startup probe runs.
After startup succeeds once, startup is disabled for the life of that container; liveness and readiness take over.

If startup fails failureThreshold times, the container is killed and restarted—same as liveness failure.

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 30 * 10s = up to ~5 min boot window
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 3

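Pattern: Use startup with a generous failureThreshold × periodSeconds budget instead of inflating livenessProbe.initialDelaySeconds to 300. A large initial delay applies to every container start, so a container that wedges early goes undetected for the full five minutes even when the app normally boots in seconds; a startup probe hands over to liveness as soon as the app first responds.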
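
Sizing that budget is simple arithmetic, sketched here with two hypothetical helpers. The 1.5 headroom factor is an assumption for illustration, not a Kubernetes rule; pick a factor that matches how much your boot time varies.

```python
import math


def startup_budget_seconds(period_seconds: int, failure_threshold: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time a container gets to pass its startup probe before being killed."""
    return initial_delay_seconds + period_seconds * failure_threshold


def failure_threshold_for(boot_p99_seconds: float, period_seconds: int,
                          headroom: float = 1.5) -> int:
    """Smallest failureThreshold whose budget covers p99 boot time times a safety factor."""
    return math.ceil(boot_p99_seconds * headroom / period_seconds)
```

For example, startup_budget_seconds(10, 30) gives the 300 second window from the manifest above, and an app with an 80 second p99 boot probed every 5 seconds would need failure_threshold_for(80, 5), which is 24.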

Timing fields reference

Field	Meaning	Default
`initialDelaySeconds`	Wait after container start before first probe	0
`periodSeconds`	Interval between probes	10
`timeoutSeconds`	Probe must complete within this time	1
`successThreshold`	Consecutive successes to flip to Success	1
`failureThreshold`	Consecutive failures to flip to Failure	3

Readiness-only nuance: successThreshold must be 1 for liveness and startup probes, but it can be > 1 on readiness (e.g. require 2 successes before Ready) to avoid flapping endpoints on bursty apps. This is rare but valid.

Termination: When liveness fails, the kubelet sends SIGTERM and waits up to the Pod's terminationGracePeriodSeconds before SIGKILL. Align probe timeouts with graceful shutdown hooks so in-flight requests drain.

Lifecycle: from Pod scheduled to receiving traffic

Container created
  → (optional) startupProbe runs until Success or kill
  → readinessProbe runs → Pod Ready=True → Endpoints include IP
  → livenessProbe runs in parallel with readiness
  → liveness Failure → container restart → readiness drops until boot completes again

During rolling updates with the default RollingUpdate strategy, new Pods must become Ready before the rollout removes more old Pods, within the limits set by maxSurge and maxUnavailable. Combine with minReadySeconds on the Deployment so a Pod must stay Ready for N seconds before counting as available; this catches “ready for one probe then crash” races.
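
To make the maxSurge / maxUnavailable limits concrete, here is a small sketch of the bounds a rollout has to respect. The function rollout_bounds is a hypothetical helper that handles percentage values only; it relies on the documented defaults of 25% for both fields, with a percentage maxSurge rounded up and a percentage maxUnavailable rounded down.

```python
def rollout_bounds(replicas: int, max_surge_percent: int = 25,
                   max_unavailable_percent: int = 25) -> tuple:
    """Return (max_total_pods, min_available_pods) during a RollingUpdate."""
    surge = -(-replicas * max_surge_percent // 100)          # ceiling division
    unavailable = replicas * max_unavailable_percent // 100  # floor division
    return replicas + surge, replicas - unavailable


if __name__ == "__main__":
    for n in (3, 4, 10):
        print(n, rollout_bounds(n))
```

With 3 replicas the result is (4, 3): one extra Pod may be created, and no old Pod can be removed until a new one counts as available, which is exactly where a failing readiness probe stalls the rollout.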

Separate /livez, /readyz, and /startup semantics in your app

Google’s health check pattern maps cleanly to Kubernetes:

/livez — process up; cheap; no external deps (or optional “degraded but alive”).
/readyz — OK to serve user traffic; check DB, queue, feature flags.
/healthz — often used for startup or aggregate health in docs; pick one convention per team and document it.

Spring Boot Actuator: /actuator/health/liveness and /actuator/health/readiness. ASP.NET Core: MapHealthChecks with separate predicates. Go: wrap net/http mux with distinct handlers.
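
For a framework-free picture of the same split, here is a minimal sketch using only the Python standard library. AppState, health_response, HealthHandler, and serve are hypothetical names, and mapping /healthz to "boot finished" is one possible convention, not a standard.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class AppState:
    """Hypothetical flags the application updates as it boots and runs."""
    started = False          # set once boot work (cache warm-up, migrations) is done
    dependencies_ok = False  # set by a background dependency check


def health_response(path: str, state) -> tuple:
    """Map a probe path to (status, body) without touching external systems."""
    if path == "/livez":
        return 200, "alive"  # process answers; deliberately no dependency checks
    if path == "/healthz":
        return (200, "started") if state.started else (503, "starting")
    if path == "/readyz":
        ready = state.started and state.dependencies_ok
        return (200, "ready") if ready else (503, "not ready")
    return 404, "not found"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, text = health_response(self.path, AppState)
        body = text.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep periodic probe requests out of the access log


def serve(port: int = 8080) -> None:
    """Blocking call; run it in a thread next to the real application."""
    ThreadingHTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Keeping the decision in a pure function makes the semantics easy to unit test, and reading flags that a background task maintains keeps every probe request cheap.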

Multi-container Pods and initContainers

Each container in a Pod may define its own probes. Sidecars (Envoy, log shippers) need their own liveness—do not only probe the main app container.
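initContainers run to completion before app containers start; regular init containers do not support liveness, readiness, or startup probes (native sidecars, that is init containers with restartPolicy: Always, are the exception). Run migrations in init, then let readiness verify “schema at expected version.”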
Pod-level Ready requires all containers’ readiness (and no startup in progress) to succeed.
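
The aggregation rule can be written down as a tiny model. ContainerStatus is a hypothetical, reduced view invented for this sketch (the real Pod status carries far more detail), and readiness gates are ignored.

```python
from dataclasses import dataclass


@dataclass
class ContainerStatus:
    """Hypothetical, reduced view of one container's probe state."""
    name: str
    started: bool = True  # False while a startup probe has not yet succeeded
    ready: bool = True    # latest readiness outcome


def pod_ready(containers: list) -> bool:
    """A Pod is Ready only when every container has started and is ready."""
    return all(c.started and c.ready for c in containers)


def blocking_containers(containers: list) -> list:
    """Names of the containers currently keeping the Pod out of Service endpoints."""
    return [c.name for c in containers if not (c.started and c.ready)]
```

In this model a single failing sidecar is enough to pull the whole Pod out of rotation, which is worth remembering before giving a log shipper a strict readiness probe.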

Jobs, CronJobs, and probes

Short-lived Job Pods often omit probes—the Pod exits when work finishes. Long-running workers in Deployments need readiness so Services do not send traffic to starting replicas. For CronJobs, probes matter only if the job Pod exposes a Service (unusual).

Docker HEALTHCHECK vs Kubernetes probes

HEALTHCHECK in a Dockerfile is interpreted by Docker Engine on single-host runs. Kubernetes ignores Dockerfile HEALTHCHECK for workload health—it uses manifest probes. Keep Dockerfile HEALTHCHECK for local docker run if useful, but always duplicate intent in Pod spec for cluster deploys. See Docker — the hidden side.

Production checklist

Readiness checks dependencies; liveness checks only process health.
Startup probe for any container that boots in > ~30s.
Probe endpoints are unauthenticated or use a cheap token—never redirect to OAuth login.
timeoutSeconds < periodSeconds; probe handler completes in milliseconds under normal load.
Document expected boot time; set failureThreshold × periodSeconds above p99 startup.
Test failure modes in staging: stop DB, verify readiness fails and liveness does not restart.
Align with preStop hook and terminationGracePeriodSeconds for zero-downtime deploys.
Review Helm chart defaults—many charts ship HTTP probes on / that always return 200 from nginx while your app is down.
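
A few of these items are mechanical enough to check in CI. The sketch below lints one container spec passed as a plain dict (for instance, loaded from a rendered manifest by whatever YAML parser you already use). lint_probes and its warning texts are hypothetical, and it covers only the timeout, shared-endpoint, and startup-budget items.

```python
DEFAULTS = {"initialDelaySeconds": 0, "periodSeconds": 10,
            "timeoutSeconds": 1, "failureThreshold": 3}
PROBE_KINDS = ("startupProbe", "readinessProbe", "livenessProbe")


def lint_probes(container: dict, boot_p99_seconds: float) -> list:
    """Return checklist warnings for one container spec given as a plain dict."""
    # Fill in Kubernetes defaults for any timing field the manifest omits.
    probes = {k: {**DEFAULTS, **container[k]} for k in PROBE_KINDS if k in container}
    warnings = []

    for kind, probe in probes.items():
        if probe["timeoutSeconds"] >= probe["periodSeconds"]:
            warnings.append(f"{kind}: timeoutSeconds should be below periodSeconds")

    live, ready = probes.get("livenessProbe"), probes.get("readinessProbe")
    if live and ready and "httpGet" in live and live["httpGet"] == ready.get("httpGet"):
        warnings.append("livenessProbe and readinessProbe share one endpoint")

    startup = probes.get("startupProbe")
    if startup:
        budget = (startup["initialDelaySeconds"]
                  + startup["periodSeconds"] * startup["failureThreshold"])
        if budget <= boot_p99_seconds:
            warnings.append(
                f"startupProbe: budget {budget}s does not exceed p99 boot {boot_p99_seconds}s")
    elif boot_p99_seconds > 30:
        warnings.append("no startupProbe although boot takes longer than ~30s")
    return warnings
```

The remaining checklist items (authentication, preStop alignment, failure-mode tests) depend on application behavior and still need a human or a staging run.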

Anti-patterns

Anti-pattern	Symptom	Fix
One probe for everything	DB outage → CrashLoopBackOff fleet-wide	Split livez vs readyz; readiness only on deps
Liveness on `initialDelaySeconds: 300` only	Stuck process not restarted for 5 minutes	startupProbe for boot; short liveness period after
Readiness always passing	502s from Service during deploy	Implement real readyz; verify Endpoints during rollout
Probe hits admin port requiring mTLS	Random Not Ready	Dedicated plain HTTP health port on loopback or separate listener
Exec probe calling `curl` to internet	Slow probes; flaky under DNS issues	In-process HTTP handler or local TCP

Debugging probe failures

# Events and probe config
kubectl describe pod <name> -n <ns>
# Look for: Liveness probe failed, Readiness probe failed, Unhealthy

# Test the same path from inside the Pod
kubectl exec -it <name> -n <ns> -c <container> -- \
  wget -qO- http://127.0.0.1:8080/readyz

# Previous crash logs after liveness restart
kubectl logs <name> -n <ns> -c <container> --previous

# Is the Pod in Endpoints?
kubectl get endpoints <service> -n <ns> -o yaml

Common event strings:

Liveness probe failed: HTTP probe failed with statuscode: 500 — fix app or point liveness to a lighter path.
Readiness probe failed: dial tcp ... connection refused — app not listening yet; add startup or increase budget.
context deadline exceeded — probe slower than timeoutSeconds; optimize handler or raise timeout slightly.

More scenarios: K8s troubleshooting playbook and hands-on Part 5 — debug.

Complete Deployment example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: learn-dev
spec:
  replicas: 3
  minReadySeconds: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: api
  template:
    metadata:
      labels:
        app.kubernetes.io/name: api
        app.kubernetes.io/version: "1.2.0"
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: api
          image: your-registry/api:1.2.0
          ports:
            - containerPort: 8080
          startupProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
            failureThreshold: 24
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            httpGet:
              path: /livez
              port: 8080
            periodSeconds: 15
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]

Hands-on lab (local cluster)

Deploy nginx without probes, then use kubectl exec to kill the nginx worker; the Service keeps sending traffic to the Pod until you notice manually.
Add readiness on /, scale to 2, and use kubectl exec to break one Pod; watch Endpoints drop its IP.
Add liveness with initialDelaySeconds: 0 and a wrong port; observe CrashLoopBackOff.
Replace it with a startupProbe (long failureThreshold) plus liveness on the correct port; simulate a slow start with a sleep 60 entrypoint and compare outcomes.

Lab prerequisites: Part 1 — local lab, Part 3 — first workloads, Part 4 — day-one practices.
