The Kubernetes Troubleshooting Playbook

Most production incidents fall into a dozen familiar buckets—scheduling, images, crashes, config, networking, storage, RBAC, rollouts. This playbook gives you a repeatable order of operations, symptom-by-symptom fixes, and the commands operators reach for daily.

In short

Observe (get → describe → logs → events) → classify the failure layer → apply the matching fix → verify with endpoints, probes, and a second user’s eyes. Bookmark this page for on-call.

How to use this playbook

Each section follows the same shape: symptom → what it usually means → commands → fix → prevent recurrence. Start with the universal workflow below, then jump to the section that matches what you see in kubectl get pods or your alert.

New to Kubernetes debugging? Read Hands-On Part 5: Debug and next steps first, then return here for production-scale failure modes.

Universal workflow (every incident)

  1. Scope: namespace, workload name, when it started (deploy? node drain? cert expiry?).
  2. Observe: kubectl get pods,deploy,svc,ingress -n <ns>
  3. Explain: kubectl describe pod <pod> -n <ns> — read Events at the bottom.
  4. Logs: kubectl logs <pod> -n <ns> [-c container] [--previous]
  5. Timeline: kubectl get events -n <ns> --sort-by='.lastTimestamp'
  6. Verify fix: readiness, endpoints, curl from a debug pod, synthetic check.

# Set the context once (replace NS and APP with your values)
export NS=production APP=my-api
kubectl get pods -n "$NS" -l "app.kubernetes.io/name=$APP" -o wide
kubectl describe pod -n "$NS" -l "app.kubernetes.io/name=$APP" | tail -40
kubectl logs -n "$NS" -l "app.kubernetes.io/name=$APP" --tail=100 --all-containers=true
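
The observe, explain, logs, and timeline steps can be wrapped in one helper so every incident starts the same way. This is a minimal sketch: the function name triage is hypothetical, and it assumes your Pods carry the app.kubernetes.io/name label used above.

```shell
# triage NS APP: run the observe -> explain -> logs -> timeline steps in order.
# Assumes Pods are labelled app.kubernetes.io/name=APP.
triage() {
  ns=$1
  app=$2
  sel="app.kubernetes.io/name=$app"

  echo "== pods =="
  kubectl get pods -n "$ns" -l "$sel" -o wide
  echo "== describe (tail) =="
  kubectl describe pod -n "$ns" -l "$sel" | tail -40
  echo "== logs =="
  kubectl logs -n "$ns" -l "$sel" --tail=100 --all-containers=true
  echo "== recent events =="
  kubectl get events -n "$ns" --sort-by=.lastTimestamp | tail -20
}
```

Call it as triage production my-api and paste the output into the incident channel before changing anything.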

Failure layers (where to look first)

| Layer | Typical symptoms | First commands |
| --- | --- | --- |
| Cluster / control plane | API timeouts, nothing schedules | kubectl get --raw /healthz, control-plane logs |
| Node | NotReady, widespread Pending | kubectl describe node, kubelet logs |
| Scheduler | Pod Pending, taint/toleration messages | describe pod Events |
| Workload | CrashLoopBackOff, probe failures | logs --previous, describe |
| Config | CreateContainerConfigError | Secrets, ConfigMaps, env refs |
| Network | timeouts, 502/503 | get endpoints, DNS, NetworkPolicy |
| Storage | PVC Pending, mount errors | get pvc,pv, CSI driver pods |
| RBAC | Forbidden from app or CI | kubectl auth can-i |

Playbook: Pod stuck in Pending

Symptom: Pod never reaches Running; Events mention scheduling, volumes, or quotas.

Common causes:

  • Insufficient CPU/memory on nodes (requests too high).
  • Node selector, affinity, or taints without matching tolerations.
  • PVC not bound (unbound immediate PersistentVolumeClaims).
  • ResourceQuota or LimitRange exceeded in namespace.
  • PodSecurity / admission webhook rejection (check Events).

kubectl describe pod <pod> -n <ns>
kubectl describe nodes | grep -A5 "Allocated resources"
kubectl get pvc -n <ns>
kubectl describe resourcequota -n <ns>

Fix: Right-size requests, add nodes (cluster autoscaler), fix PVC/StorageClass, add tolerations only when justified, or raise quota. Prevent: Set realistic requests/limits; monitor allocatable vs requested; test PVC provisioning in staging.
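
The scheduler places a Pod only if its requests fit into what is left of a node's allocatable capacity. The arithmetic behind an "Insufficient cpu" event can be sketched as follows; the function names are hypothetical, and the sketch handles only whole cores ("2") and millicores ("500m"), not fractional values such as "0.5".

```shell
# cpu_to_millicores QUANTITY: convert "500m" or whole cores like "2" to millicores.
cpu_to_millicores() {
  case $1 in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# cpu_fits ALLOCATABLE ALREADY_REQUESTED NEW_REQUEST
# Succeeds if the new request fits into the node's remaining CPU.
cpu_fits() {
  free=$(( $(cpu_to_millicores "$1") - $(cpu_to_millicores "$2") ))
  [ "$(cpu_to_millicores "$3")" -le "$free" ]
}
```

Feed it the numbers from the "Allocated resources" block of kubectl describe node, for example cpu_fits 4 3500m 750m, to confirm whether a request is simply too large for any node.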

Playbook: ImagePullBackOff / ErrImagePull

Symptom: Failed to pull image in Events.

Common causes: Wrong image name or tag; image deleted from registry; private registry without imagePullSecrets; registry rate limits; architecture mismatch (arm64 vs amd64).

kubectl describe pod <pod> -n <ns> | grep -i image
# Private registry
kubectl create secret docker-registry regcred \
  --docker-server=<registry> --docker-username=<u> --docker-password=<p> -n <ns>
# Reference in pod spec: imagePullSecrets: [{ name: regcred }]

Fix: Correct image digest/tag; attach pull secret to ServiceAccount or Pod; use immutable tags in prod. Prevent: CI verifies image exists before deploy; pin digests for critical services.
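
A CI step can flag image references that are not pinned before they reach the cluster. The helper below is a rough sketch with a hypothetical name; it only inspects the reference string and does not contact the registry, so a result of "tag" does not prove the tag is immutable.

```shell
# image_pin_kind IMAGE: print "digest", "tag", or "floating".
image_pin_kind() {
  ref=$1
  case $ref in
    *@sha256:*) echo digest; return ;;
  esac
  last=${ref##*/}   # drop registry host (with optional :port) and path
  case $last in
    *:latest) echo floating ;;
    *:*)      echo tag ;;
    *)        echo floating ;;   # no tag means :latest is implied
  esac
}
```

For example, image_pin_kind nginx prints floating, which is a reasonable trigger for failing a production deploy.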

Playbook: CrashLoopBackOff

Symptom: Container starts, exits, backoff increases; restart count climbs.

Common causes: Application panic on boot; wrong command/args; missing env or config file; listening on wrong port; migration job logic in long-running container; liveness probe killing app too aggressively.

kubectl logs <pod> -n <ns> --previous
kubectl logs <pod> -n <ns> -c <container> --previous
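# Run the same image locally with the same command, args, and env
# (a Pod's command replaces the image ENTRYPOINT; its args follow the image name)
docker run --rm -it -e <KEY>=<value> --entrypoint <command> <image> <args>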

Fix: Fix app startup; mount ConfigMap/Secret; align probe initialDelaySeconds with real boot time; split init logic into initContainers. Prevent: Staging deploy with identical env; startup probes for slow apps.
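
The exit code of the previous container run narrows the cause quickly: codes above 128 mean the process was terminated by a signal (code minus 128). A small decoder, with a hypothetical function name, might look like this.

```shell
# explain_exit_code CODE: give a first reading of a container exit code.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exit 0: process exited cleanly"
  elif [ "$code" -gt 128 ]; then
    sig=$(( code - 128 ))
    case $sig in
      6)  name=SIGABRT ;;
      9)  name=SIGKILL ;;
      11) name=SIGSEGV ;;
      15) name=SIGTERM ;;
      *)  name="signal $sig" ;;
    esac
    echo "exit $code: killed by $name"
  else
    echo "exit $code: application error"
  fi
}
```

The code itself can be read with kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'. An exit 0 in a crash loop usually means a one-shot command is running where a long-lived process is expected.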

Playbook: OOMKilled (exit 137)

Symptom: Last State: Terminated, Reason: OOMKilled; sudden restarts under load.

Common causes: Memory limit too low; memory leak; JVM/Node heap not aligned with container limit; no limits set (node eviction instead).

kubectl describe pod <pod> -n <ns> | grep -A2 Limits
kubectl top pod <pod> -n <ns>   # needs metrics-server

Fix: Raise resources.limits.memory after profiling; set requests ≈ steady state; configure app heap ≤ ~75% of limit. Prevent: Load tests; alerts on memory working set vs limit; VPA or rightsizing reviews.
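
The "heap ≤ ~75% of limit" rule is simple arithmetic, sketched below. The function name is hypothetical, the 75% default comes from the guideline above, and only integer Mi and Gi limits are handled.

```shell
# heap_mib_for_limit LIMIT [PERCENT]: suggested heap size in MiB for a memory limit.
# LIMIT must look like "512Mi" or "2Gi"; PERCENT defaults to 75.
heap_mib_for_limit() {
  limit=$1
  pct=${2:-75}
  case $limit in
    *Gi) mib=$(( ${limit%Gi} * 1024 )) ;;
    *Mi) mib=${limit%Mi} ;;
    *)   echo "unsupported unit: $limit (expected Mi or Gi)" >&2; return 1 ;;
  esac
  echo $(( mib * pct / 100 ))
}
```

For a JVM this could feed a flag such as -Xmx$(heap_mib_for_limit 512Mi)m; the remainder is left for thread stacks, metaspace, and native buffers.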

Playbook: CreateContainerConfigError

Symptom: Pod cannot start; Events reference missing key or secret.

Common causes: Secret/ConfigMap key typo; object in wrong namespace; a reference that should be optional but is not marked optional: true; projected volume path clash.

kubectl get secret,configmap -n <ns>
kubectl describe pod <pod> -n <ns>
# Compare envFrom / volumeMount keys to actual data keys

Fix: Create or fix Secret/ConfigMap; sync from External Secrets Operator; fix Helm/Kustomize templates. Prevent: Pre-deploy validation in CI; never hand-edit prod Secrets without GitOps trail.
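
Comparing the keys a Pod expects with the keys that actually exist is easy to automate as a pre-deploy check. The helper below is a sketch with a hypothetical name; it takes two whitespace-separated key lists and leaves fetching them to the caller.

```shell
# missing_keys "EXPECTED KEYS" "ACTUAL KEYS": print each expected key that is absent.
# Returns non-zero if at least one key is missing.
missing_keys() {
  status=0
  for want in $1; do
    found=no
    for have in $2; do
      [ "$want" = "$have" ] && found=yes
    done
    if [ "$found" = no ]; then
      echo "$want"
      status=1
    fi
  done
  return $status
}
```

One way to list the actual keys is kubectl get configmap <name> -n <ns> -o go-template='{{range $k, $v := .data}}{{$k}} {{end}}'; the same template works for a Secret.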

Playbook: Running but not Ready (probe failures)

Symptom: Pod Running, READY 0/1; Service has no endpoints; Ingress returns 502.

Common causes: Readiness probe wrong path/port/scheme; app listens on 127.0.0.1 only; dependency (DB) down; probe too aggressive during warmup.

kubectl describe pod <pod> -n <ns> | grep -A10 "Liveness\|Readiness"
kubectl exec -it <pod> -n <ns> -- wget -qO- http://127.0.0.1:<port>/health
# Or curl from debug pod to Service ClusterIP

Fix: Match probe to real health endpoint; use startupProbe for slow boot; fix upstream dependency. Prevent: Chart defaults reviewed per app; synthetic checks in staging.
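
When sizing a startupProbe, the time an app gets to boot is roughly failureThreshold × periodSeconds, plus any initialDelaySeconds. The sketch below checks that budget against a measured boot time; the function names are hypothetical and the result is an approximation that ignores probe timeouts.

```shell
# startup_budget_seconds FAILURE_THRESHOLD PERIOD_SECONDS [INITIAL_DELAY_SECONDS]
startup_budget_seconds() {
  echo $(( ${3:-0} + $1 * $2 ))
}

# probe_covers_boot BUDGET_SECONDS OBSERVED_BOOT_SECONDS
# Succeeds if the observed boot time fits inside the probe budget.
probe_covers_boot() {
  [ "$2" -le "$1" ]
}
```

With failureThreshold 30 and periodSeconds 10 the budget is 300 seconds, so an app that needs 90 seconds to warm up passes probe_covers_boot 300 90 with room to spare.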

Playbook: Service has no endpoints

Symptom: kubectl get endpoints <svc> shows empty subsets; in-cluster DNS resolves but connection refused or timeout.

Common causes: Service selector does not match Pod labels; Pods not Ready; wrong port or targetPort (name vs number); headless Service misuse.

kubectl get svc <svc> -n <ns> -o yaml | grep -A5 selector
kubectl get pods -n <ns> --show-labels
kubectl get endpoints <svc> -n <ns>
kubectl run debug --rm -it --image=curlimages/curl --restart=Never -- \
  curl -v http://<svc>.<ns>.svc.cluster.local:<port>/

Fix: Align labels and selectors; fix targetPort to container port; ensure readiness passes. See first workloads (Service).
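
A Service selects a Pod only when every key=value pair in its selector appears in the Pod's labels. That check can be sketched in plain shell; the function name is hypothetical and both arguments are comma-separated key=value lists, the format printed by --show-labels.

```shell
# selector_matches SELECTOR LABELS
# Succeeds if every pair in SELECTOR is present in LABELS; prints the first missing pair.
selector_matches() {
  selector=$1
  labels=",$2,"
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case $labels in
      *",$pair,"*) ;;
      *) IFS=$old_ifs; echo "missing: $pair"; return 1 ;;
    esac
  done
  IFS=$old_ifs
  return 0
}
```

Extra Pod labels such as pod-template-hash are fine; a single missing or misspelled pair is enough to leave the endpoints list empty.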

Playbook: DNS failures (NXDOMAIN / timeout)

Symptom: App logs no such host for *.svc.cluster.local; intermittent resolution.

Common causes: Wrong service name or namespace; CoreDNS pods unhealthy; ndots search path issues; NetworkPolicy blocking UDP/TCP 53; custom DNS config in Pod spec.

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local

Fix: Use FQDN service.namespace.svc.cluster.local; repair CoreDNS; adjust NetworkPolicy; fix dnsPolicy/dnsConfig.
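
The ndots issue is easier to see with a small calculation. With the default ndots:5 that Kubernetes sets for Pods, a name with fewer than five dots is tried against the search domains first unless it ends with a trailing dot. The helpers below are a sketch with hypothetical names, assuming the default cluster domain cluster.local.

```shell
# svc_fqdn SERVICE NAMESPACE [CLUSTER_DOMAIN]: build the in-cluster FQDN.
svc_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# count_dots NAME: number of dots in NAME.
count_dots() {
  dots=$(printf '%s' "$1" | tr -cd '.')
  echo "${#dots}"
}

# search_path_first NAME [NDOTS]: succeeds if the resolver tries search domains first.
search_path_first() {
  name=$1
  ndots=${2:-5}
  case $name in
    *.) return 1 ;;   # trailing dot: treated as absolute
  esac
  [ "$(count_dots "$name")" -lt "$ndots" ]
}
```

Note that my-api.production.svc.cluster.local has only four dots, so it still goes through the search list first; appending a trailing dot avoids the extra lookups.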

Playbook: Ingress returns 404 / 502 / 504

Symptom: External URL fails; in-cluster Service works via port-forward.

Common causes: No ingress controller installed or running; wrong ingressClassName; TLS secret missing or expired; backend Service has no endpoints; path/rule mismatch; timeout too low for slow upstream.

kubectl get ingress -n <ns>
kubectl describe ingress <name> -n <ns>
kubectl get pods -n ingress-nginx   # or your controller namespace
kubectl get certificate -n <ns>    # cert-manager

Fix: Install/fix controller; align host/path rules; renew certs; increase proxy timeouts. Prevent: cert-manager with alerts 30 days before expiry.

Playbook: NetworkPolicy blocked traffic

Symptom: Worked until policy applied; works from some namespaces only; timeouts with no app error.

Common causes: Egress denied to DNS, API, or database; ingress only from wrong namespace label; CNI does not enforce NetworkPolicy.

kubectl get networkpolicy -n <ns>
# Temporarily test: clone policy in staging with broader rules, narrow down
# Many teams use policy visualization tools or eBPF flow logs

Fix: Add allow rules for required labels/ports (including kube-dns); document expected flows. Prevent: Default-deny only after mapping dependencies.
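
The most common omission after a default-deny is DNS egress. The function below prints a NetworkPolicy that allows it for every Pod in a namespace. Treat it as a sketch: the function and policy names are hypothetical, and it assumes cluster DNS runs in kube-system with the k8s-app=kube-dns label and that namespaces carry the automatic kubernetes.io/metadata.name label.

```shell
# allow_dns_egress_manifest NAMESPACE: print a NetworkPolicy allowing DNS egress.
allow_dns_egress_manifest() {
  ns=$1
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: $ns
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}
```

Review it first with allow_dns_egress_manifest production | kubectl apply --dry-run=client -f - and adjust the selectors if your cluster uses a node-local DNS cache.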

Playbook: PVC Pending / volume mount failures

Symptom: PVC Pending; Pod Events: failed to mount volume or Multi-Attach error.

Common causes: No StorageClass or provisioner down; zone mismatch; RWO volume attached to second node; CSI driver crash; fsGroup permission issues.

kubectl get pvc,pv,storageclass
kubectl describe pvc <claim> -n <ns>
kubectl get pods -n kube-system | grep -i csi

Fix: Correct StorageClass; ensure single writer for RWO; restart CSI node plugin if stuck; use RWX or shared storage when multiple replicas need disk. Deep dive: PV, PVC, and StorageClass.

Playbook: Deployment rollout stuck

Symptom: kubectl rollout status hangs; old ReplicaSet still serves traffic; ProgressDeadlineExceeded.

Common causes: New Pods never become Ready; maxUnavailable / maxSurge with too few replicas; PDB blocking drain; image pull failure on new version only.

kubectl rollout status deployment/<name> -n <ns>
kubectl describe deployment <name> -n <ns>
kubectl get rs -n <ns> -l app=<app>
kubectl rollout undo deployment/<name> -n <ns>   # emergency

Fix: Fix new Pod template; pause rollout (kubectl rollout pause), fix, resume; undo if bad release. Prevent: Canary or blue-green; automated smoke tests in pipeline.
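
The "too few replicas" case follows from how percentages are rounded: maxSurge rounds up and maxUnavailable rounds down. The sketch below (hypothetical function name, defaults of 25% for both) shows the resulting bounds.

```shell
# rollout_bounds REPLICAS [MAX_SURGE_PCT] [MAX_UNAVAILABLE_PCT]
# Prints the most Pods that may exist and the fewest that must stay available.
rollout_bounds() {
  replicas=$1
  surge_pct=${2:-25}
  unavail_pct=${3:-25}
  surge=$(( (replicas * surge_pct + 99) / 100 ))   # rounds up
  unavail=$(( replicas * unavail_pct / 100 ))      # rounds down
  echo "max_total=$(( replicas + surge )) min_available=$(( replicas - unavail ))"
}
```

With one replica the result is max_total=2 min_available=1: the old Pod cannot be removed until a new one is Ready, so the rollout stalls if quota or node capacity has no room for the extra Pod.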

Playbook: Forbidden (RBAC)

Symptom: CI, operator, or in-cluster app gets 403 Forbidden from API.

kubectl auth can-i create pods --as=system:serviceaccount:<ns>:<sa> -n <ns>
kubectl describe rolebinding,clusterrolebinding -n <ns> | grep -A3 Subjects

Fix: Grant least-privilege Role/RoleBinding; fix wrong ServiceAccount on Pod. Guide: Kubernetes cluster RBAC.
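
To see at a glance which verbs a ServiceAccount is missing, the can-i check can be looped. This is a sketch with a hypothetical function name; it relies on kubectl auth can-i printing yes or no.

```shell
# sa_can_i NAMESPACE SERVICEACCOUNT RESOURCE VERB...
# Prints one "verb resource: yes|no" line per verb.
sa_can_i() {
  ns=$1
  sa=$2
  resource=$3
  shift 3
  for verb in "$@"; do
    # can-i exits non-zero on "no", so keep going regardless of status
    answer=$(kubectl auth can-i "$verb" "$resource" \
      --as="system:serviceaccount:$ns:$sa" -n "$ns" 2>/dev/null) || true
    echo "$verb $resource: ${answer:-unknown}"
  done
}
```

For example, sa_can_i production my-api pods get list create shows which of the three verbs the Role still needs to grant.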

Playbook: Node NotReady / eviction storm

Symptom: Many Pods rescheduling; node NotReady; disk pressure or memory pressure taints.

kubectl describe node <node>
kubectl get pods -A --field-selector spec.nodeName=<node>
# On node (if SSH allowed): journalctl -u kubelet -f

Fix: Free disk (image gc, log rotation); fix kubelet/CNI; cordon/drain bad node; replace hardware. Prevent: Node problem detector alerts; PDBs so one node loss does not take the service down.

Playbook: API server slow or failing

Symptom: kubectl timeouts; controllers lag; etcd alarms.

Common causes: etcd latency or full disk; excessive objects (thousands of Secrets); admission webhook timeout; audit log volume.

Fix: Scale control plane; defragment/compaction per runbook; fix webhook; reduce list-watch churn. Escalate to platform/SRE—this is rarely fixed from a single namespace.

Playbook: Certificate and TLS errors

Symptom: Browser or client reports an expired certificate; Ingress TLS handshake fails; mesh mTLS rejects connections.

kubectl get certificate,certificaterequest -n <ns>
kubectl describe certificate <name> -n <ns>
openssl s_client -connect <host>:443 -servername <host>

Fix: Renew via cert-manager ClusterIssuer; fix DNS-01/HTTP-01 challenge; rotate Istio/mesh certs. Prevent: Alert on cert expiry < 14 days.
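
An expiry alert can be approximated with openssl x509 -checkend, which exits 0 when a certificate is still valid after the given number of seconds. The wrapper below is a sketch with a hypothetical name; note that it also reports "expiring" when the input is not a readable certificate, which is the safer failure mode for an alert.

```shell
# pem_expires_within DAYS: read a PEM certificate on stdin.
# Succeeds if the certificate expires within DAYS (or cannot be parsed).
pem_expires_within() {
  seconds=$(( $1 * 86400 ))
  # -checkend exits 0 when the cert is still valid after SECONDS, so invert it
  if openssl x509 -noout -checkend "$seconds" >/dev/null; then
    return 1
  fi
  return 0
}
```

A possible check against a live endpoint: openssl s_client -connect <host>:443 -servername <host> </dev/null 2>/dev/null | pem_expires_within 14 && echo "renew soon".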

Playbook: HPA not scaling

Symptom: Load high but replica count unchanged; FailedGetScale or metrics unavailable.

kubectl describe hpa <name> -n <ns>
kubectl top pods -n <ns>
kubectl get apiservice | grep metrics

Fix: Install metrics-server or Prometheus adapter; set correct scaleTargetRef; define requests so CPU utilization is meaningful; check min/max replicas. See metrics-server in depth and HPA in depth.
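
It helps to know what the HPA should have done. Its core calculation is ceil(currentReplicas × currentMetric / targetMetric), clamped to the min and max replicas. The sketch below reproduces that arithmetic with a hypothetical function name; it ignores the controller's tolerance band and stabilization windows.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_UTIL_PCT TARGET_UTIL_PCT MIN MAX
hpa_desired() {
  want=$(( ($1 * $2 + $3 - 1) / $3 ))   # integer ceil(current * metric / target)
  [ "$want" -lt "$4" ] && want=$4
  [ "$want" -gt "$5" ] && want=$5
  echo "$want"
}
```

If hpa_desired 3 80 50 1 10 says 5 but the Deployment stays at 3, the problem is usually missing metrics or missing CPU requests rather than the target value.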

Quick reference: status → action

| What you see | First action |
| --- | --- |
| Pending | describe pod → scheduler / PVC / quota |
| ImagePullBackOff | Verify image + imagePullSecrets |
| CrashLoopBackOff | logs --previous |
| OOMKilled | Raise memory limit; profile heap |
| CreateContainerConfigError | Secret/ConfigMap keys |
| Running, not Ready | Readiness probe + exec curl |
| Service works via port-forward only | Ingress / endpoints / selector |
| PVC Pending | StorageClass + CSI |
| Forbidden | auth can-i + bindings |
| Rollout stuck | describe deploy + rollout undo |
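
For alert annotations or chat-ops, the Pod-status rows of the table above can be encoded as a lookup. This is a sketch with a hypothetical function name; the wording of each action is taken from the table.

```shell
# first_action STATUS: print the first thing to check for a Pod STATUS value.
first_action() {
  case $1 in
    Pending)                       echo "describe pod: scheduler / PVC / quota" ;;
    ImagePullBackOff|ErrImagePull) echo "verify image + imagePullSecrets" ;;
    CrashLoopBackOff)              echo "logs --previous" ;;
    OOMKilled)                     echo "raise memory limit; profile heap" ;;
    CreateContainerConfigError)    echo "check Secret/ConfigMap keys" ;;
    *)                             echo "describe pod and read Events" ;;
  esac
}
```

Pass it the STATUS column from kubectl get pods, for example first_action CrashLoopBackOff.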

Tools worth adding (after kubectl)

  • kubectl debug: ephemeral debug containers (use --target to share a container's process namespace, or --copy-to to debug a copy of the Pod).
  • stern: multi-pod log tail by label.
  • k9s: fast cluster navigation.
  • Prometheus/Grafana: correlate restarts with memory and latency.
  • Network flow: Cilium Hubble or Calico flow logs when policies are suspect.
