The Kubernetes Troubleshooting Playbook
Most production incidents fall into a dozen familiar buckets—scheduling, images, crashes, config, networking, storage, RBAC, rollouts. This playbook gives you a repeatable order of operations, symptom-by-symptom fixes, and the commands operators reach for daily.
In short
Observe (get → describe → logs → events) → classify the failure layer → apply the matching fix → verify with endpoints, probes, and a second user’s eyes. Bookmark this page for on-call.
How to use this playbook
Each section follows the same shape: symptom → what it usually means → commands → fix → prevent recurrence. Start with the universal workflow below; jump to the section that matches what you see in kubectl get pods or your alert.
New to Kubernetes debugging? Read Hands-On Part 5: Debug and next steps first, then return here for production-scale failure modes.
Universal workflow (every incident)
- Scope: namespace, workload name, when it started (deploy? node drain? cert expiry?).
- Observe: kubectl get pods,deploy,svc,ingress -n <ns>
- Explain: kubectl describe pod <pod> -n <ns> (read Events at the bottom).
- Logs: kubectl logs <pod> -n <ns> [-c container] [--previous]
- Timeline: kubectl get events -n <ns> --sort-by='.lastTimestamp'
- Verify fix: readiness, endpoints, curl from a debug pod, synthetic check.
# One-liner context (replace NS and labels)
export NS=production APP=my-api
kubectl get pods -n $NS -l app.kubernetes.io/name=$APP -o wide
kubectl describe pod -n $NS -l app.kubernetes.io/name=$APP | tail -40
kubectl logs -n $NS -l app.kubernetes.io/name=$APP --tail=100 --all-containers=true
Failure layers (where to look first)
| Layer | Typical symptoms | First commands |
|---|---|---|
| Cluster / control plane | API timeouts, nothing schedules | kubectl get --raw '/readyz?verbose' (the older /healthz endpoint is deprecated), control-plane logs |
| Node | NotReady, widespread Pending | kubectl describe node, kubelet logs |
| Scheduler | Pod Pending, taint/toleration messages | describe pod Events |
| Workload | CrashLoopBackOff, probe failures | logs --previous, describe |
| Config | CreateContainerConfigError | Secrets, ConfigMaps, env refs |
| Network | timeouts, 502/503 | get endpoints, DNS, NetworkPolicy |
| Storage | PVC Pending, mount errors | get pvc,pv, CSI driver pods |
| RBAC | Forbidden from app or CI | kubectl auth can-i |
Playbook: Pod stuck in Pending
Symptom: Pod never reaches Running; Events mention scheduling, volumes, or quotas.
Common causes:
- Insufficient CPU/memory on nodes (requests too high).
- Node selector, affinity, or taints without matching tolerations.
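- PVC not bound (unbound immediate PersistentVolumeClaims).
- ResourceQuota or LimitRange exceeded in namespace. In this case the Pod is rejected at creation rather than left Pending, so look at the ReplicaSet Events (kubectl describe rs).
- PodSecurity / admission webhook rejection (check Events on the Pod's owner, since the Pod itself may never be created).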
kubectl describe pod <pod> -n <ns>
kubectl describe nodes | grep -A5 "Allocated resources"
kubectl get pvc -n <ns>
kubectl describe resourcequota -n <ns>
Fix: Right-size requests, add nodes (cluster autoscaler), fix PVC/StorageClass, add tolerations only when justified, or raise quota. Prevent: Set realistic requests/limits; monitor allocatable vs requested; test PVC provisioning in staging.
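The scheduler places a Pod only when its requests fit into a node's allocatable capacity minus what is already requested there. A minimal sketch of that comparison for CPU, assuming quantities are either millicores such as 250m or whole cores such as 2 (decimal cores like 0.5 are not handled, and both helper names are hypothetical):

```shell
# cpu_millicores QUANTITY: convert "250m" or whole cores like "2" to millicores.
cpu_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# fits_on_node REQUEST ALLOCATABLE ALREADY_REQUESTED
# Exit status 0 when the new request still fits on the node, 1 otherwise.
fits_on_node() {
  request=$(cpu_millicores "$1")
  allocatable=$(cpu_millicores "$2")
  used=$(cpu_millicores "$3")
  [ $((used + request)) -le "$allocatable" ]
}
```

The three inputs correspond to the Pod's CPU request and the Allocatable and "Allocated resources" figures from kubectl describe node. For example, fits_on_node 500m 4 3600m fails because 4100m exceeds 4000m.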
Playbook: ImagePullBackOff / ErrImagePull
Symptom: Failed to pull image in Events.
Common causes: Wrong image name or tag; image deleted from registry; private registry without imagePullSecrets; registry rate limits; architecture mismatch (arm64 vs amd64).
kubectl describe pod <pod> -n <ns> | grep -i image
# Private registry
kubectl create secret docker-registry regcred \
--docker-server=<registry> --docker-username=<u> --docker-password=<p> -n <ns>
# Reference in pod spec: imagePullSecrets: [{ name: regcred }]
Fix: Correct image digest/tag; attach pull secret to ServiceAccount or Pod; use immutable tags in prod. Prevent: CI verifies image exists before deploy; pin digests for critical services.
Playbook: CrashLoopBackOff
Symptom: Container starts, exits, backoff increases; restart count climbs.
Common causes: Application panic on boot; wrong command/args; missing env or config file; listening on wrong port; migration job logic in long-running container; liveness probe killing app too aggressively.
kubectl logs <pod> -n <ns> --previous
kubectl logs <pod> -n <ns> -c <container> --previous
# Run same image locally with same command/env
docker run --rm -it <image> <same-entrypoint>
Fix: Fix app startup; mount ConfigMap/Secret; align probe initialDelaySeconds with real boot time; split init logic into initContainers. Prevent: Staging deploy with identical env; startup probes for slow apps.
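The growing backoff in the symptom follows the kubelet's documented schedule: the delay starts at 10 seconds, doubles after each crash, and is capped at five minutes. A rough sketch of that arithmetic, useful for estimating how long a fixed Pod may still wait before its next restart (the function name is hypothetical, and the sketch ignores the reset that happens after a container runs cleanly for a while):

```shell
# crashloop_delay N: approximate wait in seconds before the Nth delayed restart.
# Assumes the documented kubelet defaults: 10s base, doubling, 300s cap.
crashloop_delay() {
  n=$1
  delay=10
  i=1
  while [ "$i" -lt "$n" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 300 ]; then delay=300; fi
  echo "$delay"
}
```

By the sixth delayed restart the wait has already reached the 300-second cap, which is why deleting the Pod after a fix is often faster than waiting for the next retry.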
Playbook: OOMKilled (exit 137)
Symptom: Last State: Terminated, Reason: OOMKilled; sudden restarts under load.
Common causes: Memory limit too low; memory leak; JVM/Node heap not aligned with container limit; no limits set (node eviction instead).
kubectl describe pod <pod> -n <ns> | grep -A2 Limits
kubectl top pod <pod> -n <ns> # needs metrics-server
Fix: Raise resources.limits.memory after profiling; set requests ≈ steady state; configure app heap ≤ ~75% of limit. Prevent: Load tests; alerts on memory working set vs limit; VPA or rightsizing reviews.
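The ~75% guideline is simple arithmetic. A sketch, assuming the container limit is expressed in MiB and using a hypothetical helper name; the right fraction depends on how much off-heap memory your runtime needs:

```shell
# heap_mib LIMIT_MIB [PERCENT]: suggested maximum heap in MiB for a memory limit.
# PERCENT defaults to 75, the rule of thumb from the text; division rounds down.
heap_mib() {
  limit_mib=$1
  percent=${2:-75}
  echo $((limit_mib * percent / 100))
}
```

For a 2048Mi limit, heap_mib 2048 prints 1536, which could feed a JVM setting such as -Xmx1536m.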
Playbook: CreateContainerConfigError
Symptom: Pod cannot start; Events reference missing key or secret.
Common causes: Secret/ConfigMap key typo; object in wrong namespace; optional key not optional; projected volume path clash.
kubectl get secret,configmap -n <ns>
kubectl describe pod <pod> -n <ns>
# Compare envFrom / volumeMount keys to actual data keys
Fix: Create or fix Secret/ConfigMap; sync from External Secrets Operator; fix Helm/Kustomize templates. Prevent: Pre-deploy validation in CI; never hand-edit prod Secrets without GitOps trail.
Playbook: Running but not Ready (probe failures)
Symptom: Pod Running, READY 0/1; Service has no endpoints; Ingress returns 502.
Common causes: Readiness probe wrong path/port/scheme; app listens on 127.0.0.1 only; dependency (DB) down; probe too aggressive during warmup.
kubectl describe pod <pod> -n <ns> | grep -A10 "Liveness\|Readiness"
kubectl exec -it <pod> -n <ns> -- wget -qO- http://127.0.0.1:<port>/health
# Or curl from debug pod to Service ClusterIP
Fix: Match probe to real health endpoint; use startupProbe for slow boot; fix upstream dependency. Prevent: Chart defaults reviewed per app; synthetic checks in staging.
Playbook: Service has no endpoints
Symptom: kubectl get endpoints <svc> shows empty subsets; in-cluster DNS resolves but connection refused or timeout.
Common causes: Service selector does not match Pod labels; Pods not Ready; wrong port targetPort (name vs number); headless Service misuse.
kubectl get svc <svc> -n <ns> -o yaml | grep -A5 selector
kubectl get pods -n <ns> --show-labels
kubectl get endpoints <svc> -n <ns>
kubectl run debug --rm -it --image=curlimages/curl --restart=Never -- \
curl -v http://<svc>.<ns>.svc.cluster.local:<port>/
Fix: Align labels and selectors; fix targetPort to container port; ensure readiness passes. See first workloads (Service).
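A Service selects a Pod only when every key=value pair in its selector appears among the Pod's labels. That check can be sketched in plain shell against the comma-separated label string that --show-labels prints (the helper name is hypothetical, and it assumes simple equality selectors):

```shell
# selector_matches SELECTOR LABELS
# Both arguments are comma-separated key=value lists, e.g. "app=web,tier=api".
# Exit status 0 when every selector pair is present in LABELS, 1 otherwise.
# The body runs in a subshell so the IFS and globbing changes do not leak out.
selector_matches() (
  set -f                       # no filename expansion while splitting
  labels=",$2,"
  IFS=,
  for pair in $1; do
    case "$labels" in
      *",$pair,"*) ;;          # this pair is present, keep checking
      *) exit 1 ;;             # first missing pair means no match
    esac
  done
  exit 0
)
```

A typical miss is a selector of app=web,tier=api against Pods labelled only app=web: the extra tier pair makes the Service select nothing.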
Playbook: DNS failures (NXDOMAIN / timeout)
Symptom: App logs no such host for *.svc.cluster.local; intermittent resolution.
Common causes: Wrong service name or namespace; CoreDNS pods unhealthy; ndots search path issues; NetworkPolicy blocking UDP 53; custom DNS config in Pod spec.
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- \
nslookup kubernetes.default.svc.cluster.local
Fix: Use FQDN service.namespace.svc.cluster.local; repair CoreDNS; adjust NetworkPolicy; fix dnsPolicy/dnsConfig.
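The ndots cause is easier to reason about with the resolver rule written out. A name with fewer dots than ndots is tried against the Pod's search domains first, and Kubernetes sets ndots:5 for the default ClusterFirst policy, so most external hostnames generate several failed lookups before the real one. A sketch of that rule, with a hypothetical helper name:

```shell
# uses_search_list NAME [NDOTS]
# Prints how a resolv.conf-style resolver would treat NAME. NDOTS defaults to 5.
#   search-first : search domains are appended before the literal name is tried
#   as-is-first  : the literal name is tried first
#   as-is-only   : trailing dot, the name is absolute and never searched
uses_search_list() {
  name=$1
  ndots=${2:-5}
  case "$name" in
    *.) echo "as-is-only"; return 0 ;;
  esac
  dots_only=$(printf '%s' "$name" | tr -cd '.')
  if [ "${#dots_only}" -lt "$ndots" ]; then
    echo "search-first"
  else
    echo "as-is-first"
  fi
}
```

Note that my-svc.my-ns.svc.cluster.local has only four dots, so it is still search-first under ndots:5; appending a trailing dot skips the search list entirely.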
Playbook: Ingress returns 404 / 502 / 504
Symptom: External URL fails; in-cluster Service works via port-forward.
Common causes: No ingress controller installed or running; wrong ingressClassName; TLS secret missing or expired; backend Service has no endpoints; path/rule mismatch; timeout too low for slow upstream.
kubectl get ingress -n <ns>
kubectl describe ingress <name> -n <ns>
kubectl get pods -n ingress-nginx # or your controller namespace
kubectl get certificate -n <ns> # cert-manager
Fix: Install/fix controller; align host/path rules; renew certs; increase proxy timeouts. Prevent: cert-manager with alerts 30 days before expiry.
Playbook: NetworkPolicy blocked traffic
Symptom: Worked until policy applied; works from some namespaces only; timeouts with no app error.
Common causes: Egress denied to DNS, API, or database; ingress only from wrong namespace label; CNI does not enforce NetworkPolicy.
kubectl get networkpolicy -n <ns>
# Temporarily test: clone policy in staging with broader rules, narrow down
# Many teams use policy visualization tools or eBPF flow logs
Fix: Add allow rules for required labels/ports (including kube-dns); document expected flows. Prevent: Default-deny only after mapping dependencies.
Playbook: PVC Pending / volume mount failures
Symptom: PVC Pending; Pod Events: failed to mount volume or Multi-Attach error.
Common causes: No StorageClass or provisioner down; zone mismatch; RWO volume attached to second node; CSI driver crash; fsGroup permission issues.
kubectl get pvc,pv,storageclass -n <ns> # pv and storageclass are cluster-scoped; -n applies to the pvc
kubectl describe pvc <claim> -n <ns>
kubectl get pods -n kube-system | grep -i csi
Fix: Correct StorageClass; ensure single writer for RWO; restart CSI node plugin if stuck; use RWX or shared storage when multiple replicas need disk. Deep dive: PV, PVC, and StorageClass.
Playbook: Deployment rollout stuck
Symptom: kubectl rollout status hangs; old ReplicaSet still serves traffic; ProgressDeadlineExceeded.
Common causes: New Pods never become Ready; maxUnavailable / maxSurge with too few replicas; PDB blocking drain; image pull failure on new version only.
kubectl rollout status deployment/<name> -n <ns>
kubectl describe deployment <name> -n <ns>
kubectl get rs -n <ns> -l app=<app>
kubectl rollout undo deployment/<name> -n <ns> # emergency
Fix: Fix new Pod template; pause rollout (kubectl rollout pause), fix, resume; undo if bad release. Prevent: Canary or blue-green; automated smoke tests in pipeline.
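Percentage values for maxUnavailable and maxSurge are resolved against the replica count, with maxUnavailable rounded down and maxSurge rounded up. With few replicas this matters: the rollout may only be able to proceed by adding a Pod, which needs spare capacity and quota. A sketch of the arithmetic (helper names are hypothetical):

```shell
# max_unavailable REPLICAS PERCENT: absolute Pod count, rounded down.
max_unavailable() {
  echo $(( $1 * $2 / 100 ))
}

# max_surge REPLICAS PERCENT: absolute Pod count, rounded up.
max_surge() {
  echo $(( ($1 * $2 + 99) / 100 ))
}
```

For 3 replicas at the default 25%, max_unavailable 3 25 prints 0 and max_surge 3 25 prints 1, so no old Pod is removed until one extra new Pod becomes Ready.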
Playbook: Forbidden (RBAC)
Symptom: CI, operator, or in-cluster app gets 403 Forbidden from API.
kubectl auth can-i create pods --as=system:serviceaccount:<ns>:<sa> -n <ns>
kubectl describe rolebinding,clusterrolebinding -n <ns> | grep -A3 Subjects
Fix: Grant least-privilege Role/RoleBinding; fix wrong ServiceAccount on Pod. Guide: Kubernetes cluster RBAC.
Playbook: Node NotReady / eviction storm
Symptom: Many Pods rescheduling; node NotReady; disk pressure or memory pressure taints.
kubectl describe node <node>
kubectl get pods -A --field-selector spec.nodeName=<node>
# On node (if SSH allowed): journalctl -u kubelet -f
Fix: Free disk (image gc, log rotation); fix kubelet/CNI; cordon/drain bad node; replace hardware. Prevent: Node problem detector alerts; PDBs so one node loss does not take the service down.
Playbook: API server slow or failing
Symptom: kubectl timeouts; controllers lag; etcd alarms.
Common causes: etcd latency or full disk; excessive objects (thousands of Secrets); admission webhook timeout; audit log volume.
Fix: Scale the control plane; defragment and compact etcd per runbook; fix the slow webhook; reduce list-watch churn. Escalate to platform/SRE, because this is rarely fixed from a single namespace.
Playbook: Certificate and TLS errors
Symptom: Browser or client reports an expired certificate; Ingress TLS handshake fails; mesh mTLS rejects connections.
kubectl get certificate,certificaterequest -n <ns>
kubectl describe certificate <name> -n <ns>
openssl s_client -connect <host>:443 -servername <host>
Fix: Renew via cert-manager ClusterIssuer; fix DNS-01/HTTP-01 challenge; rotate Istio/mesh certs. Prevent: Alert on cert expiry < 14 days.
Playbook: HPA not scaling
Symptom: Load high but replica count unchanged; FailedGetScale or metrics unavailable.
kubectl describe hpa <name> -n <ns>
kubectl top pods -n <ns>
kubectl get apiservice | grep metrics
Fix: Install metrics-server or Prometheus adapter; set correct scaleTargetRef; define requests so CPU utilization is meaningful; check min/max replicas. See metrics-server in depth and HPA in depth.
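Requests matter because a CPU utilization target is measured against each Pod's request, and the HPA then applies the documented ratio desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A sketch with integer percentages (the helper name is hypothetical, and the controller's tolerance band and stabilization window are not modeled):

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL MIN MAX
# Prints the replica count from the basic HPA ratio, rounded up and clamped.
hpa_desired() {
  desired=$(( ($1 * $2 + $3 - 1) / $3 ))   # integer ceiling of replicas * current / target
  if [ "$desired" -lt "$4" ]; then desired=$4; fi
  if [ "$desired" -gt "$5" ]; then desired=$5; fi
  echo "$desired"
}
```

With 4 replicas at 90% utilization against a 60% target, hpa_desired 4 90 60 2 10 prints 6. If the result equals maxReplicas while load stays high, the ceiling rather than the metrics pipeline is what prevents scaling.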
Quick reference: status → action
| What you see | First action |
|---|---|
| Pending | describe pod → scheduler / PVC / quota |
| ImagePullBackOff | Verify image + imagePullSecrets |
| CrashLoopBackOff | logs --previous |
| OOMKilled | Raise memory limit; profile heap |
| CreateContainerConfigError | Secret/ConfigMap keys |
| Running, not Ready | Readiness probe + exec curl |
| Service works via port-forward only | Ingress / endpoints / selector |
| PVC Pending | StorageClass + CSI |
| Forbidden | auth can-i + bindings |
| Rollout stuck | describe deploy + rollout undo |
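The Pod-status rows of this table translate naturally into a lookup for on-call scripts. A sketch, assuming the argument is a value from the STATUS column of kubectl get pods (the function name and the fallback message are hypothetical):

```shell
# first_action STATUS: print the first troubleshooting step for a Pod status.
first_action() {
  case "$1" in
    Pending)                       echo "describe pod: check scheduler, PVC, quota" ;;
    ImagePullBackOff|ErrImagePull) echo "verify image and imagePullSecrets" ;;
    CrashLoopBackOff)              echo "logs --previous" ;;
    OOMKilled)                     echo "raise memory limit; profile heap" ;;
    CreateContainerConfigError)    echo "check Secret/ConfigMap keys" ;;
    *)                             echo "describe pod and read Events" ;;
  esac
}
```

Anything not listed falls through to the universal workflow: describe the Pod and read its Events.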
Tools worth adding (after kubectl)
- kubectl debug: ephemeral debug containers (copy target namespace/kubeconfig).
- stern: multi-pod log tail by label.
- k9s: fast cluster navigation.
- Prometheus/Grafana: correlate restarts with memory and latency.
- Network flow: Cilium Hubble, Calico flow logs when policies are suspect.
Related posts
- K8s hands-on Part 5 — debug basics
- Kubernetes architecture (simple)
- Storage: PV, PVC, StorageClass
- Cluster RBAC
- Incident response — staying calm
- GitOps principles