Kubernetes · 5 Jun 2026 · Playbook · By Babulal Tamang

The Kubernetes Troubleshooting Playbook

Most production incidents fall into a dozen familiar buckets—scheduling, images, crashes, config, networking, storage, RBAC, rollouts. This playbook gives you a repeatable order of operations, symptom-by-symptom fixes, and the commands operators reach for daily.

In short

Observe (get → describe → logs → events) → classify the failure layer → apply the matching fix → verify with endpoints, probes, and a second user’s eyes. Bookmark this page for on-call.

How to use this playbook

Each section follows the same shape: symptom → what it usually means → commands → fix → prevent recurrence. Start with the universal workflow below; jump to the section that matches what you see in kubectl get pods or your alert.

New to Kubernetes debugging? Read Hands-On Part 5: Debug and next steps first, then return here for production-scale failure modes.

Universal workflow (every incident)

Scope: namespace, workload name, when it started (deploy? node drain? cert expiry?).
Observe: kubectl get pods,deploy,svc,ingress -n <ns>
Explain: kubectl describe pod <pod> -n <ns> — read Events at the bottom.
Logs: kubectl logs <pod> -n <ns> [-c container] [--previous]
Timeline: kubectl get events -n <ns> --sort-by='.lastTimestamp'
Verify fix: readiness, endpoints, curl from a debug pod, synthetic check.

# Shared context for the commands below (set NS and APP for your workload)
export NS=production APP=my-api
kubectl get pods -n $NS -l app.kubernetes.io/name=$APP -o wide
kubectl describe pod -n $NS -l app.kubernetes.io/name=$APP | tail -40
kubectl logs -n $NS -l app.kubernetes.io/name=$APP --tail=100 --all-containers=true
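
The Observe step can be narrowed to only the Pods that need attention. The following is a minimal sketch, assuming the default `kubectl get pods` column order (NAME, READY, STATUS, ...); the function name `unhealthy_pods` is hypothetical and not part of kubectl.

```shell
# List Pods that are not Running, or are Running with fewer Ready containers
# than expected. Assumes default columns: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  kubectl get pods -n "$1" --no-headers | awk '
    {
      split($2, ready, "/")          # READY looks like "1/2"
      if ($3 == "Completed") next    # finished Jobs are not a problem
      if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
    }'
}

# Usage: unhealthy_pods "$NS"
```

Anything this prints is a candidate for `kubectl describe pod` and `kubectl logs`; an empty result suggests the problem sits in another layer (Service, Ingress, node).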

Failure layers (where to look first)

Layer	Typical symptoms	First commands
Cluster / control plane	API timeouts, nothing schedules	`kubectl get --raw '/readyz?verbose'` (`/healthz` is deprecated), control-plane logs
Node	`NotReady`, widespread `Pending`	`kubectl describe node`, kubelet logs
Scheduler	Pod `Pending`, taint/toleration messages	`describe pod` Events
Workload	`CrashLoopBackOff`, probe failures	`logs --previous`, `describe`
Config	`CreateContainerConfigError`	Secrets, ConfigMaps, env refs
Network	timeouts, 502/503	`get endpoints`, DNS, NetworkPolicy
Storage	PVC `Pending`, mount errors	`get pvc,pv`, CSI driver pods
RBAC	`Forbidden` from app or CI	`kubectl auth can-i`

Playbook: Pod stuck in Pending

Symptom: Pod never reaches Running; Events mention scheduling, volumes, or quotas.

Common causes:

Insufficient CPU/memory on nodes (requests too high).
Node selector, affinity, or taints without matching tolerations.
PVC not bound (unbound immediate PersistentVolumeClaims).
ResourceQuota or LimitRange exceeded in namespace.
PodSecurity / admission webhook rejection (the Pod is usually not created at all, so check Events on the ReplicaSet or Job rather than on a Pod).

kubectl describe pod <pod> -n <ns>
kubectl describe nodes | grep -A5 "Allocated resources"
kubectl get pvc -n <ns>
kubectl describe resourcequota -n <ns>

Fix: Right-size requests, add nodes (cluster autoscaler), fix PVC/StorageClass, add tolerations only when justified, or raise quota. Prevent: Set realistic requests/limits; monitor allocatable vs requested; test PVC provisioning in staging.
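
When many Pods are Pending, reading each `describe` output is slow. A small sketch that pulls only the scheduler's rejections from the Events API, assuming Events have not yet expired in your cluster; the function name `scheduling_failures` is hypothetical.

```shell
# Show scheduler rejections for a namespace, oldest first.
# "reason" is one of the field selectors the Events API supports.
scheduling_failures() {
  kubectl get events -n "$1" \
    --field-selector reason=FailedScheduling \
    --sort-by=.lastTimestamp \
    -o custom-columns=POD:.involvedObject.name,MESSAGE:.message
}

# Usage: scheduling_failures <ns>
```

The MESSAGE column typically names the constraint that failed (resources, taints, affinity, or volume binding), which tells you which of the causes above to pursue.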

Playbook: ImagePullBackOff / ErrImagePull

Symptom: Failed to pull image in Events.

Common causes: Wrong image name or tag; image deleted from registry; private registry without imagePullSecrets; registry rate limits; architecture mismatch (arm64 vs amd64).

kubectl describe pod <pod> -n <ns> | grep -i image
# Private registry
kubectl create secret docker-registry regcred \
  --docker-server=<registry> --docker-username=<u> --docker-password=<p> -n <ns>
# Reference in pod spec: imagePullSecrets: [{ name: regcred }]

Fix: Correct image digest/tag; attach pull secret to ServiceAccount or Pod; use immutable tags in prod. Prevent: CI verifies image exists before deploy; pin digests for critical services.

Playbook: CrashLoopBackOff

Symptom: Container starts, exits, backoff increases; restart count climbs.

Common causes: Application panic on boot; wrong command/args; missing env or config file; listening on wrong port; migration job logic in long-running container; liveness probe killing app too aggressively.

kubectl logs <pod> -n <ns> --previous
kubectl logs <pod> -n <ns> -c <container> --previous
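# Run the same image locally with the same command/env.
# Kubernetes "command" maps to --entrypoint; "args" go after the image name.
docker run --rm -it --env-file <env-file> --entrypoint <command> <image> <args>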

Fix: Fix app startup; mount ConfigMap/Secret; align probe initialDelaySeconds with real boot time; split init logic into initContainers. Prevent: Staging deploy with identical env; startup probes for slow apps.
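
The exit code of the previous container run often narrows the cause before you read any logs. A minimal sketch: `last_exit` reads the recorded termination state, and `explain_exit_code` maps a few common codes to a likely cause. Both function names are hypothetical, and the explanations are conventions rather than guarantees, since applications can choose their own exit codes.

```shell
# Print name, restart count, last exit reason and exit code per container.
last_exit() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Map a container exit code to the usual suspect.
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit: a long-running container should not exit; check command/args" ;;
    1)   echo "general application error: read logs --previous" ;;
    126) echo "command found but not executable: check file permissions" ;;
    127) echo "command not found: check command/args against the image" ;;
    137) echo "SIGKILL (128+9): OOM kill or forced kill after the grace period" ;;
    143) echo "SIGTERM (128+15): container was asked to stop, e.g. by a failed liveness probe" ;;
    *)   echo "application-specific code: read logs --previous" ;;
  esac
}

# Usage: last_exit <pod> <ns>; explain_exit_code 137
```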

Playbook: OOMKilled (exit 137)

Symptom: Last State: Terminated, Reason: OOMKilled; sudden restarts under load.

Common causes: Memory limit too low; memory leak; JVM/Node heap not aligned with container limit; no limits set (the node runs short of memory and evicts Pods instead).

kubectl describe pod <pod> -n <ns> | grep -A2 Limits
kubectl top pod <pod> -n <ns>   # needs metrics-server

Fix: Raise resources.limits.memory after profiling; set requests ≈ steady state; configure app heap ≤ ~75% of limit. Prevent: Load tests; alerts on memory working set vs limit; VPA or rightsizing reviews.
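
The "heap ≤ ~75% of limit" rule is simple arithmetic that is easy to get wrong under pressure. A small sketch that converts a memory limit to a suggested heap size in MiB; the helper names `to_mib` and `heap_mib` are hypothetical, and only whole-number `Mi` and `Gi` quantities are handled.

```shell
# Convert a Kubernetes memory quantity (integer Mi or Gi only) to MiB.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Suggest a max heap in MiB at a percentage of the limit (default 75).
heap_mib() {
  limit_mib="$(to_mib "$1")" || return 1
  echo $(( limit_mib * ${2:-75} / 100 ))
}

# Usage: heap_mib 512Mi      # prints 384
#        heap_mib 2Gi        # prints 1536
#        heap_mib 1Gi 50     # prints 512
```

For a 512Mi limit this gives 384, which could be passed to a JVM as `-Xmx384m` or to Node.js as `--max-old-space-size=384`. The remaining quarter is headroom for non-heap memory such as thread stacks and native buffers; the right percentage depends on the workload.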

Playbook: CreateContainerConfigError

Symptom: Pod cannot start; Events reference missing key or secret.

Common causes: Secret/ConfigMap key typo; object in wrong namespace; a reference to a key that may be absent but is not marked optional: true; projected volume path clash.

kubectl get secret,configmap -n <ns>
kubectl describe pod <pod> -n <ns>
# Compare envFrom / volumeMount keys to actual data keys

Fix: Create or fix Secret/ConfigMap; sync from External Secrets Operator; fix Helm/Kustomize templates. Prevent: Pre-deploy validation in CI; never hand-edit prod Secrets without GitOps trail.

Playbook: Running but not Ready (probe failures)

Symptom: Pod Running, READY 0/1; Service has no endpoints; Ingress returns 502 or 503.

Common causes: Readiness probe wrong path/port/scheme; app listens on 127.0.0.1 only; dependency (DB) down; probe too aggressive during warmup.

kubectl describe pod <pod> -n <ns> | grep -A10 "Liveness\|Readiness"
kubectl exec -it <pod> -n <ns> -- wget -qO- http://127.0.0.1:<port>/health
# Or curl from debug pod to Service ClusterIP

Fix: Match probe to real health endpoint; use startupProbe for slow boot; fix upstream dependency. Prevent: Chart defaults reviewed per app; synthetic checks in staging.

Playbook: Service has no endpoints

Symptom: kubectl get endpoints <svc> shows empty subsets; in-cluster DNS resolves but connection refused or timeout.

Common causes: Service selector does not match Pod labels; Pods not Ready; targetPort that does not match a container port (by name or number); headless Service misuse.

kubectl get svc <svc> -n <ns> -o yaml | grep -A5 selector
kubectl get pods -n <ns> --show-labels
kubectl get endpoints <svc> -n <ns>
kubectl run debug --rm -it --image=curlimages/curl --restart=Never -- \
  curl -v http://<svc>.<ns>.svc.cluster.local:<port>/

Fix: Align labels and selectors; fix targetPort to container port; ensure readiness passes. See first workloads (Service).
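
Comparing selectors and labels by eye is error-prone. A minimal sketch that turns a Service's selector into a label query and lists the Pods it actually matches; the function names `svc_selector` and `pods_for_svc` are hypothetical.

```shell
# Render a Service's selector as a label query, e.g. "app=web,tier=api".
svc_selector() {
  kubectl get svc "$1" -n "$2" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' \
    | sed 's/,$//'
}

# List the Pods that selector matches; no Pods here explains empty endpoints.
pods_for_svc() {
  sel="$(svc_selector "$1" "$2")"
  [ -n "$sel" ] || { echo "service $1 has no selector" >&2; return 1; }
  kubectl get pods -n "$2" -l "$sel" -o wide
}

# Usage: pods_for_svc <svc> <ns>
```

If the list is empty, the selector and labels disagree. If Pods are listed but READY shows 0/1, the selector is fine and the readiness probe is the next thing to check.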

Playbook: DNS failures (NXDOMAIN / timeout)

Symptom: App logs no such host for *.svc.cluster.local; intermittent resolution.

Common causes: Wrong service name or namespace; CoreDNS pods unhealthy; ndots search path issues; NetworkPolicy blocking port 53 (UDP and TCP); custom DNS config in Pod spec.

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local

Fix: Use FQDN service.namespace.svc.cluster.local; repair CoreDNS; adjust NetworkPolicy; fix dnsPolicy/dnsConfig.
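
The ndots behaviour is easier to reason about with a concrete rule. Pods using the default ClusterFirst DNS policy get `ndots:5` in `/etc/resolv.conf`, so a name with fewer than five dots is tried against each search domain before being tried as written. The sketch below models that decision for a glibc-style resolver; the function names are hypothetical, and other resolvers (musl, for example) may behave differently.

```shell
# Count the dots in a hostname.
dot_count() {
  dots="$(printf '%s' "$1" | tr -cd '.')"
  echo "${#dots}"
}

# Report how a glibc-style resolver treats NAME for a given ndots (default 5).
lookup_order() {
  name="$1"; ndots="${2:-5}"
  case "$name" in
    *.) echo "absolute"; return ;;   # trailing dot: search list is skipped
  esac
  if [ "$(dot_count "$name")" -ge "$ndots" ]; then
    echo "as-is first"
  else
    echo "search list first"
  fi
}

# Usage: lookup_order my-svc.prod.svc.cluster.local    # prints "search list first"
#        lookup_order my-svc.prod.svc.cluster.local.   # prints "absolute"
```

Note that `service.namespace.svc.cluster.local` has only four dots, so it still walks the search list unless it ends with a trailing dot. It resolves correctly either way; the extra lookups mainly matter for external names queried at high volume.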

Playbook: Ingress returns 404 / 502 / 504

Symptom: External URL fails; in-cluster Service works via port-forward.

Common causes: No ingress controller installed or running; wrong ingressClassName; TLS secret missing or expired; backend Service has no endpoints; path/rule mismatch; timeout too low for slow upstream.

kubectl get ingress -n <ns>
kubectl describe ingress <name> -n <ns>
kubectl get pods -n ingress-nginx   # or your controller namespace
kubectl get certificate -n <ns>    # cert-manager

Fix: Install/fix controller; align host/path rules; renew certs; increase proxy timeouts. Prevent: cert-manager with alerts 30 days before expiry.

Playbook: NetworkPolicy blocked traffic

Symptom: Worked until policy applied; works from some namespaces only; timeouts with no app error.

Common causes: Egress denied to DNS, API, or database; ingress only from wrong namespace label; CNI does not enforce NetworkPolicy.

kubectl get networkpolicy -n <ns>
# Temporarily test: clone policy in staging with broader rules, narrow down
# Many teams use policy visualization tools or eBPF flow logs

Fix: Add allow rules for required labels/ports (including kube-dns); document expected flows. Prevent: Default-deny only after mapping dependencies.
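
A DNS allow rule is the one most often forgotten when egress is locked down. The sketch below prints a NetworkPolicy that lets every Pod in a namespace reach cluster DNS on port 53. It assumes DNS Pods run in `kube-system` with the label `k8s-app=kube-dns` and that namespaces carry the automatic `kubernetes.io/metadata.name` label; the policy name `allow-dns-egress` and the function name are hypothetical. Clusters using a node-local DNS cache may need a different rule.

```shell
# Print a NetworkPolicy allowing DNS egress for all Pods in namespace $1.
dns_egress_policy() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: $1
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}

# Usage (validate without applying):
#   dns_egress_policy staging | kubectl apply --dry-run=server -f -
```

NetworkPolicies are additive, so this rule can sit alongside a default-deny policy without weakening any other restriction.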

Playbook: PVC Pending / volume mount failures

Symptom: PVC Pending; Pod Events: failed to mount volume or Multi-Attach error.

Common causes: No StorageClass or provisioner down; zone mismatch; RWO volume attached to second node; CSI driver crash; fsGroup permission issues.

kubectl get pvc,pv,storageclass
kubectl describe pvc <claim> -n <ns>
kubectl get pods -n kube-system | grep -i csi

Fix: Correct StorageClass; ensure single writer for RWO; restart CSI node plugin if stuck; use RWX or shared storage when multiple replicas need disk. Deep dive: PV, PVC, and StorageClass.

Playbook: Deployment rollout stuck

Symptom: kubectl rollout status hangs; old ReplicaSet still serves traffic; ProgressDeadlineExceeded.

Common causes: New Pods never become Ready; maxUnavailable / maxSurge with too few replicas; PDB blocking a node drain that runs during the rollout (PDBs gate evictions, not the rolling update itself); image pull failure on new version only.

kubectl rollout status deployment/<name> -n <ns>
kubectl describe deployment <name> -n <ns>
kubectl get rs -n <ns> -l app=<app>
kubectl rollout undo deployment/<name> -n <ns>   # emergency

Fix: Fix new Pod template; pause rollout (kubectl rollout pause), fix, resume; undo if bad release. Prevent: Canary or blue-green; automated smoke tests in pipeline.
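
In a pipeline, the wait-then-undo decision can be automated. A minimal sketch, assuming an automatic rollback is acceptable for the service in question; the function name `rollout_or_undo` and the 120s default are hypothetical choices.

```shell
# Wait for a rollout; undo it if it does not finish within the timeout.
rollout_or_undo() {
  deploy="$1"; ns="$2"; timeout="${3:-120s}"
  if kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout"; then
    echo "rollout of $deploy complete"
  else
    echo "rollout of $deploy did not complete in $timeout; rolling back" >&2
    kubectl rollout undo "deployment/$deploy" -n "$ns"
    return 1
  fi
}

# Usage: rollout_or_undo <name> <ns> 300s
```

The function returns non-zero after a rollback so the pipeline still fails and someone investigates the bad release. Capture `kubectl describe deployment` and the new Pods' logs before the undo if you need evidence for the post-mortem.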

Playbook: Forbidden (RBAC)

Symptom: CI, operator, or in-cluster app gets 403 Forbidden from API.

kubectl auth can-i create pods --as=system:serviceaccount:<ns>:<sa> -n <ns>
kubectl describe rolebinding,clusterrolebinding -n <ns> | grep -A3 Subjects

Fix: Grant least-privilege Role/RoleBinding; fix wrong ServiceAccount on Pod. Guide: Kubernetes cluster RBAC.
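
A single `can-i` check rarely covers everything a workload needs. The sketch below loops over several verbs for one ServiceAccount and resource; the function name `sa_can_i` is hypothetical, and running it requires permission to impersonate the ServiceAccount.

```shell
# Print yes/no for each verb a ServiceAccount may need on a resource.
sa_can_i() {
  ns="$1"; sa="$2"; resource="$3"; shift 3
  for verb in "$@"; do
    # can-i exits non-zero on "no", so the status is deliberately ignored.
    answer="$(kubectl auth can-i "$verb" "$resource" \
      --as="system:serviceaccount:$ns:$sa" -n "$ns" 2>/dev/null)" || true
    printf '%-8s %s: %s\n' "$verb" "$resource" "${answer:-no}"
  done
}

# Usage: sa_can_i <ns> <sa> pods get list watch create delete
```

For a full picture rather than specific verbs, `kubectl auth can-i --list --as=system:serviceaccount:<ns>:<sa> -n <ns>` prints everything the account is allowed to do in the namespace.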

Playbook: Node NotReady / eviction storm

Symptom: Many Pods rescheduling; node NotReady; disk pressure or memory pressure taints.

kubectl describe node <node>
kubectl get pods -A --field-selector spec.nodeName=<node>
# On node (if SSH allowed): journalctl -u kubelet -f

Fix: Free disk (image gc, log rotation); fix kubelet/CNI; cordon/drain bad node; replace hardware. Prevent: Node problem detector alerts; PDBs so one node loss does not take the service down.

Playbook: API server slow or failing

Symptom: kubectl timeouts; controllers lag; etcd alarms.

Common causes: etcd latency or full disk; excessive objects (thousands of Secrets); admission webhook timeout; audit log volume.

Fix: Scale control plane; defragment/compaction per runbook; fix webhook; reduce list-watch churn. Escalate to platform/SRE—this is rarely fixed from a single namespace.

Playbook: Certificate and TLS errors

Symptom: Browser or client reports an expired certificate; Ingress TLS handshake fails; mesh mTLS rejects connections.

kubectl get certificate,certificaterequest -n <ns>
kubectl describe certificate <name> -n <ns>
openssl s_client -connect <host>:443 -servername <host>

Fix: Renew via cert-manager ClusterIssuer; fix DNS-01/HTTP-01 challenge; rotate Istio/mesh certs. Prevent: Alert on cert expiry < 14 days.

Playbook: HPA not scaling

Symptom: Load high but replica count unchanged; FailedGetScale or metrics unavailable.

kubectl describe hpa <name> -n <ns>
kubectl top pods -n <ns>
kubectl get apiservice | grep metrics

Fix: Install metrics-server or Prometheus adapter; set correct scaleTargetRef; define requests so CPU utilization is meaningful; check min/max replicas. See metrics-server in depth and HPA in depth.
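
Checking the HPA's arithmetic by hand shows whether "not scaling" is a bug or the expected result. For resource metrics the controller computes desired replicas as ceil(currentReplicas × currentUtilization / targetUtilization), where utilization is usage divided by the container's request. The sketch below reproduces that with integer math; the function names are hypothetical, and it ignores the controller's tolerance band and stabilization windows.

```shell
# CPU utilization in percent: usage / request, both in millicores.
cpu_utilization() {
  echo $(( $1 * 100 / $2 ))
}

# desired = ceil(current_replicas * current_utilization / target_utilization)
hpa_desired() {
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# Usage: cpu_utilization 450 500   # prints 90
#        hpa_desired 4 90 60       # prints 6
```

This is why requests matter: without a CPU request there is no denominator, so utilization cannot be computed. If the result exceeds `maxReplicas`, the HPA is working as configured and the ceiling is the thing to revisit.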

Quick reference: status → action

What you see	First action
`Pending`	`describe pod` → scheduler / PVC / quota
`ImagePullBackOff`	Verify image + `imagePullSecrets`
`CrashLoopBackOff`	`logs --previous`
`OOMKilled`	Raise memory limit; profile heap
`CreateContainerConfigError`	Secret/ConfigMap keys
Running, not Ready	Readiness probe + `exec` curl
Service works via port-forward only	Ingress / endpoints / selector
PVC `Pending`	StorageClass + CSI
`Forbidden`	`auth can-i` + bindings
Rollout stuck	`describe deploy` + `rollout undo`

Tools worth adding (after kubectl)

kubectl debug: ephemeral debug containers (--target shares the target container's process namespace; --copy-to debugs a copy of the Pod).
stern: multi-pod log tail by label.
k9s: fast cluster navigation.
Prometheus/Grafana: correlate restarts with memory and latency.
Network flow: Cilium Hubble, Calico flow logs when policies are suspect.
