Debug Like an Operator, Then Choose Your Next Path
Clusters fail in predictable categories: scheduling, image pull, crash loop, configuration, networking. A short debugging order and a handful of commands will carry you through most beginner incidents on kind, k3s, or minikube—and the same order works in production.
In short
Use get → describe → logs → events. Fix ImagePullBackOff, CrashLoopBackOff, and probe failures with intent. Then deepen with GitOps, observability, and structured certification practice.
The debugging order (memorize this)
- kubectl get pods -n <ns>: phase and restarts at a glance.
- kubectl describe pod <name> -n <ns>: events, node, probes, volumes.
- kubectl logs <name> -n <ns> [--previous]: stdout/stderr; add --previous for crashed containers.
- kubectl get events -n <ns> --sort-by='.lastTimestamp': timeline when describe is noisy.
For Deployments add kubectl describe deployment and kubectl rollout status. For Services add kubectl get endpoints and verify selectors.
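The order above can be wrapped in a small helper so it becomes muscle memory. This is a minimal sketch, not a standard tool: the function name triage_pod and the --tail=50 limit are assumptions chosen for illustration, while the kubectl subcommands and flags are the ones listed above.

```shell
# triage_pod: run the get -> describe -> logs -> events sequence for one Pod.
# Usage: triage_pod <pod-name> <namespace>
# The function name and the 50-line tail are illustrative choices.
triage_pod() {
  local pod="$1" ns="$2"
  echo "== get =="
  kubectl get pod "$pod" -n "$ns" -o wide
  echo "== describe =="
  kubectl describe pod "$pod" -n "$ns"
  echo "== logs (current container) =="
  kubectl logs "$pod" -n "$ns" --tail=50
  echo "== logs (previous container) =="
  # --previous fails when the container never restarted; treat that as normal.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null \
    || echo "(no previous container)"
  echo "== events =="
  kubectl get events -n "$ns" --sort-by='.lastTimestamp'
}
```

Running the steps in a fixed order matters more than the helper itself: each step narrows the search before the next one adds detail.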
Common Pod states and what they mean
| Symptom | Likely cause | What to do |
|---|---|---|
| Pending | No node capacity, taints, PVC not bound | describe pod → Events; get nodes |
| ImagePullBackOff | Wrong name/tag, private registry auth | Fix image; create secret docker-registry if private |
| CrashLoopBackOff | App exits on start, bad command, missing config | logs --previous; run image locally with same command |
| CreateContainerConfigError | Missing Secret/ConfigMap key | describe pod; verify referenced objects exist |
| Running but not Ready | Readiness probe failing | Check probe path/port; test inside Pod with kubectl exec |
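The table above can be folded into a lookup you keep in your shell profile. This is a sketch under assumptions: next_step is a hypothetical name, and ErrImagePull is added here because it is the status a Pod typically shows just before it backs off to ImagePullBackOff.

```shell
# next_step: print the first thing to check for a given Pod status string.
# The mapping mirrors the table above; the function name is illustrative.
# Example (STATUS is the third column of kubectl get pod output):
#   next_step "$(kubectl get pod <name> -n <ns> --no-headers | awk '{print $3}')"
next_step() {
  case "$1" in
    Pending)
      echo "describe pod and read Events; then get nodes" ;;
    ImagePullBackOff|ErrImagePull)
      echo "check image name/tag; add a docker-registry secret if the registry is private" ;;
    CrashLoopBackOff)
      echo "logs --previous; run the image locally with the same command" ;;
    CreateContainerConfigError)
      echo "describe pod; verify the referenced Secret/ConfigMap and key exist" ;;
    Running)
      echo "if not Ready, check the readiness probe path/port with kubectl exec" ;;
    *)
      # Unknown status: the Events section of describe is the safest start.
      echo "describe pod and read Events" ;;
  esac
}
```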
Interactive debugging
kubectl exec -it deploy/web -n learn-dev -- sh
# inside the container: wget -qO- http://127.0.0.1/ (or curl; on Alpine-based images it may need installing with apk first)
kubectl run tmp-curl --rm -it --image=curlimages/curl --restart=Never -- \
curl -s http://web.learn-dev.svc.cluster.local/
DNS names follow <service>.<namespace>.svc.cluster.local. If curl from another Pod fails, you have a Service or NetworkPolicy problem—not “the internet is down.”
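When the curl test fails, the usual next question is whether the Service has any backing Pods at all. The sketch below builds the DNS name and checks for endpoint IPs. It assumes the default cluster domain cluster.local; the function names svc_fqdn and service_has_endpoints are hypothetical.

```shell
# svc_fqdn: build the in-cluster DNS name for a Service.
# Assumes the default cluster domain, cluster.local.
svc_fqdn() {
  printf '%s.%s.svc.cluster.local\n' "$1" "$2"
}

# service_has_endpoints: succeed only if the Service lists at least one ready Pod IP.
# An empty result usually means the selector matches no Ready Pods.
service_has_endpoints() {
  local svc="$1" ns="$2" ips
  ips="$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')"
  [ -n "$ips" ]
}

# Example:
#   service_has_endpoints web learn-dev || echo "no endpoints: compare Service selector with Pod labels"
```

If endpoints exist and the request still fails, the selector is fine and a NetworkPolicy or the wrong port is the more likely culprit.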
When port-forward works but Ingress does not
Ingress needs an ingress controller (nginx, traefik, etc.). Local clusters often do not install one by default. For learning, port-forward and minikube tunnel (when documented for your driver) are enough. Treat Ingress as a follow-on topic after Services make sense.
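A quick way to tell which situation you are in is to look for an IngressClass, since controllers normally register one when installed. This is a rough heuristic rather than a definitive check. The names has_ingress_class and check_ingress_path are hypothetical, and the ports 8080 and 80 are assumptions about the web Service.

```shell
# has_ingress_class: succeed if the cluster has at least one IngressClass.
# No IngressClass is a strong hint, not proof, that no controller is installed.
has_ingress_class() {
  [ -n "$(kubectl get ingressclass -o name 2>/dev/null)" ]
}

# check_ingress_path: suggest how to reach the web Service in learn-dev.
# Local port 8080 and Service port 80 are assumed values.
check_ingress_path() {
  if has_ingress_class; then
    echo "IngressClass found: inspect the Ingress with kubectl describe ingress"
  else
    echo "no IngressClass: use kubectl port-forward svc/web 8080:80 -n learn-dev for now"
  fi
}
```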
Observability: the next layer
- Metrics: kubectl top pods (requires metrics-server on many clusters); see metrics-server in depth.
- Dashboards: Prometheus/Grafana in a later lab, not required on day one.
- Tracing: OpenTelemetry when you operate microservices at scale.
The architecture post’s incident mental model—API → nodes → schedule → container → network—still applies; metrics tell you where in that chain to look.
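Before relying on kubectl top, it helps to confirm the metrics API is actually being served. The sketch below assumes metrics-server registers the APIService v1beta1.metrics.k8s.io, which is its usual registration; the function names metrics_ready and top_pods are hypothetical.

```shell
# metrics_ready: succeed if the metrics API is registered and reports Available.
metrics_ready() {
  local status
  status="$(kubectl get apiservice v1beta1.metrics.k8s.io \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)"
  [ "$status" = "True" ]
}

# top_pods: list Pods in a namespace sorted by memory, or say why that is not possible.
top_pods() {
  local ns="$1"
  if metrics_ready; then
    kubectl top pods -n "$ns" --sort-by=memory
  else
    echo "metrics API not available; install metrics-server first"
  fi
}
```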
Where to go after this series
- Production playbook: Kubernetes troubleshooting playbook — symptom-by-symptom fixes for on-call.
- GitOps: Git as the control plane — store manifests in Git, automate sync.
- Platform context: DevOps life and business value, cloud platform evolution.
- Structured practice: Kubernetes official tutorials; CKA/CKAD-style tasks (multi-object YAML under time pressure).
- Production topics: Ingress, PersistentVolumes, StatefulSets, Helm/Kustomize, network policies, pod disruption budgets.
Series recap
- Local lab — kind, k3s, or minikube.
- YAML anatomy — apiVersion, kind, metadata, spec, labels.
- First workloads — Deployment and Service.
- Day-one practices — namespaces, labels, resources, security.
- This post — debug and roadmap.
You now have a loop: declare in YAML → apply → observe → fix → commit. That loop is the job—whether the cluster lives on your laptop or in three regions.
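One hedged way to picture that loop in the shell: apply, wait for the rollout, and commit only when it succeeds. The function name ship, the 90-second timeout, and the commit message format are all assumptions for illustration, and a GitOps setup would invert the order by committing first and letting a controller apply.

```shell
# ship: apply a manifest, wait for the rollout, and commit only if it succeeds.
# Usage: ship <manifest.yaml> <deployment-name> <namespace>
# Note: -n must match any namespace set inside the manifest, or apply will refuse it.
ship() {
  local manifest="$1" deploy="$2" ns="$3"
  kubectl apply -f "$manifest" -n "$ns" || return 1
  if kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=90s; then
    git add "$manifest" && git commit -m "Apply $manifest"
  else
    # Observe before fixing: events usually explain a stalled rollout.
    kubectl get events -n "$ns" --sort-by='.lastTimestamp'
    return 1
  fi
}
```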