Kubernetes Troubleshooting Techniques Every Engineer Should Know
A systematic guide to diagnosing and fixing common Kubernetes issues using kubectl, events, logs, networking tests, and live metrics — from crash loops to silent quota failures.
June 2026 · 8 min read
You’ve deployed your app to Kubernetes, and everything is running smoothly—until it isn’t. A pod crashes, a service stops responding, or your ingress stops resolving. Panic sets in, but it doesn’t have to.
Kubernetes troubleshooting isn't magic. It’s a methodical, predictable process. Here’s the toolkit every engineer should have ready.
1. Start with kubectl get events
Before you dive into logs or exec into a container, check events. Events are Kubernetes’ own internal log of what’s been happening in a namespace. They often reveal the root cause quickly—resource limits, image pull failures, or node pressure.
kubectl get events --sort-by='.lastTimestamp'
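This single command can save you 20 minutes of digging. Look for reasons such as FailedMount, BackOff, or FailedScheduling. An out-of-memory kill usually shows up as the container’s termination reason (OOMKilled) in kubectl describe rather than as a pod event. Events are kept for only about an hour by default, so check them soon after the failure.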
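To cut the noise in a busy namespace, it can help to show warnings only. This is a minimal sketch: warning_events is a hypothetical helper name, while --field-selector and --sort-by are standard kubectl flags.

```shell
# warning_events is a hypothetical helper name, not a kubectl built-in.
# Usage: warning_events <namespace> [pod-name]
warning_events() (
  ns="$1"
  pod="${2:-}"
  selector="type=Warning"
  # Optionally narrow the list to events that refer to one pod.
  if [ -n "$pod" ]; then
    selector="$selector,involvedObject.name=$pod"
  fi
  kubectl get events -n "$ns" \
    --field-selector "$selector" \
    --sort-by='.lastTimestamp'
)
```

For example, warning_events prod web-1 would list only the warnings recorded for a pod named web-1 in the prod namespace (both names are placeholders).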
2. Describe everything
kubectl describe is your second-best friend. It gives you the complete state of a resource: labels, conditions, volumes, and recent events.
kubectl describe pod <pod-name>
Pay special attention to:
- Conditions – Is the pod Ready? Is Initialized true?
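- Container Status – Waiting, Running, or Terminated? If terminated, the reason and exit code tell you what happened: 1 usually means the application exited with an error, and 137 means the container was killed with SIGKILL (128 + 9), which is an OOM kill when the reason is OOMKilled.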
- Volumes – Did a PVC mount fail? Often a typo in a storage class name.
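Exit codes above 128 follow the usual shell convention of 128 plus the signal number. The sketch below decodes that convention; explain_exit_code is a hypothetical helper, and the wording of its messages is only illustrative.

```shell
# explain_exit_code is a hypothetical helper. It only applies the convention
# that an exit code above 128 means "killed by signal (code - 128)".
explain_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exit 0: container exited cleanly"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "exit $code: killed by SIGKILL (an OOM kill is one common cause)" ;;
      15) echo "exit $code: terminated by SIGTERM (shutdown was requested)" ;;
      *)  echo "exit $code: killed by signal $sig" ;;
    esac
  else
    echo "exit $code: set by the application or its entrypoint"
  fi
}
```

The code itself can be read from the pod status with kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'.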
3. Check logs like a detective
kubectl logs is obvious, but most people stop there. If you have multiple containers in a pod, use the -c flag:
kubectl logs <pod-name> -c <container-name>
For a crash-looping pod, add --previous to see logs from the previous attempt:
kubectl logs <pod-name> --previous
This is gold. Your app might be printing the error on startup before it dies—you just need the last run’s output.
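When a pod has several containers, you may not know which one crashed. A small sketch that prints the previous run’s logs for each container in turn; previous_logs is a hypothetical helper name and the 50-line tail is an arbitrary choice.

```shell
# previous_logs is a hypothetical helper name.
# Usage: previous_logs <pod-name> [namespace]
previous_logs() (
  pod="$1"
  ns="${2:-default}"
  # List the container names declared in the pod spec.
  names="$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[*].name}')"
  for c in $names; do
    echo "=== $c (previous run) ==="
    # --previous fails for a container that has never restarted; keep going.
    kubectl logs "$pod" -n "$ns" -c "$c" --previous --tail=50 || true
  done
)
```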
4. Diagnose networking with a temporary pod
When a service isn’t reachable, don’t guess. Spin up a temporary pod in the same namespace and test connectivity:
kubectl run -it --rm debug --image=busybox -- sh
Inside, try:
- wget -qO- <service-name>.<namespace>.svc.cluster.local
- nslookup kubernetes.default
- telnet <service-name> 80
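If DNS lookups and connections succeed from here, the Service and cluster network are working, and the problem is more likely in your application’s exposed port or your ingress config. If they fail, look at the Service selector, its endpoints, and any network policies.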
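The checks above can be bundled into something you paste into the debug shell. This is a sketch under assumptions: svc_fqdn and probe_service are hypothetical names, cluster.local is only the default cluster domain, and the wget flags are the ones BusyBox provides.

```shell
# svc_fqdn is a hypothetical helper. "cluster.local" is the default cluster
# domain; some clusters are configured with a different one.
svc_fqdn() {
  # $1 = service name, $2 = namespace, $3 = cluster domain (optional)
  echo "$1.$2.svc.${3:-cluster.local}"
}

# probe_service (also hypothetical) runs a DNS lookup and an HTTP request
# against the service from inside the debug pod.
# Usage: probe_service <service-name> <namespace> [port]
probe_service() (
  host="$(svc_fqdn "$1" "$2")"
  port="${3:-80}"
  nslookup "$host" || echo "DNS lookup failed for $host"
  wget -q -T 5 -O /dev/null "http://$host:$port/" \
    && echo "HTTP reachable: $host:$port" \
    || echo "HTTP failed: $host:$port"
)
```

A failed DNS lookup points at cluster DNS or the Service name; a successful lookup with a failed request points at endpoints, ports, or network policy.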
5. Resource quotas: the silent killer
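Your pod might never be created if the namespace has resource quotas. Check with:
kubectl describe quota -n <namespace>
Compare the Used and Hard columns. If the quota allows 4 pods and 4 already exist, the API server rejects the new pod outright, so it never even reaches Pending. The rejection surfaces as a FailedCreate event on the ReplicaSet or other controller, with a message containing “exceeded quota”. A “failed quota” message means something different: the quota restricts CPU or memory and the pod spec is missing the matching requests or limits.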
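To spot a full quota without reading the whole describe output, you can compare the used and hard pod counts directly. A sketch, assuming the quota sets a pods limit; quota_pod_usage and flag_full_quotas are hypothetical helper names.

```shell
# Both helpers are hypothetical names. quota_pod_usage prints one line per
# ResourceQuota: "<name> <used pods> <hard pods>".
quota_pod_usage() {
  # $1 = namespace
  kubectl get resourcequota -n "$1" -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.used.pods}{" "}{.status.hard.pods}{"\n"}{end}'
}

# flag_full_quotas reads those lines and reports quotas with no pod headroom.
# Quotas that do not limit pods produce fewer than 3 fields and are skipped.
flag_full_quotas() {
  awk 'NF == 3 && $2 >= $3 { print $1 " is at its pod limit (" $2 "/" $3 ")" }'
}
```

Usage would be quota_pod_usage <namespace> | flag_full_quotas; no output means no quota in that namespace has run out of pods.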
6. Use kubectl top for live metrics
When a pod is running but slow, don’t guess. Check real-time CPU and memory usage:
kubectl top pod
kubectl top node
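If a pod’s CPU usage sits at its limit, it is probably being throttled; if memory climbs steadily toward the limit, suspect a leak or a limit that is set too low. kubectl top reports pod CPU in millicores and memory in Mi (percentages appear only for nodes), so compare the numbers with resources.requests and resources.limits in the pod spec. Both commands need metrics-server installed in the cluster.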
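Putting live usage next to the configured requests and limits makes the comparison easier. A minimal sketch; usage_vs_limits is a hypothetical helper name, and the exact formatting of the resources column depends on your kubectl version.

```shell
# usage_vs_limits is a hypothetical helper name.
# Usage: usage_vs_limits <pod-name> [namespace]
usage_vs_limits() (
  pod="$1"
  ns="${2:-default}"
  echo "--- live usage (needs metrics-server) ---"
  kubectl top pod "$pod" -n "$ns" --containers
  echo "--- configured requests and limits ---"
  # Print each container name followed by its resources block.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'
)
```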
7. Debug with kubectl exec and kubectl cp
Sometimes you need to look inside a running container—not just its logs.
kubectl exec -it <pod-name> -- /bin/sh
Inside, check:
- Environment variables: env | grep -i db_host
- Files: ls /app/config
- Processes: ps aux
If you need to copy a file out for analysis:
kubectl cp <pod-name>:/app/log/debug.log ./debug.log
This beats scrolling through logs when you need to inspect a configuration file or a dump. Note that kubectl cp relies on tar being present in the container image.
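For a quick check you do not always need an interactive shell. The sketch below runs env inside the container and filters the result; pod_env is a hypothetical helper name, and it assumes the image ships an env binary.

```shell
# pod_env is a hypothetical helper: it prints the environment variables of a
# running container that match a pattern (case-insensitive).
# Usage: pod_env <pod-name> <pattern>
pod_env() {
  # grep runs locally, so the container image does not need to include it.
  kubectl exec "$1" -- env | grep -i -- "$2"
}
```

Be careful where you paste the output, since environment variables often hold credentials.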
8. The kubectl context trap
One of the most common “troubleshooting” issues is checking the wrong cluster. Always verify your context:
kubectl config current-context
kubectl config get-contexts
If you’re debugging a crash in production but your context points to staging, you’ll waste hours. I’ve done it. Everyone has.
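One way to make the check automatic is a guard at the top of any debugging script. A minimal sketch; require_context is a hypothetical helper name and the context names in the usage are placeholders.

```shell
# require_context is a hypothetical helper: it stops a script early when the
# active kubectl context is not the one you meant to work in.
# Usage: require_context <expected-context> || exit 1
require_context() {
  expected="$1"
  current="$(kubectl config current-context)"
  if [ "$current" != "$expected" ]; then
    echo "Refusing to continue: context is '$current', expected '$expected'" >&2
    return 1
  fi
  echo "Context OK: $current"
}
```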
9. Worst case: kubectl delete pod (it’s fine)
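If a pod is stuck in CrashLoopBackOff and you’ve exhausted everything, a plain kubectl delete pod <pod-name> is enough. If it is stuck in Terminating, you can force delete it:
kubectl delete pod <pod-name> --grace-period=0 --force
Then let the controller recreate it, which gives you a fresh pod without changing the Deployment’s replica count. Two caveats: a bare pod with no controller is gone for good, and a force delete removes the API object without waiting for the node to confirm that the container has stopped. That matters for StatefulSet pods, where the old and new pod could briefly run at the same time.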
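Before deleting, it is worth confirming that something will actually recreate the pod. A sketch that reads the pod’s owner references; pod_owner is a hypothetical helper name.

```shell
# pod_owner is a hypothetical helper: it reports which kind of controller,
# if any, owns a pod.
# Usage: pod_owner <pod-name> [namespace]
pod_owner() {
  owner="$(kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.metadata.ownerReferences[*].kind}')"
  if [ -n "$owner" ]; then
    echo "owned by $owner: a replacement pod should be created after deletion"
  else
    echo "no owner: deleting this pod removes it permanently"
  fi
}
```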
10. When nothing works: restart the kubelet (on the node)
If you’re managing nodes directly and see NodeNotReady or persistent pod scheduling failures, sometimes the kubelet gets wedged:
sudo systemctl restart kubelet
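Wait 30–60 seconds, then check kubectl get nodes. Restarting the kubelet does not normally restart the containers already running on the node, but the node can flap to NotReady while it comes back, so treat this as a last resort for edge cases and read the kubelet logs first.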
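Before restarting anything, a quick look at the kubelet’s own state and logs often explains the problem. A sketch, assuming the kubelet runs as a systemd unit named kubelet (as the restart command above implies); kubelet_snapshot is a hypothetical helper name.

```shell
# kubelet_snapshot is a hypothetical helper. Run it on the node itself,
# with sudo if your journal requires it.
kubelet_snapshot() {
  echo "kubelet state: $(systemctl is-active kubelet)"
  # The last 20 log lines from the kubelet unit.
  journalctl -u kubelet --no-pager -n 20
}
```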
The real trick: be systematic
Kubernetes failures tend to follow patterns: networking issues, resource exhaustion, wrong container image, or misconfigured RBAC. Approach each problem in the same order: events → describe → logs → exec → networking test. That sequence resolves most routine issues before they become incidents.
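The read-only part of that sequence is easy to script, so the first pass takes one command. A sketch covering events, describe, and logs; triage is a hypothetical helper name, and exec and the networking test are left as manual steps.

```shell
# triage is a hypothetical helper that runs the read-only checks in order.
# Usage: triage <pod-name> [namespace]
triage() (
  pod="$1"
  ns="${2:-default}"
  echo "## 1. Recent events for $pod"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" --sort-by='.lastTimestamp'
  echo "## 2. Describe"
  kubectl describe pod "$pod" -n "$ns"
  echo "## 3. Logs (current, then previous run if any)"
  kubectl logs "$pod" -n "$ns" --all-containers --tail=50
  # --previous fails when no container has restarted; that is not an error here.
  kubectl logs "$pod" -n "$ns" --all-containers --previous --tail=50 || true
)
```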
Keep this list in your notes or your IDE’s scratch file. Next time a pod goes down, you’ll already know where to look.