Kubernetes Troubleshooting Techniques Every Engineer Should Know

A systematic guide to diagnosing and fixing common Kubernetes issues using kubectl, events, logs, networking tests, and live metrics — from crash loops to silent quota failures.

June 2026 · 8 min read

You’ve deployed your app to Kubernetes, and everything is running smoothly—until it isn’t. A pod crashes, a service stops responding, or your ingress stops resolving. Panic sets in, but it doesn’t have to.

Kubernetes troubleshooting isn't magic. It’s a methodical, predictable process. Here’s the toolkit every engineer should have ready.

1. Start with kubectl get events

Before you dive into logs or exec into a container, check events. Events are Kubernetes’ own internal log of what’s been happening in a namespace. They often reveal the root cause quickly—resource limits, image pull failures, or node pressure.

kubectl get events --sort-by='.lastTimestamp'

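This single command can save you a lot of digging. Look for reasons such as FailedScheduling, FailedMount, Failed (image pulls), or BackOff. Events usually point straight at scheduling, mount, and image problems. An OOM kill, by contrast, usually shows up as the container's last state (Reason: OOMKilled) in kubectl describe pod rather than as a pod event. Events also expire, after about an hour by default, so check them soon after the failure.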
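On a busy namespace the full event list gets noisy. As a minimal sketch, the helper below narrows the output to Warning events only. The name warning_events is hypothetical; the --field-selector and --sort-by flags are standard kubectl options.

```shell
# Hypothetical helper: list only Warning events in a namespace, oldest first.
# Usage: warning_events [namespace]
warning_events() {
  ns="${1:-default}"
  kubectl get events -n "$ns" \
    --field-selector type=Warning \
    --sort-by='.lastTimestamp'
}
```

Call it as warning_events staging, replacing staging with your own namespace.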

2. Describe everything

kubectl describe is your second-best friend. It gives you the complete state of a resource: labels, conditions, volumes, and recent events.

kubectl describe pod <pod-name>

Pay special attention to:
- Conditions: Is the pod Ready? Is Initialized true?
- Container status: Waiting, Running, or Terminated? If terminated, the exit code hints at the cause: 1 usually means the application exited with an error, while 137 means the process was killed with SIGKILL, most often by the OOM killer (confirm with Reason: OOMKilled).
- Volumes: Did a PVC mount fail? A typo in the storage class name is a common culprit.
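Exit codes follow a shell convention: values above 128 mean the process was ended by a signal, and the signal number is the code minus 128. The sketch below turns that rule into a first guess. The function name explain_exit_code is hypothetical, and the wording of each message is an illustration, not kubectl output.

```shell
# Hypothetical helper: give a first-guess reading of a container exit code.
# Codes above 128 conventionally mean "ended by signal (code - 128)".
explain_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exit 0: container finished successfully"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "exit $code: killed by SIGKILL, often the OOM killer" ;;
      15) echo "exit $code: stopped by SIGTERM" ;;
      *)  echo "exit $code: terminated by signal $sig" ;;
    esac
  else
    echo "exit $code: the application itself exited with an error"
  fi
}

explain_exit_code 137
explain_exit_code 1
```

Treat the result as a hint only. A 137 can also come from a manual kill -9, so confirm against the Reason field in kubectl describe pod.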

3. Check logs like a detective

kubectl logs is obvious, but most people stop there. If you have multiple containers in a pod, use the -c flag:

kubectl logs <pod-name> -c <container-name>

For a crash-looping pod, add --previous to see logs from the previous attempt:

kubectl logs <pod-name> --previous

This is gold. Your app might be printing the error on startup before it dies—you just need the last run’s output.
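For multi-container pods, the two flags combine. A rough sketch, assuming you want the tail of the previous run for every container in the pod; previous_logs is a hypothetical name and the 50-line tail is an arbitrary choice.

```shell
# Hypothetical helper: print the tail of the previous run's logs for every
# container in a pod. Usage: previous_logs <pod-name> [namespace]
previous_logs() {
  pod="$1"
  ns="${2:-default}"
  # Container names come from the pod spec, space-separated.
  names=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[*].name}')
  for c in $names; do
    echo "=== $c (previous run) ==="
    # kubectl reports an error for containers that have never restarted;
    # keep going so the remaining containers are still shown.
    kubectl logs "$pod" -n "$ns" -c "$c" --previous --tail=50 || true
  done
}
```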

4. Diagnose networking with a temporary pod

When a service isn’t reachable, don’t guess. Spin up a temporary pod in the same namespace and test connectivity:

kubectl run -it --rm debug --image=busybox -- sh

Inside, try:
- wget <service-name>.<namespace>.svc.cluster.local
- nslookup kubernetes.default
- telnet <service-name> 80 (or nc <service-name> 80 if the image has no telnet)

This tells you whether the problem sits in the network layer (DNS or Service routing) or in your application’s exposed port or ingress config.
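A Service's full in-cluster DNS name has the form <service-name>.<namespace>.svc.<cluster-domain>. The small sketch below builds that name so it is harder to mistype. It assumes the default cluster domain, cluster.local, and svc_fqdn is a hypothetical helper name.

```shell
# Hypothetical helper: build the in-cluster DNS name of a Service.
# Assumes the default cluster domain, cluster.local.
# Usage: svc_fqdn <service-name> [namespace]
svc_fqdn() {
  echo "$1.${2:-default}.svc.cluster.local"
}

svc_fqdn web staging
```

Inside the debug pod you could then run nslookup "$(svc_fqdn web staging)" to test DNS for exactly the name your application uses.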

5. Resource quotas: the silent killer

A pod may never be created if its namespace has a resource quota that is already used up. Check with:

kubectl describe quota -n <namespace>

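Compare the Used and Hard columns. If the quota allows 4 pods and 4 already exist, the API server rejects the new pod outright, so it never shows up as Pending. Instead, the owning ReplicaSet records a FailedCreate event with a message like “exceeded quota”. A “failed quota” message usually means the pod omits a request or limit that the quota requires.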
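The same Used and Hard numbers are available in the quota object's status, which makes the check scriptable. This is a sketch under assumptions: pod_quota_headroom is a hypothetical name, the quota is assumed to express its pod count as a plain integer, and the messages are illustrative.

```shell
# Hypothetical helper: report pod headroom for one ResourceQuota.
# Usage: pod_quota_headroom <quota-name> <namespace>
pod_quota_headroom() {
  quota="$1"
  ns="$2"
  # status.used and status.hard hold the numbers `describe quota` prints.
  counts=$(kubectl get resourcequota "$quota" -n "$ns" \
    -o jsonpath='{.status.used.pods} {.status.hard.pods}')
  used="${counts% *}"
  hard="${counts#* }"
  if [ -z "$used" ] || [ -z "$hard" ]; then
    echo "quota $quota does not limit pods"
  elif [ "$used" -ge "$hard" ]; then
    echo "quota $quota is full: $used of $hard pods used"
    return 1
  else
    echo "quota $quota has room: $used of $hard pods used"
  fi
}
```

The non-zero return on a full quota lets a deploy script stop before it creates workloads that can never start.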

6. Use kubectl top for live metrics

When a pod is running but slow, don’t guess. Check real-time CPU and memory usage (this requires the metrics-server add-on in the cluster):

kubectl top pod
kubectl top node

If CPU usage sits at the container’s limit or memory climbs steadily, you’re probably seeing CPU throttling, a memory leak, or limits set too low. Compare the numbers with resources.requests and resources.limits in the pod spec.
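To spot the heavy pods quickly, the output of kubectl top pod can be filtered. A minimal sketch, assuming memory is reported in Mi (for example 512Mi), which is the usual case but not guaranteed; pods_over_memory is a hypothetical name.

```shell
# Hypothetical helper: list pods using at least <threshold> Mi of memory.
# Usage: pods_over_memory <threshold-Mi> [namespace]
# Assumes `kubectl top pod` prints memory as e.g. "512Mi"; rows in other
# units are skipped rather than guessed at.
pods_over_memory() {
  limit="$1"
  ns="${2:-default}"
  kubectl top pod -n "$ns" --no-headers |
    awk -v limit="$limit" '
      $3 ~ /Mi$/ {
        mem = $3
        sub(/Mi$/, "", mem)
        if (mem + 0 >= limit + 0) print $1, $3
      }'
}
```

For example, pods_over_memory 256 staging prints the name and memory of every pod in staging at or above 256Mi.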

7. Debug with kubectl exec and kubectl cp

Sometimes you need to look inside a running container—not just its logs.

kubectl exec -it <pod-name> -- /bin/sh

Inside, check:
- Environment variables: env | grep -i db_host
- Files: ls /app/config
- Processes: ps aux

If you need to copy a file out for analysis (kubectl cp needs a tar binary inside the container):

kubectl cp <pod-name>:/app/log/debug.log ./debug.log

This beats scrolling through logs when you need to inspect a configuration file or a dump.

8. The kubectl context trap

One of the most common “troubleshooting” issues is checking the wrong cluster. Always verify your context:

kubectl config current-context
kubectl config get-contexts

If you’re debugging a crash in production but your context points to staging, you’ll waste hours. I’ve done it. Everyone has.
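One way to make the check hard to forget is a guard at the top of any debugging script. This is a sketch only: require_context is a hypothetical name, and the context name you pass in is whatever kubectl config get-contexts lists for your cluster.

```shell
# Hypothetical guard: stop early when kubectl points at an unexpected cluster.
# Usage: require_context <expected-context-name>
require_context() {
  expected="$1"
  current=$(kubectl config current-context)
  if [ "$current" != "$expected" ]; then
    echo "refusing to continue: context is '$current', expected '$expected'" >&2
    return 1
  fi
  echo "context OK: $current"
}
```

A script can then start with require_context production || exit 1, using your own context name in place of production.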

9. Worst case: kubectl delete pod (usually fine)

If a pod is stuck in Terminating, or keeps cycling through CrashLoopBackOff, and you’ve exhausted everything else, force delete it:

kubectl delete pod <pod-name> --grace-period=0 --force

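Then let the controller recreate it. For a pod owned by a Deployment, this gives you a fresh pod without changing the replica count. Be more careful with StatefulSets: a force delete removes the pod from the API without waiting for the kubelet to confirm that the container has stopped, so two copies of the same identity can briefly run at once. Also remember that a fresh pod will crash again if the underlying cause is still there.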

10. When nothing works: restart the kubelet (on the node)

If you’re managing nodes directly and see a node stuck in NotReady or persistent pod scheduling failures, sometimes the kubelet gets wedged:

sudo systemctl restart kubelet

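Wait 30–60 seconds, then check kubectl get nodes. Treat this as a last resort: running containers normally keep running while the kubelet restarts, but the node can briefly report NotReady, and a restart can hide the evidence you need. Look at journalctl -u kubelet first. It is still a valid troubleshooting step for edge cases.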

The real trick: be systematic

Kubernetes failures tend to follow patterns: networking issues, resource exhaustion, wrong container image, or misconfigured RBAC. Approach each problem in order: events → describe → logs → exec → networking test. That sequence resolves most issues before they become incidents.
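The first three steps of that sequence are read-only, so they can be bundled into one command. A minimal sketch, assuming a single pod is under suspicion; triage is a hypothetical name and the 50-line log tail is an arbitrary choice.

```shell
# Hypothetical helper: run events, describe, and logs for one pod, in order.
# Usage: triage <pod-name> [namespace]
triage() {
  pod="$1"
  ns="${2:-default}"
  echo "== events =="
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by='.lastTimestamp'
  echo "== describe =="
  kubectl describe pod "$pod" -n "$ns"
  echo "== logs (current run, then previous run) =="
  kubectl logs "$pod" -n "$ns" --all-containers --tail=50
  # The previous run only exists after at least one restart.
  kubectl logs "$pod" -n "$ns" --all-containers --previous --tail=50 || true
}
```

The exec and networking steps are left out on purpose, since they are interactive and depend on what the first three steps reveal.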

Keep this list in your notes or your IDE’s scratch file. Next time a pod goes down, you’ll already know where to look.
