
K8s by Example: Debugging Pods

When a Pod fails, work through this systematic checklist: check status, read events, inspect logs, then get inside. Most issues fall into three categories: image problems, application crashes, or resource constraints.


Start here. Status tells you where to look next. ImagePullBackOff = image problem. CrashLoopBackOff = app crashes at startup. Pending = scheduling issue. Running but not ready = probe failing.

$ kubectl get pods -o wide
NAME        READY   STATUS             RESTARTS   AGE   NODE
api-6f7d8   0/1     ImagePullBackOff   0          2m    worker-1
db-8k2x9    0/1     CrashLoopBackOff   5          10m   worker-2
web-3n4m5   0/1     Pending            0          5m    <none>
cache-9z8y  1/1     Running            0          1h    worker-1
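The fourth case - Running but 0/1 READY - means the readiness probe is failing. A quick way to see the probe definition and its recent failures (pod name from the listing above; the grep pattern assumes the standard describe output layout):

```shell
# Show the readiness probe configuration for the Pod
$ kubectl describe pod cache-9z8y | grep -A3 "Readiness:"

# Probe failures show up as Unhealthy events
$ kubectl get events --field-selector involvedObject.name=cache-9z8y,reason=Unhealthy
```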

The Events section at the bottom of the describe output is gold: it shows the sequence of what happened. For ImagePullBackOff, you’ll see the exact error - wrong tag, missing private-registry credentials, or network issues.

$ kubectl describe pod api-6f7d8 | tail -15
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  3m                 default-scheduler  Assigned to worker-1
  Normal   Pulling    90s (x4 over 3m)   kubelet            Pulling image "api:v2.1"
  Warning  Failed     88s (x4 over 3m)   kubelet            Failed to pull image "api:v2.1": rpc error: code = NotFound
  Warning  Failed     88s (x4 over 3m)   kubelet            Error: ErrImagePull
  Normal   BackOff    75s (x6 over 3m)   kubelet            Back-off pulling image "api:v2.1"
  Warning  Failed     75s (x6 over 3m)   kubelet            Error: ImagePullBackOff
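Here the tag doesn’t exist (NotFound). If the cause were registry auth instead, the usual fix is an image pull secret; a sketch, assuming a hypothetical private registry at registry.example.com:

```shell
# Create a docker-registry secret holding the registry credentials
$ kubectl create secret docker-registry regcred \
    --docker-server=registry.example.com \
    --docker-username=deploy \
    --docker-password='<password>'

# Then reference it from the Pod spec:
#   spec:
#     imagePullSecrets:
#     - name: regcred
```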

For CrashLoopBackOff, the app started but exited. --previous shows logs from the last crashed container. Without it, you might see an empty log if the new container hasn’t written anything yet.

$ kubectl logs db-8k2x9 --previous
2024-01-15 10:30:00 Starting database...
2024-01-15 10:30:01 Loading config from /etc/db/config.yaml
2024-01-15 10:30:01 ERROR: Config file not found
2024-01-15 10:30:01 Exiting with code 1

$ kubectl logs db-8k2x9 -f
$ kubectl logs db-8k2x9 --since=5m
$ kubectl logs db-8k2x9 --tail=100
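For multi-container Pods, logs needs -c to pick a container; two variants worth knowing (the container name here is illustrative):

```shell
# Logs from one specific container in the Pod
$ kubectl logs db-8k2x9 -c init-schema

# Logs from every container, each line prefixed with its source
$ kubectl logs db-8k2x9 --all-containers --prefix
```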

Pending means the scheduler can’t place the Pod. Check events for why: insufficient CPU/memory, no nodes match affinity rules, or all nodes are tainted. The NODE column shows <none>.

$ kubectl describe pod web-3n4m5 | tail -10
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  5m    default-scheduler  0/3 nodes are available:
           1 Insufficient cpu, 2 node(s) had taint
           {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

$ kubectl describe nodes | grep -A5 "Allocated resources"
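When taints are the blocker, list them per node; the Pod needs a matching toleration, or the taint has to go. A sketch - the node name below is illustrative, and remove a taint only if the node really should accept general workloads:

```shell
# Show the taints on every node in one table
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Remove a taint (the trailing "-" deletes it)
$ kubectl taint nodes worker-3 node-role.kubernetes.io/master:NoSchedule-
```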

Pod is Running but something’s wrong? Get inside. Check if the process is listening, test network connectivity, verify mounted files exist. Use -c container-name for multi-container Pods.

$ kubectl exec -it cache-9z8y -- sh

/ $ ps aux
PID   USER     COMMAND
1     redis    redis-server *:6379

/ $ netstat -tlnp
Proto Local Address   State       PID/Program
tcp   0.0.0.0:6379   LISTEN      1/redis-server

/ $ nslookup api-service
Server:    10.96.0.10
Address:   10.96.0.10:53
Name:      api-service.default.svc.cluster.local
Address:   10.100.45.23

/ $ wget -qO- http://api-service:8080/health

Can’t exec because the container keeps crashing? Use ephemeral debug containers. This attaches a new container to the Pod’s namespaces so you can inspect the environment even when the main container won’t start.

$ kubectl debug api-6f7d8 -it --image=busybox --target=api
Targeting container "api". If you don't see processes from this container it may be because the container runtime doesn't support this feature.
/ $

/ $ ls -la /etc/config/
total 12
drwxr-xr-x    2 root     root          4096 Jan 15 10:30 .
-rw-r--r--    1 root     root           156 Jan 15 10:30 database.conf

/ $ env | grep DATABASE
DATABASE_URL=postgres://db:5432/myapp
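Another option when the original container dies instantly: kubectl debug can copy the Pod and override the container's command with a shell, so you can look around before the app ever runs. A sketch:

```shell
# Clone the Pod, replacing the api container's command with a shell
$ kubectl debug api-6f7d8 -it --copy-to=api-debug \
    --container=api -- sh

# Delete the copy when you're done
$ kubectl delete pod api-debug
```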

Debug at the node level. Useful for checking kubelet logs, disk space, or network issues that affect all Pods on a node. The debug Pod gets host PID/network namespace access.

$ kubectl debug node/worker-1 -it --image=busybox
Creating debugging pod node-debugger-worker-1-abc123

/ $ chroot /host

$ journalctl -u kubelet --since "10 minutes ago" | tail -50

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G   95G    5G  95% /

$ crictl ps -a | grep -i exited

Network debugging needs tools most app images don’t have. netshoot has curl, dig, nmap, tcpdump, and more. Great for testing Service connectivity, DNS, and network policies.

$ kubectl run debug --rm -it --image=nicolaka/netshoot -- bash

bash-5.1$ dig +short api-service.default.svc.cluster.local
10.100.45.23

bash-5.1$ nc -zv api-service 8080
Connection to api-service 8080 port [tcp/*] succeeded!

bash-5.1$ curl -v http://api-service:8080/health

bash-5.1$ mtr --report api-service

bash-5.1$ tcpdump -i eth0 port 8080 -w /tmp/capture.pcap
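One subtlety when testing NetworkPolicies: policies select Pods by label, so a bare debug Pod may be blocked while the real app gets through (or vice versa). Launching netshoot with the app's labels exercises the actual policy path (the label below is illustrative):

```shell
# Run the debug Pod with the same labels the policy matches on
$ kubectl run debug --rm -it --image=nicolaka/netshoot \
    --labels=app=web -- bash
```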

Cluster-wide view of what’s happening. Events disappear after 1 hour by default. Filter by namespace, type, or involved object. Warnings are usually the interesting ones.

$ kubectl get events --sort-by='.lastTimestamp' | tail -20

$ kubectl get events --field-selector type=Warning

$ kubectl get events --field-selector involvedObject.name=api-6f7d8

$ kubectl get events -w
0s    Warning   FailedMount   pod/api-6f7d8   MountVolume.SetUp failed
0s    Normal    Pulled        pod/db-8k2x9    Container image pulled

Resource issues cause OOMKills and throttling. top shows actual usage (requires metrics-server). Compare against limits. High restart count often means OOMKilled.

$ kubectl top pods --sort-by=memory
NAME        CPU(cores)   MEMORY(bytes)
api-6f7d8   250m         512Mi
db-8k2x9    100m         1024Mi
web-3n4m5   50m          128Mi

$ kubectl get pod db-8k2x9 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
OOMKilled

$ kubectl describe pod api-6f7d8 | grep -A2 Limits
    Limits:
      cpu:     500m
      memory:  256Mi
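Here usage (512Mi) is well past the limit (256Mi), so OOMKills are expected. Raise the limit on the owning workload rather than the Pod - Pod resources are largely immutable. A sketch, assuming the Pod belongs to a Deployment named api (hypothetical):

```shell
# Raise the memory limit on the Deployment; this triggers a rolling restart
$ kubectl set resources deployment api \
    --limits=memory=1Gi --requests=memory=512Mi
```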

Need to connect to a Pod directly? Port-forward creates a tunnel from localhost. Useful for database access, testing endpoints, or connecting debuggers. Stays open until Ctrl+C.

$ kubectl port-forward pod/db-8k2x9 5432:5432 &
Forwarding from 127.0.0.1:5432 -> 5432

$ psql -h localhost -p 5432 -U admin mydb

$ kubectl port-forward svc/api-service 8080:80

$ kubectl port-forward pod/api-6f7d8 8080:80 9090:9090
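Port-forward binds to 127.0.0.1 by default. To reach the tunnel from another machine - say, a teammate hitting your workstation - bind other interfaces explicitly:

```shell
# Listen on all local interfaces instead of just localhost
$ kubectl port-forward --address 0.0.0.0 svc/api-service 8080:80
```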
