
K8s by Example: etcd Basics

etcd is Kubernetes’ distributed key-value store for all cluster state: pods, services, secrets, configmaps. If etcd is slow or unavailable, the entire control plane stops working. Understanding etcd helps debug mysterious API server issues and prevent data loss.

terminal

etcd runs as a static pod on control plane nodes. In HA clusters you run 3 or 5 etcd members for fault tolerance: a majority (quorum) must stay available, so 3 members tolerate 1 failure and 5 tolerate 2.

$ kubectl get pods -n kube-system -l component=etcd
NAME                      READY   STATUS    RESTARTS   AGE
etcd-control-plane-1      1/1     Running   0          30d
etcd-control-plane-2      1/1     Running   0          30d
etcd-control-plane-3      1/1     Running   0          30d
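
On the control-plane node itself, the static pod manifest is what defines etcd. A quick look, assuming kubeadm's default paths (/etc/kubernetes/manifests, /var/lib/etcd):

# Run on a control-plane node; paths are kubeadm defaults
$ sudo grep -- --data-dir /etc/kubernetes/manifests/etcd.yaml
    - --data-dir=/var/lib/etcd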

Check etcd logs for errors. Common issues: frequent leader elections, slow disks, and peer communication failures.

$ kubectl logs -n kube-system etcd-control-plane-1 | tail -20
elected leader at term 123
synced raft log entries
WARNING: slow fdatasync (500ms)    # Disk too slow
rejected connection from peer      # Network issue
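
A quick way to gauge election churn, assuming the "elected leader" wording above (the exact message varies by etcd version):

# Count leader elections in the last hour; more than a couple usually
# points at slow disks or a flaky network between members
$ kubectl logs -n kube-system etcd-control-plane-1 --since=1h | grep -ci "elected leader"
1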
terminal

Use etcdctl to check cluster health. It requires client certificates for authentication. Most managed Kubernetes services don't expose etcd directly.

$ ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    endpoint health

https://127.0.0.1:2379 is healthy
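
To avoid repeating the certificate flags on every call, etcdctl also reads them from environment variables; the shorter commands later in this guide assume something equivalent has been set:

$ export ETCDCTL_API=3
$ export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
$ export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
$ export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt
$ export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key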

Check the member list. Every member should report started.

$ etcdctl member list --write-out=table
+------------------+---------+------+
|        ID        | STATUS  | NAME |
+------------------+---------+------+
| 8e9e05c52164694d | started | cp-1 |
| 91bc3c398fb3c146 | started | cp-2 |
| fd422379fda50e48 | started | cp-3 |
+------------------+---------+------+
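
member list does not say which member is the leader. In a healthy cluster exactly one member is the leader, and endpoint status --cluster reports it (member addresses are illustrative; output columns trimmed):

$ etcdctl endpoint status --cluster --write-out=table
+-----------------------+------------------+-----------+
|       ENDPOINT        |        ID        | IS LEADER |
+-----------------------+------------------+-----------+
| https://10.0.0.1:2379 | 8e9e05c52164694d | false     |
| https://10.0.0.2:2379 | 91bc3c398fb3c146 | false     |
| https://10.0.0.3:2379 | fd422379fda50e48 | true      |
+-----------------------+------------------+-----------+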
terminal

Critical: Back up etcd regularly. This is your only way to recover cluster state after catastrophic failure. Automate this with a CronJob or external backup system.

$ etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

Snapshot saved at /backup/etcd-20240115.db

Verify backup integrity. Store backups off-cluster (S3, GCS, etc.). Test restoration regularly.

$ etcdctl snapshot status /backup/etcd-20240115.db --write-out=table
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 5c4b94b7 |  1234567 |      15423 |     50 MB  |
+----------+----------+------------+------------+

# Copy to off-cluster storage
$ aws s3 cp /backup/etcd-20240115.db s3://my-backups/
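
A minimal backup script sketch that a CronJob or plain cron could run on a control-plane node; the bucket name, retention count, and paths are placeholders, and the ETCDCTL_* environment variables are assumed to be set:

#!/usr/bin/env bash
# Sketch: snapshot etcd, verify the file, ship it off-cluster, prune old copies
set -euo pipefail

SNAPSHOT=/backup/etcd-$(date +%Y%m%d-%H%M).db

etcdctl snapshot save "$SNAPSHOT"
etcdctl snapshot status "$SNAPSHOT" --write-out=table   # fails if the file is unreadable
aws s3 cp "$SNAPSHOT" s3://my-backups/etcd/             # bucket is a placeholder

# Keep only the 7 newest local snapshots
ls -1t /backup/etcd-*.db | tail -n +8 | xargs -r rm --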
terminal

Restore from backup during disaster recovery. The restore writes a fresh data directory; run it on every member, each with its own --name and peer URLs, then restart them all (peer addresses in the example are illustrative).

# Stop the kubelet so the static pods (kube-apiserver, etcd) stop
$ systemctl stop kubelet

# Restore snapshot
$ etcdctl snapshot restore /backup/etcd-20240115.db \
    --data-dir=/var/lib/etcd-restored \
    --name=cp-1 \
    --initial-cluster=cp-1=https://10.0.0.1:2380,cp-2=https://10.0.0.2:2380,cp-3=https://10.0.0.3:2380 \
    --initial-advertise-peer-urls=https://10.0.0.1:2380

# Update etcd manifest to use new data-dir
$ vi /etc/kubernetes/manifests/etcd.yaml
# Change --data-dir (and the matching hostPath volume) to /var/lib/etcd-restored

# Restart kubelet
$ systemctl start kubelet
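
The same restore runs on every member with member-specific flags; for example, on the second control-plane node (addresses follow the illustrative ones above):

# On cp-2 (10.0.0.2): same snapshot, its own --name and peer URL
$ etcdctl snapshot restore /backup/etcd-20240115.db \
    --data-dir=/var/lib/etcd-restored \
    --name=cp-2 \
    --initial-cluster=cp-1=https://10.0.0.1:2380,cp-2=https://10.0.0.2:2380,cp-3=https://10.0.0.3:2380 \
    --initial-advertise-peer-urls=https://10.0.0.2:2380

# Once every member is back, confirm cluster health
$ etcdctl endpoint health --cluster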
terminal

etcd database grows over time and needs defragmentation. Defrag reclaims disk space from deleted keys. Schedule during maintenance windows as it briefly blocks writes.

# Check database size
$ etcdctl endpoint status --write-out=table
+------------------------+---------+-----------+
|        ENDPOINT        | DB SIZE | IS LEADER |
+------------------------+---------+-----------+
| https://127.0.0.1:2379 | 150 MB  | true      |
+------------------------+---------+-----------+

# Defragment (run on each member)
$ etcdctl defrag --endpoints=https://127.0.0.1:2379
Finished defragmenting etcd member

Also compact old revisions to prevent unbounded growth. Kubernetes API server usually handles this automatically.

# Get current revision
$ etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision'
1234567

# Compact old revisions (keep last 10000)
$ etcdctl compact 1224567

# Then defrag to reclaim space
$ etcdctl defrag
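
Defrag is per-member, so run it against every endpoint, one at a time (endpoints below are illustrative). If the database ever hit its quota, etcd raises a NOSPACE alarm and rejects writes; disarm it once compaction and defrag have reclaimed space:

# Defragment each member in turn (it briefly blocks that member)
$ for ep in https://10.0.0.1:2379 https://10.0.0.2:2379 https://10.0.0.3:2379; do
      etcdctl --endpoints=$ep defrag
  done

# If a NOSPACE alarm fired, clear it after compaction + defrag
$ etcdctl alarm list
$ etcdctl alarm disarm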
terminal

Monitor etcd performance via API server metrics. Key metrics: etcd_request_duration_seconds (reported by the API server) and etcd_mvcc_db_total_size_in_bytes (reported by etcd itself). Slow etcd = slow API server.

# Check etcd latency from API server perspective
$ kubectl get --raw /metrics | grep etcd_request_duration
etcd_request_duration_seconds_bucket{operation="get",le="0.005"} 123456
etcd_request_duration_seconds_bucket{operation="get",le="0.5"} 234567
# Most GETs should complete in <5ms

High etcd latency causes API server slowness. Common causes: slow disk (use SSD), network latency between members, large values (secrets/configmaps).
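
To see what is actually filling the store, count keys per resource type; a rough sketch that assumes direct etcdctl access and the default /registry key prefix (counts shown are illustrative):

# Events and Secrets are common offenders
$ etcdctl get /registry --prefix --keys-only | grep -v '^$' | \
    awk -F/ '{print $3}' | sort | uniq -c | sort -rn | head -3
   8123 events
   2456 pods
   1031 secrets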

# Prometheus queries for etcd health
# Request latency (should be <100ms)
histogram_quantile(0.99,
  sum by (le) (rate(etcd_request_duration_seconds_bucket[5m])))

# Database size (alert if approaching quota)
etcd_mvcc_db_total_size_in_bytes / 1024 / 1024

# Leader changes (should be rare)
rate(etcd_server_leader_changes_seen_total[1h])
etcd-performance.yaml

etcd performance requirements: use SSDs (not spinning disks), ensure low-latency network between members, allocate sufficient memory. etcd is CPU and I/O intensive.

# etcd node requirements:
# - SSD storage (50+ sequential IOPS minimum; 500+ for heavily loaded clusters)
# - <10ms network latency between members
# - 2-4 CPU cores dedicated
# - 8GB+ RAM for large clusters
# - Separate disk from OS (reduces I/O contention)

# Check disk sync latency (etcd's WAL calls fdatasync on every commit)
$ fio --name=etcd-disk-test --size=1G --runtime=60 \
    --filename=/var/lib/etcd/fio-test \
    --ioengine=sync --rw=write --fdatasync=1

# In fio's output, check the fsync/fdatasync latency percentiles:
# the 99th percentile should stay well under 10ms, otherwise etcd
# logs "slow fdatasync" warnings like the one shown earlier
