K8s by Example: etcd Basics
| etcd is Kubernetes’ distributed key-value store for all cluster state: pods, services, secrets, configmaps. If etcd is slow or unavailable, the entire control plane stops working. Understanding etcd helps debug mysterious API server issues and prevent data loss. |
| terminal | |
| On kubeadm-style clusters, etcd runs as a static pod on each control plane node. HA clusters run 3 or 5 etcd members for fault tolerance: every write needs a majority (quorum), so 3 members tolerate 1 failure and 5 tolerate 2. | |
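A minimal sketch of both points, assuming a kubeadm-style cluster where the etcd static pods carry the label `component=etcd` in `kube-system`. The function names (`etcd_pods`, `quorum`, `fault_tolerance`) are made up for this example.

```shell
# List the etcd static pods. Assumes kubeadm's component=etcd label.
etcd_pods() {
  kubectl get pods -n kube-system -l component=etcd -o wide
}

# Majority needed for quorum in a cluster of $1 members.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of member failures a cluster of $1 members can survive.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

`quorum 3` prints 2 and `fault_tolerance 4` prints 1, which is why an even member count adds no extra tolerance over the next lower odd count.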
| Check etcd logs for errors. Common issues: leader elections, slow disk, peer communication failures. | |
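One way to pull and filter the logs, as a sketch. The pod name is passed in because it depends on the node name, and the grep patterns are only illustrative guesses at lines worth a closer look, not an exhaustive list of etcd messages.

```shell
# Print recent etcd log lines for the pod named in $1.
etcd_logs() {
  kubectl logs -n kube-system "$1" --tail=200
}

# Filter stdin down to lines that often indicate trouble:
# errors and warnings, leader elections, slow disk or slow requests.
etcd_log_problems() {
  grep -iE 'error|warn|elected|leader changed|lost leader|took too long|slow fdatasync'
}
```

For example, `etcd_logs etcd-<node-name> | etcd_log_problems`, with the pod name taken from the pod listing.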
| terminal | |
| Use etcdctl to query etcd directly. | |
| Check member list and leader status. In a healthy cluster, exactly one member is the leader. | |
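A sketch of these checks with `etcdctl`, run on a control plane node. The endpoint and certificate paths are assumed kubeadm defaults and will differ on other installations. The settings are passed through `ETCDCTL_*` environment variables, which `etcdctl` reads in place of the matching flags. The function names are hypothetical.

```shell
# Assumed kubeadm defaults; override these for your cluster.
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS="${ETCDCTL_ENDPOINTS:-https://127.0.0.1:2379}"
export ETCDCTL_CACERT="${ETCDCTL_CACERT:-/etc/kubernetes/pki/etcd/ca.crt}"
export ETCDCTL_CERT="${ETCDCTL_CERT:-/etc/kubernetes/pki/etcd/server.crt}"
export ETCDCTL_KEY="${ETCDCTL_KEY:-/etc/kubernetes/pki/etcd/server.key}"

# Is the configured endpoint healthy?
etcd_health() { etcdctl endpoint health; }

# Who is in the cluster?
etcd_members() { etcdctl member list --write-out=table; }

# Per-member status; the table includes an IS LEADER column.
etcd_status() { etcdctl endpoint status --cluster --write-out=table; }
```

In the `etcd_status` table, a healthy cluster shows `true` in the leader column for exactly one member.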
| terminal | |
| Critical: Back up etcd regularly. This is your only way to recover cluster state after catastrophic failure. Automate this with a CronJob or external backup system. | |
| Verify backup integrity. Store backups off-cluster (S3, GCS, etc.). Test restoration regularly. | |
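A sketch of taking and checking a snapshot. It assumes the `ETCDCTL_*` endpoint and certificate variables are already exported, and that `etcdutl` is installed; on older etcd releases the same `snapshot status` subcommand lives in `etcdctl`. The function names and the timestamped file naming are choices made for this example.

```shell
# Save a snapshot to a timestamped file under directory $1 and print its path.
# etcdctl output goes to stderr so that stdout carries only the path.
etcd_backup() {
  snap="$1/etcd-$(date +%Y%m%d-%H%M%S).db"
  etcdctl snapshot save "$snap" >&2 && echo "$snap"
}

# Print hash, revision, key count and size of snapshot file $1.
etcd_backup_verify() {
  etcdutl --write-out=table snapshot status "$1"
}
```

A typical run would be `snap=$(etcd_backup /var/backups/etcd) && etcd_backup_verify "$snap"`, followed by copying the file off-cluster; the backup directory here is only a placeholder.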
| terminal | |
| Restore from a backup during disaster recovery. The restore writes a new data directory. Restore every member from the same snapshot, point each etcd at its new directory, and restart it. | |
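A sketch of the per-member restore step, assuming `etcdutl` is available (older etcd releases expose the same `snapshot restore` subcommand through `etcdctl`). The function name and argument order are hypothetical; the member name, peer URL and initial cluster string must match your own cluster.

```shell
# Restore one member from a snapshot into a new, empty data directory.
#   $1 snapshot file      $2 new data directory   $3 member name
#   $4 member peer URL    $5 initial cluster, e.g. "m1=https://10.0.0.1:2380,m2=..."
etcd_restore_member() {
  etcdutl snapshot restore "$1" \
    --data-dir "$2" \
    --name "$3" \
    --initial-advertise-peer-urls "$4" \
    --initial-cluster "$5"
}
```

Run this once on each member with the same snapshot file, then update the etcd static pod manifest to use the new data directory before restarting.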
| terminal | |
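| The etcd database grows over time. Compaction discards old key revisions, and defragmentation then returns the freed space to the filesystem. Defragment one member at a time during a maintenance window, because a member cannot serve reads or writes while it is being defragmented. | |
| Compact old revisions to prevent unbounded growth. The Kubernetes API server usually handles compaction automatically, so manual compaction is rarely needed. | |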
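A sketch of manual maintenance against a single endpoint, assuming the `ETCDCTL_*` endpoint and certificate variables are already exported. The function names are made up here. Note that compacting to the current revision discards all older history.

```shell
# Current revision of the endpoint, parsed from the JSON status output.
etcd_revision() {
  etcdctl endpoint status --write-out=json \
    | grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -n 1
}

# Compact everything before the current revision, then defragment.
# Stops early if the revision cannot be read or compaction fails.
etcd_compact_and_defrag() {
  rev="$(etcd_revision)"
  [ -n "$rev" ] || return 1
  etcdctl compact "$rev" && etcdctl defrag
}
```

Point `ETCDCTL_ENDPOINTS` at one member at a time when defragmenting, so the rest of the cluster keeps serving requests.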
| terminal | |
| Monitor etcd performance via API server metrics, such as the etcd_request_duration_seconds latency histogram. | |
| High etcd latency causes API server slowness. Common causes: slow disk (use SSD), network latency between members, large values (secrets/configmaps). | |
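A rough sketch for reading etcd request latency from the API server's metrics endpoint. It assumes your account may read `/metrics` and that the API server exposes the `etcd_request_duration_seconds` histogram. The function names are hypothetical, and the average over all operations is only a coarse signal.

```shell
# Print the raw etcd request latency series from the API server.
etcd_request_metrics() {
  kubectl get --raw /metrics | grep '^etcd_request_duration_seconds'
}

# Read Prometheus text from stdin and print the mean etcd request
# latency in seconds (total of all _sum values / total of all _count values).
etcd_avg_latency() {
  awk '/^etcd_request_duration_seconds_sum/   { s += $NF }
       /^etcd_request_duration_seconds_count/ { c += $NF }
       END { if (c > 0) printf "%.4f\n", s / c }'
}
```

For example, `etcd_request_metrics | etcd_avg_latency`. A value that climbs over time is a hint to check disk and network latency on the etcd members.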
| etcd-performance.yaml | |
| etcd performance requirements: use SSDs rather than spinning disks, keep network latency between members low, and allocate sufficient memory. etcd is CPU and I/O intensive, and it is especially sensitive to disk write latency. | |