
K8s by Example: Cluster Autoscaler

Cluster Autoscaler automatically adjusts the number of nodes based on pending pods. It adds nodes when pods can’t be scheduled due to resource constraints, and removes underutilized nodes to save costs. Essential for handling variable workloads efficiently.

terminal

Cluster Autoscaler triggers scale-up when pods are Pending due to insufficient resources. It analyzes why pods can’t be scheduled and adds appropriate nodes.

$ kubectl get pods -A | grep Pending
default   web-abc123    0/1     Pending   0          5m

$ kubectl describe pod web-abc123 | grep -A5 Events
Events:
  Type     Reason            Message
  ----     ------            -------
  Warning  FailedScheduling  0/3 nodes are available:
           3 Insufficient cpu.
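
A Deployment like the following ends up in exactly this state once its replicas request more CPU than the current nodes have free (the name, image, and sizes are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # illustrative name
spec:
  replicas: 10                   # more replicas than current capacity can hold
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25
          resources:
            requests:
              cpu: 500m          # requests are what the scheduler and autoscaler count
              memory: 256Mi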

Check autoscaler status and decisions. The cluster-autoscaler-status ConfigMap shows current state and recent scale events.

$ kubectl get cm cluster-autoscaler-status -n kube-system -o yaml
status: |
  Cluster-autoscaler status at 2024-01-15 10:30:00
  NodeGroups:
    Name: default-pool
    Health: Healthy
    Min: 1  Max: 10  Current: 3
  ScaleUp:
    InProgress: true
    Reason: pod didn't fit on any node
cluster-autoscaler-deployment.yaml

Cluster Autoscaler runs as a Deployment. Key flags: --node-group-auto-discovery finds node pools, --scale-down-enabled allows removing nodes.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster

Configure scale-down behavior. --scale-down-delay-after-add prevents thrashing, --scale-down-unneeded-time sets how long a node must be underutilized.

          command:
            - ./cluster-autoscaler
            - --scale-down-enabled=true
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --scale-down-utilization-threshold=0.5
            - --max-node-provision-time=15m
            - --balance-similar-node-groups=true
terminal

Debug autoscaler decisions from logs. Look for ScaleUp and ScaleDown events, and reasons why scaling was blocked.

$ kubectl logs -n kube-system -l app=cluster-autoscaler | grep -i scale
Scale-up: setting group size to 5
Scale-down: removing node node-xyz
Pod web-abc123 can't be scheduled: Insufficient cpu
No scale-up: no node group can accommodate pod

Common issues: pod requests too large for any node type, affinity rules blocking scheduling, PDB preventing node drain.

# Pod requests more than max node size
$ kubectl logs ... | grep "no node group"
No scale-up: no node group can accommodate pod
  requesting cpu=8000m, max node has 4000m

# Pod blocked by affinity
Pod can't be scheduled due to node affinity

# Scale-down blocked by PDB
Cannot remove node: PodDisruptionBudget
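
The first case usually comes down to a resources block that no single node type can satisfy. A hypothetical pod illustrating it (an 8-CPU request against a 4-CPU maximum node size):

apiVersion: v1
kind: Pod
metadata:
  name: big-worker               # hypothetical pod
spec:
  containers:
    - name: worker
      image: worker:v1
      resources:
        requests:
          cpu: "8"               # larger than any 4-CPU node, so no scale-up can help
          memory: 16Gi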
pod-safe-to-evict.yaml

Prevent pods from blocking scale-down with the safe-to-evict annotation. Use it for pods that can be safely terminated (stateless or idempotent workloads).

apiVersion: v1
kind: Pod
metadata:
  name: worker
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
spec:
  containers:
    - name: worker
      image: worker:v1

By default, pods with local storage, pods not managed by a controller, and pods protected by a restrictive PDB block scale-down. Use the safe-to-evict annotation to override, or relax the PDB as sketched below.

# Pods that block scale-down by default:
# - Pods with local storage (emptyDir with data)
# - Pods not managed by controller (bare pods)
# - Pods with PDB preventing eviction
# - Pods with "safe-to-evict: false" annotation

# This pod uses emptyDir but is safe to evict
annotations:
  cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
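
For the PDB case, a budget that always leaves room for one eviction lets the autoscaler drain the node. A minimal sketch (the worker name and selector are placeholders):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: worker-pdb               # placeholder name
spec:
  maxUnavailable: 1              # always permits one eviction, so node drain can proceed
  selector:
    matchLabels:
      app: worker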
nodepool-priority.yaml

Use the priority expander to prefer cheaper node pools. Node groups with a higher priority value are tried first; the autoscaler falls back to lower-priority (more expensive) pools only when the preferred pools can't accommodate the pending pods.

# AWS Auto Scaling Group tags
k8s.io/cluster-autoscaler/node-template/label/node-type: spot
k8s.io/cluster-autoscaler/enabled: "true"

# Autoscaler expander: priority
# Define priority in ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |
    10:
      - on-demand-pool
    50:
      - spot-pool

Configure expander strategy to control which node pool is scaled. Options: random, most-pods, least-waste, priority.

# Use priority expander for cost optimization
--expander=priority

# Or least-waste for efficient bin-packing
--expander=least-waste

# Check current expander
$ kubectl get deploy cluster-autoscaler -n kube-system \
    -o jsonpath='{.spec.template.spec.containers[0].command}' \
    | tr ',' '\n' | grep expander
terminal

Mix spot and on-demand instances for cost savings. Spot instances are cheaper but can be terminated. Use for stateless, fault-tolerant workloads.

$ kubectl get nodes --show-labels | grep -E "spot|on-demand"
node-spot-1     node-type=spot
node-spot-2     node-type=spot
node-ondemand-1 node-type=on-demand

# Schedule critical workloads on on-demand
nodeSelector:
  node-type: on-demand

# Allow batch jobs on spot
nodeSelector:
  node-type: spot
tolerations:
  - key: "spot-instance"
    operator: "Exists"
terminal

Debug why pods aren’t triggering scale-up. Check if pods have node selectors, affinities, or tolerations that can’t be satisfied by any node pool.

# Check pod constraints
$ kubectl get pod pending-pod -o yaml | grep -A10 nodeSelector
$ kubectl get pod pending-pod -o yaml | grep -A20 affinity

# Check node pool templates
$ kubectl get cm cluster-autoscaler-status -n kube-system
NodeGroups:
  Name: gpu-pool
    Labels: accelerator=nvidia-tesla-t4
    Min: 0  Max: 5  Current: 0

# Pod needs GPU but label doesn't match
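
When a node group is scaled to zero, the autoscaler only learns its labels from node-template tags on the group. On AWS, tagging the Auto Scaling Group with the label the pod selects addresses this case (my-cluster and the label value below are placeholders):

# AWS Auto Scaling Group tags for gpu-pool
k8s.io/cluster-autoscaler/enabled: "true"
k8s.io/cluster-autoscaler/my-cluster: "owned"
k8s.io/cluster-autoscaler/node-template/label/accelerator: nvidia-tesla-t4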
