
K8s by Example: Pod Priority & Preemption

Pod priority determines scheduling order and preemption rights. Higher-priority pods are scheduled first and, when resources are scarce, can preempt (evict) lower-priority pods. Use it to protect critical workloads and to degrade gracefully under resource pressure.

priorityclass.yaml

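A PriorityClass defines a priority level: the higher the value, the higher the priority. User-defined classes accept any 32-bit integer up to 1,000,000,000 (1 billion), so the range runs from about -2.1 billion to 1 billion. Values above 1 billion are reserved for the built-in system classes.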

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical production workloads"
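Negative values are allowed too. As a sketch, a hypothetical opportunistic class (the name and value are illustrative, not from this page) sits below the default priority of 0, so pods that have no class, and no global default to inherit, can still preempt it:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: opportunistic            # hypothetical class name
value: -100                      # below the default priority 0: unclassed pods outrank it
globalDefault: false
description: "Opportunistic work that yields to everything else"
```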

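preemptionPolicy controls whether pods in this class may preempt lower-priority pods. PreemptLowerPriority (the default) allows it; Never disables it. Non-preempting pods are still placed ahead of lower-priority pods in the scheduling queue, but they wait for resources to free up instead of evicting anything.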

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-no-preempt
value: 1000000
preemptionPolicy: Never      # Won't evict other pods
description: "High priority but won't preempt"
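A workload opts into the non-preempting class by name, like any other class. A minimal sketch, assuming a hypothetical Job called report-builder with a hypothetical image report-builder:v1: it jumps the scheduling queue but waits for free capacity rather than preempting running pods.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: report-builder              # hypothetical Job name
spec:
  template:
    spec:
      # Queued ahead of lower-priority pending pods, but never preempts them
      priorityClassName: high-priority-no-preempt
      restartPolicy: Never          # Jobs require Never or OnFailure
      containers:
        - name: builder
          image: report-builder:v1  # hypothetical image
          resources:
            requests:
              cpu: "1"
              memory: "1Gi"
```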
priority-tiers.yaml

Create a priority tier system for your cluster. Common pattern: critical > production > staging > batch. Leave gaps between values for future tiers. Class names cannot start with system-, because that prefix is reserved for the built-in classes.

# Tier 1: Critical platform services (highest user-defined tier)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 2000000
---
# Tier 2: Production workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production
value: 1000000
---
# Tier 3: Staging/testing
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: staging
value: 500000
---
# Tier 4: Batch/background jobs
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch
value: 100000
deployment-priority.yaml

Reference the PriorityClass by name in your Pod spec. At admission, Kubernetes copies the class's value into spec.priority; the scheduler uses that value to order pending pods and to decide which pods may preempt which.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      priorityClassName: production    # High priority
      containers:
        - name: gateway
          image: api-gateway:v1
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
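Low tiers are referenced the same way. A sketch of a scheduled background workload on the batch tier, assuming a hypothetical CronJob named nightly-report and a hypothetical image nightly-report:v1; under pressure, its pods are among the first candidates for preemption.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report                 # hypothetical name
spec:
  schedule: "0 2 * * *"                # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          priorityClassName: batch     # lowest tier: preferred preemption victim
          restartPolicy: OnFailure
          containers:
            - name: report
              image: nightly-report:v1 # hypothetical image
              resources:
                requests:
                  cpu: "250m"
                  memory: "256Mi"
```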
terminal

View all priority classes including built-in system priorities. system-cluster-critical and system-node-critical are reserved for essential system pods.

$ kubectl get priorityclasses
NAME                      VALUE        GLOBAL-DEFAULT
system-cluster-critical   2000000000   false
system-node-critical      2000001000   false
production                1000000      false
staging                   500000       false
batch                     100000       true   # Default

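Check which priority class a pod uses. A pod created without an explicit priorityClassName gets the class marked globalDefault, if one exists, and priority 0 otherwise. Setting globalDefault does not change pods that already exist, so a pod created before the default was set, such as misc-pod below, can still show priority 0.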

$ kubectl get pods -o custom-columns=\
NAME:.metadata.name,\
PRIORITY:.spec.priorityClassName,\
VALUE:.spec.priority
NAME          PRIORITY     VALUE
api-gateway   production   1000000
worker-job    batch        100000
misc-pod      <none>       0
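To make one tier the cluster-wide default, as the earlier listing shows for batch, set globalDefault: true on that class; only one class should carry the flag. A sketch that redefines the batch tier from the tier file as the default:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch
value: 100000
globalDefault: true   # new pods without priorityClassName get this class
description: "Batch/background jobs; default for pods with no class"
```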
terminal

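When preemption occurs, the scheduler deletes lower-priority pods on a chosen node to make room and sets the pending pod's status.nominatedNodeName to that node while it waits. Events show which pods were preempted and by whom.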

$ kubectl get events | grep -i preempt
1m   Normal   Preempted   pod/worker-job-123
     Preempted by prod/api-gateway-456 on node worker-1

$ kubectl describe pod api-gateway-456 | grep -A5 Events
Events:
  Type    Reason     Message
  ----    ------     -------
  Normal  Scheduled  Preempting 2 pods: [worker-job-123, batch-456]
  Normal  Pulling    Pulling image "api-gateway:v1"

Preempted pods are terminated gracefully: they receive SIGTERM and are given their full terminationGracePeriodSeconds before being killed. The event recorded on each victim has Reason: Preempted.

$ kubectl get pods
NAME              READY   STATUS        AGE
api-gateway-456   1/1     Running       30s
worker-job-123    1/1     Terminating   5m   # Being preempted

$ kubectl describe pod worker-job-123 | grep -A3 Status
Status:       Terminating
Reason:       Preempting
Message:      Preempted by higher priority pod
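Because victims get their normal grace period, a low-priority worker can use that window to save progress. A sketch assuming a hypothetical image worker:v1 that ships a hypothetical /app/checkpoint.sh script; the preStop hook runs before SIGTERM is sent, and the grace period countdown covers both the hook and the app's shutdown.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-checkpointing        # hypothetical name
spec:
  priorityClassName: batch
  # Total time from the start of termination until SIGKILL,
  # shared by the preStop hook and the app's SIGTERM handling
  terminationGracePeriodSeconds: 60
  containers:
    - name: worker
      image: worker:v1              # hypothetical image
      lifecycle:
        preStop:
          exec:
            # hypothetical script that saves progress before shutdown
            command: ["/app/checkpoint.sh"]
```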
pdb-with-priority.yaml

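PodDisruptionBudgets interact with preemption, but only on a best-effort basis. The scheduler prefers victims whose removal keeps their PDBs satisfied; if no such victims exist, it preempts anyway and the budget is violated. A PDB lowers the chance that a service drops below its floor, but it does not guarantee it.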

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-gateway-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-gateway
# The scheduler avoids preempting api-gateway pods
# if that would leave fewer than 2 available, but
# PDBs are best effort: if no other victims free
# enough room, preemption proceeds anyway

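Combine priority with a PDB for layered protection. High priority means only pods of strictly higher priority can preempt yours. A PDB limits voluntary disruptions that go through the Eviction API, such as kubectl drain, and is honored on a best-effort basis during preemption.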

# Critical service: high priority + PDB
# In the Deployment's pod template (spec.template.spec):
priorityClassName: production
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-service-pdb
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: critical-service
avoid-cascade.yaml

Warning: priority misconfiguration can cause cascading evictions. If many high-priority pods go Pending at once, they can preempt entire lower-priority workloads. Use a ResourceQuota scoped to PriorityClass to cap how many high-priority pods a namespace can run.

# Limit high-priority pods per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: priority-quota
  namespace: production
spec:
  hard:
    pods: "100"
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values:
          - production
          - critical
# Max 100 high-priority pods in this namespace
# Prevents runaway high-priority deployments
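A quota only constrains namespaces that have one; a namespace with no matching quota can still use the class freely. The ResourceQuota admission plugin's limitedResources setting inverts that, so pods may use a listed class only in namespaces that have a quota covering it. A sketch of that configuration, passed to kube-apiserver with --admission-control-config-file; restricting only the production tier here is an illustrative choice, so extend values with other tiers as needed:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: "ResourceQuota"
    configuration:
      apiVersion: apiserver.config.k8s.io/v1
      kind: ResourceQuotaConfiguration
      limitedResources:
        - resource: pods
          matchScopes:
            # Pods in these classes need a matching quota in their namespace
            - scopeName: PriorityClass
              operator: In
              values: ["production"]
```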
