
K8s by Example: Production Checklist

A Deployment that works in dev often breaks in prod. This checklist covers what you need: resource constraints, health checks, graceful shutdown, high availability, and security hardening.

1-resources.yaml

Set resource requests and limits. Without requests, the scheduler can’t make good placement decisions and your Pod might land on an overloaded node. Without limits, one Pod with a memory leak can starve everything else on the node.

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi

Set the memory limit equal to the request. CPU can burst (set the limit higher than the request) because excess CPU is only throttled, but memory is different: a Pod using more than it requested is a prime eviction candidate under node pressure, and exceeding the limit means an OOMKill. Matching request and limit keeps memory behavior predictable. (The Guaranteed QoS class additionally requires the CPU request to equal the limit; with a bursting CPU limit, the Pod is Burstable, which is fine for most workloads.)

resources:
  requests:
    cpu: 100m
    memory: 512Mi
  limits:
    cpu: 1
    memory: 512Mi
2-probes.yaml

Configure readiness probe. Without it, traffic hits Pods before they’re ready. During rolling updates, users see errors. The probe should check if your app can actually serve requests, not just if the process is alive.

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

Configure liveness probe. It restarts stuck processes. Be careful: if your liveness check is too aggressive, you’ll restart healthy Pods under load. Make it simpler than readiness: just “is the process responding at all?”

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
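
The YAML only wires up the endpoints; the app has to implement them. Here is a minimal sketch in Node.js, assuming an Express server listening on 8080 (dependenciesHealthy is a placeholder for your real checks, such as a database ping): readiness verifies dependencies, liveness just confirms the process responds.

const express = require('express');
const app = express();

// Placeholder: replace with real dependency checks (DB ping, cache, ...).
async function dependenciesHealthy() {
  return true;
}

// Readiness: can this Pod serve real traffic right now?
app.get('/health/ready', async (req, res) => {
  if (await dependenciesHealthy()) {
    res.status(200).send('ready');
  } else {
    res.status(503).send('not ready');
  }
});

// Liveness: is the process responding at all? No dependency checks here,
// or a database outage would restart every Pod.
app.get('/health/live', (req, res) => {
  res.status(200).send('ok');
});

const server = app.listen(8080); // same server the shutdown handler (step 3) closes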
3-shutdown.yaml

Handle SIGTERM. Kubernetes sends SIGTERM, waits terminationGracePeriodSeconds, then sends SIGKILL. Your app must catch SIGTERM and finish in-flight requests. The 30s default is often not enough for long-running operations.

spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]

The preStop sleep trick: Load balancers take a few seconds to remove endpoints. Without the sleep, requests hit terminating Pods. The sleep gives the LB time to update, then your app handles SIGTERM cleanly.

process.on('SIGTERM', async () => {
  console.log('SIGTERM received, finishing requests...');
  // Stop accepting new connections; in-flight requests are allowed to finish.
  server.close(async () => {
    await db.disconnect();
    process.exit(0);
  });
});
4-replicas.yaml

Run multiple replicas. Single replica = single point of failure. During node maintenance, deploys, or crashes, you’re down. Minimum 2 for any service that matters, 3+ for critical paths.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
5-anti-affinity.yaml

Spread Pods across nodes. 3 replicas on 1 node = still a single point of failure. Anti-affinity ensures Pods land on different nodes. Use preferredDuringScheduling so Pods can still schedule if nodes are limited.

spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: api
                topologyKey: kubernetes.io/hostname
6-pdb.yaml

Set a Pod Disruption Budget. Without a PDB, kubectl drain can evict all your Pods at once during maintenance. A PDB limits voluntary disruptions and guarantees minimum availability: maxUnavailable: 1 means only one Pod can be evicted at a time.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: api
7-security.yaml

Drop root privileges. Running as root inside the container is unnecessary for most apps and dangerous if compromised. Set runAsNonRoot: true and specify a user ID.

spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000

Make the filesystem read-only. It prevents a compromised process from writing malware to disk. Use emptyDir for directories that need writes (tmp, cache). Drop all capabilities your app doesn’t need.

      containers:
        - name: api
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
8-rolling-update.yaml

Configure the rolling update strategy. The defaults (maxSurge: 25%, maxUnavailable: 25%) let a quarter of your Pods go down during a rollout. maxSurge = extra Pods during the update (speed). maxUnavailable = Pods that can be down (capacity). For zero downtime, set maxUnavailable: 0: with 3 replicas, Kubernetes starts one new Pod, waits for it to become ready, removes an old one, and repeats.

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
9-labels.yaml

Add operational labels. You’ll need these when everything’s on fire at 3am. app for selecting resources, version for rollback decisions, team for finding who owns it.

metadata:
  name: api
  labels:
    app: api
    app.kubernetes.io/name: api
    app.kubernetes.io/version: "2.1.0"
    app.kubernetes.io/component: backend
    team: platform
    environment: production
10-complete.yaml

Full production-ready Deployment. All the pieces together (the PodDisruptionBudget from step 6 ships alongside it as a separate resource). This survives node failures, deploys cleanly, shuts down gracefully, and doesn’t run as root.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  labels:
    app: api
    team: platform
spec:
  replicas: 3
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      terminationGracePeriodSeconds: 60
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: api
                topologyKey: kubernetes.io/hostname
      containers:
        - name: api
          image: api:2.1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 256Mi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
