
K8s by Example: Production Checklist

A Deployment that works in dev often breaks in prod. This checklist covers what you need: resource constraints, health checks, graceful shutdown, high availability, and security hardening.

1-resources.yaml

Set resource requests and limits. Without requests, the scheduler can’t make good placement decisions and your Pod might land on an overloaded node. Without limits, one Pod with a memory leak can starve everything else on the node.

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi

Set the memory limit equal to the request. CPU can burst (set the limit higher than the request) because excess CPU is only throttled, but memory is different: a Pod using more than it requested is a prime eviction candidate under node pressure, and exceeding the limit means an OOMKill. Matching request and limit keeps memory behavior predictable. (The Guaranteed QoS class additionally requires the CPU request to equal the limit; with a bursting CPU limit, the Pod is Burstable, which is fine for most workloads.)

resources:
  requests:
    cpu: 100m
    memory: 512Mi
  limits:
    cpu: 1
    memory: 512Mi
2-probes.yaml

Configure readiness probe. Without it, traffic hits Pods before they’re ready. During rolling updates, users see errors. The probe should check if your app can actually serve requests, not just if the process is alive.

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

Configure liveness probe. It restarts stuck processes. Be careful: if your liveness check is too aggressive, you’ll restart healthy Pods under load. Make it simpler than readiness: just “is the process responding at all?”

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
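
The YAML only wires up the endpoints; the app has to implement them. Here is a minimal sketch in Node.js, assuming an Express server listening on 8080 (dependenciesHealthy is a placeholder for your real checks, such as a database ping): readiness verifies dependencies, liveness just confirms the process responds.

const express = require('express');
const app = express();

// Placeholder: replace with real dependency checks (DB ping, cache, ...).
async function dependenciesHealthy() {
  return true;
}

// Readiness: can this Pod serve real traffic right now?
app.get('/health/ready', async (req, res) => {
  if (await dependenciesHealthy()) {
    res.status(200).send('ready');
  } else {
    res.status(503).send('not ready');
  }
});

// Liveness: is the process responding at all? No dependency checks here,
// or a database outage would restart every Pod.
app.get('/health/live', (req, res) => {
  res.status(200).send('ok');
});

const server = app.listen(8080); // same server the shutdown handler (step 3) closes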
3-shutdown.yaml

Handle SIGTERM. Kubernetes sends SIGTERM, waits terminationGracePeriodSeconds, then sends SIGKILL. Your app must catch SIGTERM and finish in-flight requests. The 30s default is often not enough for long-running operations.

spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]

The preStop sleep trick: Load balancers take a few seconds to remove endpoints. Without the sleep, requests hit terminating Pods. The sleep gives the LB time to update, then your app handles SIGTERM cleanly.

process.on('SIGTERM', async () => {
  console.log('SIGTERM received, finishing requests...');
  // Stop accepting new connections; in-flight requests are allowed to finish.
  server.close(async () => {
    await db.disconnect();
    process.exit(0);
  });
});
4-replicas.yaml

Run multiple replicas. Single replica = single point of failure. During node maintenance, deploys, or crashes, you’re down. Minimum 2 for any service that matters, 3+ for critical paths.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
5-anti-affinity.yaml

Spread Pods across nodes. 3 replicas on 1 node = still a single point of failure. Anti-affinity ensures Pods land on different nodes. Use preferredDuringScheduling so Pods can still schedule if nodes are limited.

spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: api
                topologyKey: kubernetes.io/hostname
6-pdb.yaml

Set a Pod Disruption Budget. Without a PDB, kubectl drain can evict all your Pods at once during maintenance. A PDB limits voluntary disruptions and guarantees minimum availability: maxUnavailable: 1 means only one Pod can be evicted at a time.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: api
7-security.yaml

Drop root privileges. Running as root inside the container is unnecessary for most apps and dangerous if compromised. Set runAsNonRoot: true and specify a user ID.

spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000

Make the filesystem read-only. It prevents a compromised process from writing malware to disk. Use emptyDir for directories that need writes (tmp, cache). Drop all capabilities your app doesn’t need.

      containers:
        - name: api
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
8-rolling-update.yaml

Configure the rolling update strategy. The defaults (maxSurge: 25%, maxUnavailable: 25%) let a quarter of your Pods go down during a rollout. maxSurge = extra Pods during the update (speed). maxUnavailable = Pods that can be down (capacity). For zero downtime, set maxUnavailable: 0: with 3 replicas, Kubernetes starts one new Pod, waits for it to become ready, removes an old one, and repeats.

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
9-labels.yaml

Add operational labels. You’ll need these when everything’s on fire at 3am. app for selecting resources, version for rollback decisions, team for finding who owns it.

metadata:
  name: api
  labels:
    app: api
    app.kubernetes.io/name: api
    app.kubernetes.io/version: "2.1.0"
    app.kubernetes.io/component: backend
    team: platform
    environment: production
10-complete.yaml

Full production-ready Deployment. All the pieces together (the PodDisruptionBudget from step 6 ships alongside it as a separate resource). This survives node failures, deploys cleanly, shuts down gracefully, and doesn’t run as root.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  labels:
    app: api
    team: platform
spec:
  replicas: 3
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      terminationGracePeriodSeconds: 60
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: api
                topologyKey: kubernetes.io/hostname
      containers:
        - name: api
          image: api:2.1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 256Mi
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
