
Zero-Downtime EKS Upgrades in Production

A blue-green node group strategy for EKS cluster upgrades with automated rollback, PodDisruptionBudgets, and Terraform orchestration achieving zero customer impact.

Sabin Joshi
DevOps Engineer

The Upgrade Problem

Kubernetes releases a new minor version every ~4 months. AWS EKS supports each version for approximately 14 months. That means you're upgrading every few months or you're accumulating technical debt and security exposure. Most teams dread EKS upgrades because they cause pod evictions, application restarts, and customer-visible errors.

After one painful upgrade that caused 12 minutes of degraded service, we designed this blue-green node group approach that has since delivered 6 consecutive upgrades with zero customer impact.

ℹ️ This guide covers EKS managed node groups. Self-managed node groups follow similar principles but require more manual steps for AMI rotation.

Blue-Green Node Groups Explained

Instead of upgrading existing nodes in-place, we provision entirely new node groups on the target Kubernetes version (green), drain workloads onto them, then terminate the old nodes (blue). The key insight: the control plane upgrade is separate from the data plane upgrade.
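Concretely, the control plane and data plane upgrades are two independent operations. A sketch with the AWS CLI (the cluster name, subnets, and role ARN below are placeholders, not our actual setup; in our pipeline the equivalent resources are managed through Terraform):

```shell
# 1. Upgrade the control plane only -- running nodes and pods are untouched
aws eks update-cluster-version \
  --name prod-cluster \
  --kubernetes-version 1.29

# 2. Once the control plane reports ACTIVE, create the green node group
#    on the new version, alongside the existing blue group
aws eks create-nodegroup \
  --cluster-name prod-cluster \
  --nodegroup-name green-v129 \
  --kubernetes-version 1.29 \
  --subnets subnet-aaa subnet-bbb \
  --node-role arn:aws:iam::123456789012:role/eks-node-role \
  --labels nodegroup=green
```

The `nodegroup=green` label is what the drain script later uses to tell the two fleets apart.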

Blue-Green Node Group Upgrade Flow (diagram):

  • Step 1, initial state: EKS control plane on v1.28; blue nodes (v1.28) node-1, node-2, node-3 running all pods.
  • Step 2, green ready: control plane upgraded to v1.29; blue nodes cordoned; green nodes (v1.29) node-4, node-5, node-6 provisioned; pods drain from blue to green.
  • Step 3, migrated: blue nodes terminated; all pods running on green nodes.

PodDisruptionBudgets Are Non-Negotiable

Before running a single kubectl drain, every deployment must have a PDB configured. Without them, draining a node can evict all replicas of a critical service simultaneously.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  maxUnavailable: 1   # Never take down more than 1 pod at a time
  selector:
    matchLabels:
      app: api-service

---
# For critical services, use minAvailable instead (example names)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: worker-pdb
spec:
  minAvailable: "80%"  # Always keep at least 80% of pods running
  selector:
    matchLabels:
      app: worker
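It's worth checking not just that PDBs exist, but that they currently allow disruptions: a PDB whose ALLOWED DISRUPTIONS column sits at 0 will stall `kubectl drain` indefinitely. A minimal sketch of that check, run here against simulated `kubectl get pdb -A --no-headers` output (the PDB names are hypothetical; in practice, feed it the live command's output):

```shell
# Simulated output; in production replace with:
#   sample=$(kubectl get pdb -A --no-headers)
# Columns: NAMESPACE NAME MIN-AVAILABLE MAX-UNAVAILABLE ALLOWED-DISRUPTIONS AGE
sample="default   api-service-pdb  N/A  1    2  5d
default   worker-pdb       80%  N/A  0  5d"

# PDBs with zero allowed disruptions will block the drain phase
blocked=$(printf '%s\n' "$sample" | awk '$5 == 0 { print $1 "/" $2 }')
echo "$blocked"
```

Running this against the simulated data flags `default/worker-pdb` as a drain blocker; fix those (scale up, or loosen the budget) before the upgrade window opens.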

Automated Upgrade Script

We wrapped the entire procedure in a Bash script triggered by GitHub Actions. The script has three phases: Pre-flight checks, Green node provisioning, and Drain and cleanup. Crucially, it auto-aborts and rolls back if any phase detects degraded pod health.

#!/bin/bash
# eks-upgrade.sh — Phase 2: Drain blue nodes

BLUE_NODES=$(kubectl get nodes -l nodegroup=blue -o name)

for node in $BLUE_NODES; do
  echo "Draining $node..."

  # Cordon first (no new pods scheduled)
  kubectl cordon "$node"

  # Check pod health before draining (--no-headers so the header
  # line is not miscounted as an unhealthy pod)
  UNHEALTHY=$(kubectl get pods -A --no-headers \
    --field-selector spec.nodeName="${node#node/}" \
    | grep -v Running | grep -v Completed | wc -l)

  if [ "$UNHEALTHY" -gt 0 ]; then
    echo "❌ Unhealthy pods detected. Aborting."
    kubectl uncordon "$node"
    exit 1
  fi

  kubectl drain "$node" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout=300s
done
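The grep pipeline in the health check treats anything that is neither Running nor Completed as unhealthy; note that kubectl's header line must be excluded (e.g. with `--no-headers`), or it inflates the count. A self-contained demo on simulated output (pod names are made up):

```shell
# Simulated `kubectl get pods -A --no-headers` output for one node.
# The CrashLoopBackOff pod should count; the Completed job should not.
pods="kube-system   coredns-5d78c   1/1  Running           0  14d
default       api-service-8f  0/1  CrashLoopBackOff  7  12m
default       migrate-job-1   0/1  Completed         0  2d"

UNHEALTHY=$(printf '%s\n' "$pods" | grep -v Running | grep -v Completed | wc -l)
echo "Unhealthy pods: $UNHEALTHY"
```

Here the pipeline reports exactly one unhealthy pod, so the drain would abort rather than pile evictions onto an already degraded node.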

Upgrade Checklist

  • ✅ Update EKS add-ons (kube-proxy, CoreDNS, vpc-cni) before node drain
  • ✅ Verify all PDBs are in place with kubectl get pdb -A
  • ✅ Scale down non-critical workloads before upgrade window
  • ✅ Have a tested rollback procedure (keep blue nodes for 24 hours)
  • ✅ Monitor error rates in CloudWatch during drain phase
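For the add-on step, the managed add-ons can be bumped with the AWS CLI before any node is drained. A sketch (the cluster name is a placeholder; this picks the first version `describe-addon-versions` lists for the target Kubernetes release, which you should verify against the EKS docs for your cluster):

```shell
# Upgrade managed add-ons ahead of the node migration.
# --resolve-conflicts PRESERVE keeps any custom add-on configuration.
for addon in kube-proxy coredns vpc-cni; do
  aws eks update-addon \
    --cluster-name prod-cluster \
    --addon-name "$addon" \
    --addon-version "$(aws eks describe-addon-versions \
        --addon-name "$addon" --kubernetes-version 1.29 \
        --query 'addons[0].addonVersions[0].addonVersion' --output text)" \
    --resolve-conflicts PRESERVE
done
```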
💡 Keep your blue node group around for 24 hours after migration. If something goes wrong in production, you can quickly schedule pods back onto the old nodes while you investigate.
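The rollback path this enables is deliberately boring: make blue schedulable again, cordon green, and bounce the affected workloads so their pods land back on the old nodes. A sketch (the deployment name is illustrative):

```shell
# Emergency rollback: make blue schedulable again
for node in $(kubectl get nodes -l nodegroup=blue -o name); do
  kubectl uncordon "$node"
done

# Cordon green so rescheduled pods land on blue
for node in $(kubectl get nodes -l nodegroup=green -o name); do
  kubectl cordon "$node"
done

# Restart the affected deployment so its pods move back onto blue nodes
kubectl rollout restart deployment api-service
```

Because the blue nodes are still on the old Kubernetes version, this buys investigation time without a full cluster downgrade.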