Zero-Downtime EKS Upgrades in Production
A blue-green node group strategy for EKS cluster upgrades with automated rollback, PodDisruptionBudgets, and Terraform orchestration achieving zero customer impact.
The Upgrade Problem
Kubernetes releases a new minor version every ~4 months. AWS EKS supports each version for approximately 14 months. That means you're upgrading every few months or you're accumulating technical debt and security exposure. Most teams dread EKS upgrades because they cause pod evictions, application restarts, and customer-visible errors.
After one painful upgrade that caused 12 minutes of degraded service, we designed this blue-green node group approach that has since delivered 6 consecutive upgrades with zero customer impact.
Blue-Green Node Groups Explained
Instead of upgrading existing nodes in-place, we provision entirely new node groups on the target Kubernetes version (green), drain workloads onto them, then terminate the old nodes (blue). The key insight: the control plane upgrade is separate from the data plane upgrade.
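In Terraform terms, the green side is just a second `aws_eks_node_group` resource pinned to the target version. A minimal sketch — the variable names (`var.target_k8s_version`, `var.node_count`, etc.) and resource names are illustrative, not our exact module:

```hcl
# Green node group on the target Kubernetes version.
# Blue is an identical resource still on the old version; it is
# destroyed only after the drain phase completes and soaks.
resource "aws_eks_node_group" "green" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "green-${replace(var.target_k8s_version, ".", "-")}"
  version         = var.target_k8s_version # e.g. "1.29"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids

  scaling_config {
    desired_size = var.node_count
    min_size     = var.node_count
    max_size     = var.node_count * 2 # headroom while both colors run
  }

  labels = {
    nodegroup = "green" # the drain script selects blue/green by this label
  }
}
```

Running both colors simultaneously roughly doubles node cost for the duration of the upgrade window — a deliberate trade for a reversible migration.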
PodDisruptionBudgets Are Non-Negotiable
Before running a single kubectl drain, every deployment must have a PDB configured. Without them, draining a node can evict all replicas of a critical service simultaneously.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  maxUnavailable: 1 # Never take down more than 1 pod at a time
  selector:
    matchLabels:
      app: api-service
---
# For critical services, use minAvailable instead
spec:
  minAvailable: "80%" # Always keep 80% of pods running
```
Automated Upgrade Script
We wrapped the entire procedure in a Bash script triggered by GitHub Actions. The script has three phases: Pre-flight checks, Green node provisioning, and Drain and cleanup. Crucially, it auto-aborts and rolls back if any phase detects degraded pod health.
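The pre-flight phase also enforces EKS's upgrade constraint: the control plane can only move one minor version per upgrade, so the script refuses to run if the target version is more than one minor ahead of the current nodes. A sketch of that check — `skew_ok` is a hypothetical helper, not the exact function in our script:

```bash
#!/bin/bash
# Phase 1 sketch: version-skew gate.
# skew_ok TARGET CURRENT -> succeeds if TARGET is 0 or 1 minor ahead.
skew_ok() {
  local target_minor=${1#*.} node_minor=${2#*.}
  local skew=$((target_minor - node_minor))
  [ "$skew" -ge 0 ] && [ "$skew" -le 1 ]
}

skew_ok "1.29" "1.28" && echo "OK to proceed"       # one minor ahead: fine
skew_ok "1.30" "1.28" || echo "Abort: skew too large" # two minors: refuse
```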
```bash
#!/bin/bash
# eks-upgrade.sh — Phase 2: Drain blue nodes

BLUE_NODES=$(kubectl get nodes -l nodegroup=blue -o name)

for node in $BLUE_NODES; do
  echo "Draining $node..."

  # Cordon first (no new pods scheduled)
  kubectl cordon "$node"

  # Check pod health before draining; --no-headers keeps the header
  # row from being miscounted as an unhealthy pod
  UNHEALTHY=$(kubectl get pods -A --no-headers \
    --field-selector spec.nodeName="${node#node/}" \
    | grep -v Running | grep -v Completed | wc -l)

  if [ "$UNHEALTHY" -gt 0 ]; then
    echo "❌ Unhealthy pods detected on $node. Aborting."
    kubectl uncordon "$node"
    exit 1
  fi

  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
done
```
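The health gate in that loop can be factored into a small function that reads `kubectl get pods --no-headers`-style output on stdin, which makes it unit-testable offline with captured output. A sketch — `count_unhealthy` is a hypothetical helper name:

```bash
#!/bin/bash
# count_unhealthy: count pods whose STATUS is neither Running nor Completed.
# Reads `kubectl get pods -A --no-headers` output on stdin.
count_unhealthy() {
  # `|| true` so a count of 0 (grep exit 1) doesn't trip `set -e` callers
  grep -cvE 'Running|Completed' || true
}

# Captured sample: one Pending pod among healthy ones
sample='default   api-7f9b6    1/1   Running     0   4d
default   backup-x1    0/1   Completed   0   1d
default   api-7f8c2    0/1   Pending     0   2m'

echo "$sample" | count_unhealthy   # prints 1
```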
Upgrade Checklist
- ✅ Update EKS add-ons (kube-proxy, CoreDNS, vpc-cni) before node drain
- ✅ Verify all PDBs are in place with `kubectl get pdb -A`
- ✅ Scale down non-critical workloads before the upgrade window
- ✅ Have a tested rollback procedure (keep blue nodes for 24 hours)
- ✅ Monitor error rates in CloudWatch during drain phase
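Because the blue nodes are only cordoned and drained, not terminated, rollback inside the 24-hour window is just reversing the traffic direction. A sketch of that path — the `run` wrapper, `DRY_RUN` flag, and label names are illustrative, not our exact script:

```bash
#!/bin/bash
# Rollback sketch. With DRY_RUN=1 (the default) it only prints the
# commands it would run, so it is safe to review before executing.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# 1. Let workloads schedule back onto the retained blue nodes
run kubectl uncordon -l nodegroup=blue
# 2. Stop new pods landing on green, then move workloads off it
run kubectl cordon -l nodegroup=green
run kubectl drain -l nodegroup=green --ignore-daemonsets --delete-emptydir-data
```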