Multi-Region AWS Infrastructure for Resilience: A Terraform Deep Dive
How to architect highly available, multi-region AWS infrastructure using Terraform, Transit Gateway, and intelligent routing strategies for enterprise-grade applications.
Why Multi-Region Matters
Single-region architectures are a ticking time bomb for companies that can't afford downtime. AWS regions fail. Not often, but when one does, every platform that lives only in that region goes down with it. This article walks through a production-ready multi-region setup that has kept systems running at 99.99% uptime over 12 months, through one real AWS incident and one planned DR test.
Architecture Overview
The core idea: two fully independent AWS regions (us-east-1 and us-west-2) with identical infrastructure, connected via Transit Gateway peering and routed via Route 53 latency-based policies.
Terraform Module Structure
We manage this with a clean Terraform module structure. The key principle: identical modules per region, driven by a top-level configuration.
# terraform/
#   modules/
#     vpc/       - VPC, subnets, route tables
#     compute/   - ECS cluster, Fargate tasks
#     database/  - Aurora global cluster
#     tgw/       - Transit Gateway + peering
#   regions/
#     us-east-1/main.tf
#     us-west-2/main.tf
#   global/
#     route53.tf
#     global-accelerator.tf
module "vpc_primary" {
  source     = "../../modules/vpc"
  cidr_block = "10.0.0.0/16"
  azs        = ["us-east-1a", "us-east-1b", "us-east-1c"]
  region     = "us-east-1"
}

module "transit_gateway" {
  source      = "../../modules/tgw"
  primary_vpc = module.vpc_primary.vpc_id
  peer_region = "us-west-2"
  peer_tgw_id = var.west_tgw_id
}
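The article does not show the inside of modules/tgw, so the following is only a sketch of what such a module might contain. It assumes the module receives an aliased provider aws.peer for the peer region, and the inputs subnet_ids and peer_cidr are hypothetical names that do not appear in the module call above. It also assumes the two VPCs use non-overlapping CIDR blocks (for example 10.0.0.0/16 and 10.1.0.0/16), since Transit Gateway routing cannot disambiguate overlapping ranges.

```hcl
# Sketch of modules/tgw/main.tf. Only primary_vpc, peer_region and
# peer_tgw_id come from the article; everything else is an assumption.
terraform {
  required_providers {
    aws = {
      source                = "hashicorp/aws"
      configuration_aliases = [aws.peer] # provider for the peer region
    }
  }
}

variable "primary_vpc" { type = string }
variable "peer_region" { type = string }
variable "peer_tgw_id" { type = string }
variable "subnet_ids" { type = list(string) } # hypothetical: one subnet per AZ
variable "peer_cidr" { type = string }        # hypothetical: e.g. "10.1.0.0/16"

resource "aws_ec2_transit_gateway" "this" {
  description = "Regional transit gateway"
}

resource "aws_ec2_transit_gateway_vpc_attachment" "vpc" {
  transit_gateway_id = aws_ec2_transit_gateway.this.id
  vpc_id             = var.primary_vpc
  subnet_ids         = var.subnet_ids
}

# Requester side of the inter-region peering.
resource "aws_ec2_transit_gateway_peering_attachment" "peer" {
  transit_gateway_id      = aws_ec2_transit_gateway.this.id
  peer_transit_gateway_id = var.peer_tgw_id
  peer_region             = var.peer_region
}

# Accepter side, created with the peer-region provider.
resource "aws_ec2_transit_gateway_peering_attachment_accepter" "peer" {
  provider                      = aws.peer
  transit_gateway_attachment_id = aws_ec2_transit_gateway_peering_attachment.peer.id
}

# Peering attachments do not propagate routes, so a static route is needed.
resource "aws_ec2_transit_gateway_route" "to_peer" {
  destination_cidr_block         = var.peer_cidr
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_peering_attachment.peer.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway.this.association_default_route_table_id

  depends_on = [aws_ec2_transit_gateway_peering_attachment_accepter.peer]
}
```

The VPC route tables would also need a route for the peer CIDR that targets the transit gateway, and the peer region needs the mirror-image static route back.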
Failover Strategy
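Route 53 health checks probe both NLBs every 10 seconds. If an endpoint fails 3 consecutive checks (roughly 30 seconds), Route 53 stops answering with that region's record and traffic shifts to the healthy region automatically. The rest of the window is DNS caching on the client side, so typical failover time is under 60 seconds end-to-end.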
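As a rough illustration, the health checks and latency records described above could be expressed as follows. This is a sketch and not the article's actual route53.tf: the variables zone_id, domain_name and nlbs are hypothetical, and a TCP check on port 443 is assumed because the article does not say which protocol the NLBs expose.

```hcl
variable "zone_id" { type = string }     # hypothetical: public hosted zone ID
variable "domain_name" { type = string } # hypothetical: e.g. "app.example.com"

# Hypothetical input keyed by region name ("us-east-1", "us-west-2"),
# carrying each NLB's DNS name and its canonical hosted zone ID.
variable "nlbs" {
  type = map(object({
    dns_name = string
    zone_id  = string
  }))
}

# One health check per regional NLB: 10-second interval, 3 failures to trip.
resource "aws_route53_health_check" "nlb" {
  for_each          = var.nlbs
  fqdn              = each.value.dns_name
  type              = "TCP"
  port              = 443
  request_interval  = 10
  failure_threshold = 3
}

# One latency-based alias record per region, tied to its health check.
resource "aws_route53_record" "app" {
  for_each        = var.nlbs
  zone_id         = var.zone_id
  name            = var.domain_name
  type            = "A"
  set_identifier  = each.key
  health_check_id = aws_route53_health_check.nlb[each.key].id

  latency_routing_policy {
    region = each.key
  }

  alias {
    name                   = each.value.dns_name
    zone_id                = each.value.zone_id
    evaluate_target_health = true
  }
}
```

When one region's health check fails, Route 53 omits that record from answers and all clients resolve to the remaining region, regardless of latency.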
Aurora Global Database
Aurora Global Database handles cross-region replication with typical lag under 1 second. The secondary cluster in us-west-2 is read-only during normal operations. On failover, it is promoted to accept writes, a process that takes approximately 1 minute and is fully automated with a Lambda failover function.
# Aurora Global Cluster
resource "aws_rds_global_cluster" "main" {
  global_cluster_identifier = "prod-global"
  engine                    = "aurora-postgresql"
  engine_version            = "15.4"
  database_name             = "appdb"
  storage_encrypted         = true
  deletion_protection       = true
}
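The global cluster resource above is only a container; the regional clusters have to be attached to it. The article does not show that part, so the following is a hedged sketch. The provider aliases, cluster identifiers, instance class and all variables are assumptions made for illustration.

```hcl
provider "aws" {
  alias  = "east"
  region = "us-east-1"
}

provider "aws" {
  alias  = "west"
  region = "us-west-2"
}

variable "db_username" { type = string }          # hypothetical
variable "east_db_subnet_group" { type = string } # hypothetical
variable "west_db_subnet_group" { type = string } # hypothetical
variable "west_kms_key_arn" { type = string }     # hypothetical: KMS key in us-west-2
variable "db_password" {
  type      = string
  sensitive = true
}

# Writer cluster in the primary region.
resource "aws_rds_cluster" "primary" {
  provider                  = aws.east
  cluster_identifier        = "prod-east"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  engine                    = aws_rds_global_cluster.main.engine
  engine_version            = aws_rds_global_cluster.main.engine_version
  database_name             = aws_rds_global_cluster.main.database_name
  master_username           = var.db_username
  master_password           = var.db_password
  storage_encrypted         = true
  db_subnet_group_name      = var.east_db_subnet_group
}

resource "aws_rds_cluster_instance" "primary" {
  provider           = aws.east
  cluster_identifier = aws_rds_cluster.primary.id
  engine             = aws_rds_cluster.primary.engine
  engine_version     = aws_rds_cluster.primary.engine_version
  instance_class     = "db.r6g.large" # assumed size
}

# Read-only secondary cluster. It inherits credentials and data from the
# global cluster, so no master_username or database_name is set here.
resource "aws_rds_cluster" "secondary" {
  provider                  = aws.west
  cluster_identifier        = "prod-west"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  engine                    = aws_rds_global_cluster.main.engine
  engine_version            = aws_rds_global_cluster.main.engine_version
  storage_encrypted         = true
  kms_key_id                = var.west_kms_key_arn # an encrypted secondary needs a key in its own region
  db_subnet_group_name      = var.west_db_subnet_group

  # The primary must have an instance before a secondary can join.
  depends_on = [aws_rds_cluster_instance.primary]
}
```

The secondary cluster would also need at least one aws_rds_cluster_instance in us-west-2 so that it can serve reads and take over as writer after promotion.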
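# Lambda failover trigger: the alarm publishes to SNS, which invokes the failover function
resource "aws_cloudwatch_metric_alarm" "rds_failover" {
  alarm_name          = "aurora-primary-unhealthy"
  alarm_actions       = [aws_sns_topic.failover.arn]
  # Assumption: the original omits the metric; this custom metric is illustrative only
  namespace           = "Custom/Aurora"
  metric_name         = "PrimaryUnhealthy"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
}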
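For the alarm to reach the failover function, the SNS topic needs a Lambda subscription and the function needs to allow invocation by SNS. A minimal sketch of that wiring follows; aws_lambda_function.failover is a hypothetical resource name, since the article does not show the function itself.

```hcl
# Deliver messages from the failover topic to the Lambda function.
resource "aws_sns_topic_subscription" "failover_lambda" {
  topic_arn = aws_sns_topic.failover.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.failover.arn # hypothetical function
}

# Allow SNS to invoke the function, scoped to this one topic.
resource "aws_lambda_permission" "allow_sns" {
  statement_id  = "AllowExecutionFromSNS"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.failover.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.failover.arn
}
```

One design point worth considering: the alarm, topic and function probably belong in the secondary region, since automation hosted in the failing region may be unavailable exactly when it is needed.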
Cost Considerations
Multi-region is expensive. Here's what roughly doubles: EC2/Fargate compute, RDS instances, and NAT Gateways, with cross-region data transfer charges on top. The total bill does not double, though: our fully loaded multi-region setup added approximately 65% to our AWS bill versus single-region. For a platform where downtime costs $50K/hour, it's an obvious trade-off.
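To make the trade-off concrete, here is a back-of-the-envelope break-even calculation. The 65% overhead and the $50K/hour downtime cost come from the figures above; the monthly bill B is a placeholder, not a number from this article.

```latex
% B: hypothetical monthly single-region AWS bill in USD (an assumption)
% H: hours of avoided downtime per year needed to break even
\[
  H \;=\; \frac{0.65 \times 12 \times B}{50{,}000}
    \;=\; 1.56 \times 10^{-4}\, B
\]
% Illustrative example: B = 100{,}000 gives H = 15.6 hours per year.
```

In other words, under these assumptions a platform spending $100K/month in a single region would need multi-region to prevent about 15.6 hours of downtime per year to pay for itself on direct downtime cost alone.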
Results After 12 Months
After running this architecture in production for 12 months, we achieved: 99.994% uptime, two successful transparent failovers (one planned DR test, one real AWS incident), and zero customer-reported outages during failover events. The investment paid for itself the first time us-east-1 had its bad day.