Multi-Region AWS Infrastructure for Resilience: A Terraform Deep Dive

Why Multi-Region Matters

Single-region architectures are a ticking time bomb for companies that can't afford downtime. AWS regions fail. Not often, but when they do, everything hosted in that one region goes down with it. This article walks through a production-ready multi-region setup that held better than 99.99% uptime over 12 months, including one real AWS regional incident and one planned DR test.

ℹ️ This setup was designed for a fintech platform serving 2M+ users. The architecture assumes active-active routing for request traffic with automatic failover, while the database keeps a single writer region. Adapt the cost considerations to your use case.

Architecture Overview

The core idea: two fully independent AWS regions (us-east-1 and us-west-2) with identical infrastructure, connected via Transit Gateway peering and routed via Route 53 latency-based policies.
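As a rough illustration of the peering link described above, the sketch below shows how an inter-region Transit Gateway peering attachment might be declared. The provider aliases (`aws.east`, `aws.west`), the variable names, and the 10.1.0.0/16 CIDR for us-west-2 are assumptions made for the example, not details from the original setup.

```hcl
# Sketch only: provider aliases, variable names, and the west CIDR are assumptions.
provider "aws" {
  alias  = "east"
  region = "us-east-1"
}

provider "aws" {
  alias  = "west"
  region = "us-west-2"
}

variable "east_tgw_id" { type = string }
variable "west_tgw_id" { type = string }
variable "east_tgw_route_table_id" { type = string }

# Request the peering from us-east-1 toward the us-west-2 Transit Gateway.
resource "aws_ec2_transit_gateway_peering_attachment" "east_to_west" {
  provider                = aws.east
  transit_gateway_id      = var.east_tgw_id
  peer_transit_gateway_id = var.west_tgw_id
  peer_region             = "us-west-2"
}

# Accept the peering request on the us-west-2 side.
resource "aws_ec2_transit_gateway_peering_attachment_accepter" "west" {
  provider                      = aws.west
  transit_gateway_attachment_id = aws_ec2_transit_gateway_peering_attachment.east_to_west.id
}

# Peering attachments do not propagate routes, so add a static route to the peer CIDR.
resource "aws_ec2_transit_gateway_route" "east_to_west" {
  provider                       = aws.east
  destination_cidr_block         = "10.1.0.0/16" # assumed us-west-2 VPC CIDR
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_peering_attachment.east_to_west.id
  transit_gateway_route_table_id = var.east_tgw_route_table_id

  depends_on = [aws_ec2_transit_gateway_peering_attachment_accepter.west]
}
```

A mirror-image route on the us-west-2 route table, pointing back at 10.0.0.0/16, would be needed for return traffic.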

Figure: Multi-region architecture diagram.

Terraform Module Structure

We manage this with a clean Terraform module structure. The key principle: identical modules per region, driven by a top-level configuration.

# terraform/
#  modules/
#    vpc/         — VPC, subnets, route tables
#    compute/     — ECS cluster, Fargate tasks
#    database/    — Aurora global cluster
#    tgw/         — Transit Gateway + peering
#  regions/
#    us-east-1/main.tf
#    us-west-2/main.tf
#  global/
#    route53.tf
#    global-accelerator.tf

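# regions/us-east-1/main.tf
# VPC for the primary region, spread across three availability zones.
module "vpc_primary" {
  source     = "../../modules/vpc"
  cidr_block = "10.0.0.0/16"
  azs        = ["us-east-1a", "us-east-1b", "us-east-1c"]
  region     = "us-east-1"
}

# Transit Gateway for this region, peered with the us-west-2 gateway.
module "transit_gateway" {
  source      = "../../modules/tgw"
  primary_vpc = module.vpc_primary.vpc_id
  peer_region = "us-west-2"
  peer_tgw_id = var.west_tgw_id
}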
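Because the same modules are reused per region, the us-west-2 root would look almost identical. The sketch below assumes the module inputs shown above; the module label `vpc_secondary`, the 10.1.0.0/16 CIDR block, and `var.east_tgw_id` are hypothetical. The main constraint to watch is that the two VPC CIDRs must not overlap, since overlapping ranges cannot be routed cleanly across the peering.

```hcl
# regions/us-west-2/main.tf (sketch; names and CIDR are assumptions)
variable "east_tgw_id" {
  type        = string
  description = "ID of the us-east-1 Transit Gateway to peer with (hypothetical input)"
}

module "vpc_secondary" {
  source     = "../../modules/vpc"
  cidr_block = "10.1.0.0/16" # must not overlap with 10.0.0.0/16 in us-east-1
  azs        = ["us-west-2a", "us-west-2b", "us-west-2c"]
  region     = "us-west-2"
}

module "transit_gateway" {
  source      = "../../modules/tgw"
  primary_vpc = module.vpc_secondary.vpc_id
  peer_region = "us-east-1"
  peer_tgw_id = var.east_tgw_id
}
```

Keeping the two roots this close makes drift easy to spot: a diff between the two `main.tf` files should show only region names, CIDRs, and peer IDs.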

Failover Strategy

Route 53 health checks probe both NLBs every 10 seconds. If an endpoint fails 3 consecutive checks (roughly 30 seconds of detection time), Route 53 stops returning it and traffic shifts to the healthy region as resolver caches expire. The typical failover time is under 60 seconds end-to-end.
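The 10-second interval and 3-failure threshold map directly onto two arguments of the `aws_route53_health_check` resource. The following is a minimal sketch for one region; the variable name, port, and the choice of a TCP check are assumptions, and an HTTP or HTTPS check against a health path may suit your service better.

```hcl
# Sketch only: the input variable, port, and check type are assumptions.
variable "east_nlb_dns_name" {
  type        = string
  description = "Public DNS name of the us-east-1 NLB (hypothetical input)"
}

resource "aws_route53_health_check" "east_nlb" {
  fqdn              = var.east_nlb_dns_name
  port              = 443
  type              = "TCP"
  request_interval  = 10 # fast interval: probe every 10 seconds
  failure_threshold = 3  # mark unhealthy after 3 consecutive failures

  tags = {
    Name = "east-nlb"
  }
}
```

A second, identical resource pointed at the us-west-2 NLB would cover the other region.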

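💡 Pro tip: Set the TTL on your Route 53 records to 30 seconds (not 60) so client and resolver caches expire faster during failover events. The cost in DNS query volume is negligible. This applies to non-alias records; alias records take their TTL from the target resource and cannot be overridden.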
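A latency-routed record with an explicit 30-second TTL might look like the sketch below. It uses a CNAME because Route 53 alias records do not accept a `ttl` argument. The record name `api.example.com` and all three input variables are hypothetical.

```hcl
# Sketch only: record name and input variables are assumptions.
variable "zone_id" { type = string }
variable "east_record_target" { type = string }   # e.g. the us-east-1 NLB DNS name
variable "east_health_check_id" { type = string } # ID of a Route 53 health check

resource "aws_route53_record" "api_east" {
  zone_id         = var.zone_id
  name            = "api.example.com" # hypothetical hostname, not a zone apex
  type            = "CNAME"
  ttl             = 30
  records         = [var.east_record_target]
  set_identifier  = "us-east-1"
  health_check_id = var.east_health_check_id

  # Route each client to the lowest-latency region among the healthy records.
  latency_routing_policy {
    region = "us-east-1"
  }
}
```

A matching record with `set_identifier = "us-west-2"` and its own health check completes the pair. When one record's health check fails, Route 53 answers with the remaining healthy record.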

Aurora Global Database

Aurora Global Database handles cross-region replication with typical lag under 1 second. The secondary cluster in us-west-2 is read-only during normal operations. On failover, you promote it to take writes, a process that takes approximately 1 minute and is fully automated with a Lambda failover function.

# Aurora Global Cluster
resource "aws_rds_global_cluster" "main" {
  global_cluster_identifier = "prod-global"
  engine                    = "aurora-postgresql"
  engine_version            = "15.4"
  database_name             = "appdb"
  storage_encrypted         = true
  deletion_protection       = true
}
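The global cluster above is only a container; each region still needs its own `aws_rds_cluster` attached to it. The sketch below follows the pattern from the AWS provider documentation. The provider aliases, cluster identifiers, instance class, and variables are all assumptions, and subnet groups and security groups are left out for brevity.

```hcl
# Sketch only: aliases, identifiers, instance class, and variables are assumptions.
provider "aws" {
  alias  = "east"
  region = "us-east-1"
}

provider "aws" {
  alias  = "west"
  region = "us-west-2"
}

variable "db_username" { type = string }
variable "db_password" {
  type      = string
  sensitive = true
}
variable "west_kms_key_arn" { type = string }

# Primary (writer) cluster in us-east-1.
resource "aws_rds_cluster" "primary" {
  provider                  = aws.east
  cluster_identifier        = "prod-east"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  engine                    = aws_rds_global_cluster.main.engine
  engine_version            = aws_rds_global_cluster.main.engine_version
  database_name             = "appdb"
  master_username           = var.db_username
  master_password           = var.db_password
  storage_encrypted         = true
}

resource "aws_rds_cluster_instance" "primary" {
  provider           = aws.east
  cluster_identifier = aws_rds_cluster.primary.id
  engine             = aws_rds_cluster.primary.engine
  engine_version     = aws_rds_cluster.primary.engine_version
  instance_class     = "db.r6g.large"
}

# Secondary (read-only) cluster in us-west-2. It takes no master credentials
# because its data is replicated from the primary.
resource "aws_rds_cluster" "secondary" {
  provider                  = aws.west
  cluster_identifier        = "prod-west"
  global_cluster_identifier = aws_rds_global_cluster.main.id
  engine                    = aws_rds_global_cluster.main.engine
  engine_version            = aws_rds_global_cluster.main.engine_version
  storage_encrypted         = true
  kms_key_id                = var.west_kms_key_arn # KMS key in the secondary's own region

  # The primary must have an instance before a secondary can join.
  depends_on = [aws_rds_cluster_instance.primary]
}
```

The secondary cluster would also need at least one `aws_rds_cluster_instance` of its own to serve reads and to be ready for promotion.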

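# Lambda failover trigger: the alarm notifies the SNS topic used for failover.
# The metric fields below are placeholders (assumptions); substitute the metric
# that signals primary failure in your setup.
resource "aws_cloudwatch_metric_alarm" "rds_failover" {
  alarm_name          = "aurora-primary-unhealthy"
  alarm_actions       = [aws_sns_topic.failover.arn]
  namespace           = "Custom/Aurora"              # hypothetical namespace
  metric_name         = "PrimaryHealthCheckFailures" # hypothetical metric
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1
}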
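For the alarm to actually reach the failover function, the SNS topic needs a Lambda subscription and the function needs to allow invocation by SNS. A minimal sketch of that wiring follows. The topic name and both Lambda variables are assumptions, and the topic is declared here only so the example is self-contained.

```hcl
# Sketch only: topic name and Lambda inputs are assumptions.
variable "failover_lambda_arn" { type = string }
variable "failover_lambda_name" { type = string }

resource "aws_sns_topic" "failover" {
  name = "aurora-failover"
}

# Deliver alarm notifications to the failover function.
resource "aws_sns_topic_subscription" "failover_lambda" {
  topic_arn = aws_sns_topic.failover.arn
  protocol  = "lambda"
  endpoint  = var.failover_lambda_arn
}

# Allow SNS to invoke the function; without this the delivery is rejected.
resource "aws_lambda_permission" "allow_sns" {
  statement_id  = "AllowExecutionFromSNS"
  action        = "lambda:InvokeFunction"
  function_name = var.failover_lambda_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.failover.arn
}
```

One design point worth considering: automation that lives only in the failing region may not run when it is needed, so placing the alarm, topic, and function in the secondary region is likely the safer choice.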

Cost Considerations

Multi-region is expensive. The costs that roughly double are EC2/Fargate compute, RDS instances, and NAT Gateways, and cross-region data transfer comes on top of that. Our fully loaded multi-region setup added approximately 65% to our AWS bill versus single-region. For a platform where downtime costs $50K/hour, it's an obvious trade-off.

⚠️ Don't run identical instance sizes in both regions. The secondary region can use smaller instances if your RTO allows for a few minutes of scale-up during failover. We cut cross-region costs by 30% this way.
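For Fargate services, one way to express a smaller secondary footprint is a low autoscaling floor with enough ceiling to absorb full traffic. The sketch below is intended for the us-west-2 root and uses Application Auto Scaling; the cluster and service name variables, capacity numbers, and CPU target are illustrative assumptions, not the values used in production.

```hcl
# Sketch only: names, capacities, and the CPU target are assumptions.
variable "west_ecs_cluster_name" { type = string }
variable "west_ecs_service_name" { type = string }

resource "aws_appautoscaling_target" "west_service" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${var.west_ecs_cluster_name}/${var.west_ecs_service_name}"
  min_capacity       = 2  # reduced warm footprint during normal operations
  max_capacity       = 20 # headroom to take the full load after failover
}

resource "aws_appautoscaling_policy" "west_cpu" {
  name               = "west-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.west_service.service_namespace
  scalable_dimension = aws_appautoscaling_target.west_service.scalable_dimension
  resource_id        = aws_appautoscaling_target.west_service.resource_id

  target_tracking_scaling_policy_configuration {
    target_value       = 60  # average CPU utilization, percent
    scale_out_cooldown = 30  # react quickly when traffic shifts
    scale_in_cooldown  = 300 # shrink slowly once traffic returns

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```

The short scale-out cooldown is what keeps the "few minutes of scale-up" within the RTO, so it is worth testing against a realistic traffic shift before relying on it.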

Results After 12 Months

After running this architecture in production for 12 months, we achieved 99.994% uptime (about 32 minutes of total downtime over the year), two successful transparent failovers (one planned DR test, one real AWS incident), and zero customer-reported outages during failover events. The investment paid for itself the first time us-east-1 had its bad day.