Monitoring & Alerting a Web App on AWS: The Minimal Setup That Actually Works
Stop drowning in CloudWatch noise. Here's the lean, opinionated monitoring stack — ALB, RDS, SQS, and budget alerts — that keeps your team sane and your customers happy.
The Problem with Most Monitoring Setups
Most teams set up CloudWatch alarms for everything — CPU utilization, memory, network in/out, disk I/O, and every metric AWS exposes. The result? A noisy alert system that wakes engineers up at 3am for things that aren't customer-impacting, breeding alert fatigue until the alerts everyone actually needs get ignored.
The right mental model: only alert when customers are affected or when the application hits a failure it can't self-heal. Everything else can be a dashboard metric you check during working hours.
Your Typical Web App Stack
This setup targets the most common AWS web application architecture: clients hitting an Application Load Balancer, which routes to EC2 app servers, backed by an RDS database and an SQS queue processed by background worker instances.
Monitoring the Application Load Balancer
The ALB is the front door to your entire infrastructure and the most important component to watch. It sits between your users and your backend, which makes it the single source of truth for customer-visible errors. Four alarms — load-balancer 5XX, target 5XX, high latency, and rejected connections — give you complete coverage.
All four alarms share the same `LoadBalancer` dimension, set to the ALB's ARN suffix. Here's how to wire them up in Terraform (the AWS CLI's `put-metric-alarm` takes the same parameters):
# Terraform — ALB 5XX alarm (same pattern for all 4)
resource "aws_cloudwatch_metric_alarm" "alb_5xx_elb" {
  alarm_name          = "alb-5xx-load-balancer"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "HTTPCode_ELB_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 60 # 1 minute
  statistic           = "Sum"
  threshold           = 1
  treat_missing_data  = "notBreaching" # no traffic = no error
  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
  }
  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
# Latency alarm — use p99 for high traffic (>1000 req/min)
resource "aws_cloudwatch_metric_alarm" "alb_latency" {
  alarm_name          = "alb-high-latency"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Average" # for p99, set extended_statistic = "p99" instead
  threshold           = 0.2 # 200ms — tune to your baseline
  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
  }
  alarm_actions = [aws_sns_topic.alerts.arn]
}
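The remaining two alarms follow the 5XX pattern above almost verbatim: target 5XX (your app servers returning errors, as opposed to the ALB itself) and rejected connections (the ALB dropping requests outright). A sketch, assuming the same load balancer and SNS topic references:

```hcl
# Target 5XX — errors returned by your app servers (vs. the ALB itself)
resource "aws_cloudwatch_metric_alarm" "alb_5xx_target" {
  alarm_name          = "alb-5xx-target"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Sum"
  threshold           = 1
  treat_missing_data  = "notBreaching"
  dimensions          = { LoadBalancer = aws_lb.main.arn_suffix }
  alarm_actions       = [aws_sns_topic.alerts.arn]
}

# Rejected connections — any value above 0 means dropped requests
resource "aws_cloudwatch_metric_alarm" "alb_rejected" {
  alarm_name          = "alb-rejected-connections"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "RejectedConnectionCount"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Sum"
  threshold           = 0
  treat_missing_data  = "notBreaching"
  dimensions          = { LoadBalancer = aws_lb.main.arn_suffix }
  alarm_actions       = [aws_sns_topic.alerts.arn]
}
```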
Set treat_missing_data = "notBreaching" on all Sum-based alarms (5XX counts, rejected connections). When your app has zero traffic (e.g. overnight), CloudWatch reports no data point at all — and a missing data point shouldn't trigger a 3am page.
Monitoring RDS — The One Alarm That Matters
RDS exposes dozens of metrics, but for most applications, free storage space is the one that will actually take you down. Running out of disk can corrupt your database, silently lose writes, and cause cascading failures. CPU and connection-count spikes are usually survivable — disk filling up is not.
resource "aws_cloudwatch_metric_alarm" "rds_storage" {
  alarm_name          = "rds-low-free-storage"
  alarm_description   = "RDS free storage below 1GB — risk of data loss"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 5
  metric_name         = "FreeStorageSpace"
  namespace           = "AWS/RDS"
  period              = 60
  statistic           = "Minimum" # worst case in the period
  threshold           = 1000000000 # 1 GB in bytes
  dimensions = {
    DBInstanceIdentifier = aws_db_instance.main.identifier
  }
  alarm_actions = [aws_sns_topic.alerts.arn]
}
Monitoring SQS — Two Alarms, Full Coverage
SQS queues power your background job processing. Two metrics tell you everything you need to know about queue health — without monitoring the worker EC2 instances at all.
# SQS Oldest Message Age alarm
resource "aws_cloudwatch_metric_alarm" "sqs_oldest_msg" {
  alarm_name          = "sqs-oldest-message-age"
  alarm_description   = "Jobs sitting in queue > 500s — workers may be down"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "ApproximateAgeOfOldestMessage"
  namespace           = "AWS/SQS"
  period              = 300 # 5 minutes
  statistic           = "Maximum"
  threshold           = 500
  dimensions          = { QueueName = aws_sqs_queue.jobs.name }
  alarm_actions       = [aws_sns_topic.alerts.arn]
}
# DLQ length — alert on ANY message (threshold = 0)
resource "aws_cloudwatch_metric_alarm" "dlq_length" {
  alarm_name          = "dlq-has-messages"
  alarm_description   = "Dead-letter queue not empty — job processing failures"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1 # alert immediately — no waiting
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 300
  statistic           = "Maximum"
  threshold           = 0
  treat_missing_data  = "notBreaching"
  dimensions          = { QueueName = aws_sqs_queue.jobs_dlq.name }
  alarm_actions       = [aws_sns_topic.alerts.arn]
}
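The DLQ alarm only works if failed messages actually land in the DLQ, which requires a redrive policy on the main queue. A sketch of the assumed wiring (the queue names and maxReceiveCount value are illustrative):

```hcl
resource "aws_sqs_queue" "jobs_dlq" {
  name = "jobs-dlq"
}

resource "aws_sqs_queue" "jobs" {
  name = "jobs"
  # After 3 failed receives, SQS moves the message to the DLQ
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.jobs_dlq.arn
    maxReceiveCount     = 3
  })
}
```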
Why You Can Skip EC2 Monitoring
This is the counterintuitive part that trips most people up. You're running EC2 instances — shouldn't you monitor them?
For app servers behind the ALB: any failure — OOM crash, CPU spike causing slowdowns, failed deployment — will show up as a 5XX error or high latency at the load balancer. That's your signal. Monitoring EC2 CPU at 90% doesn't tell you if customers are impacted; the ALB latency alarm does.
For worker instances processing SQS jobs: if a worker crashes or runs out of memory, messages stop being consumed. The Oldest Message Age alarm catches this within 25 minutes (5 periods × 5 minutes). If workers throw errors and keep crashing, messages pile up in the DLQ. Both scenarios are covered without a single EC2 alarm.
Budget Monitoring — Don't Skip This
Your infrastructure costs money every hour it runs. A misconfigured auto-scaling group, an accidental large instance type, or a data transfer spike can triple your bill before the end of the month. AWS Budgets lets you alert on both actual and forecasted spend so you catch runaway costs early.
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-infrastructure-budget"
  budget_type  = "COST"
  limit_amount = "1000" # set your expected monthly spend
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Alert 1: actual spend hits 80% of budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["team@yourcompany.com"]
  }

  # Alert 2: forecasted spend will exceed 100%
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED" # future projection
    subscriber_email_addresses = ["team@yourcompany.com"]
  }
}
Wiring It All Together with SNS
All alarms should route to a single SNS topic that fans out to your notification channels — email, Slack, PagerDuty, or all three. This makes it trivial to add or change notification destinations without touching any alarm definitions.
# Central SNS topic for all alarms
resource "aws_sns_topic" "alerts" {
  name = "production-alerts"
}

# Email subscription
resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "oncall@yourcompany.com"
}

# Slack via Lambda (or use AWS Chatbot)
resource "aws_sns_topic_subscription" "slack" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.slack_notifier.arn
}
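One gotcha with the Lambda subscription: subscribing the function isn't enough — SNS also needs permission to invoke it. A sketch, assuming the slack_notifier function referenced above:

```hcl
# Allow the alerts topic to invoke the Slack notifier Lambda
resource "aws_lambda_permission" "sns_invoke" {
  statement_id  = "AllowSNSInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.slack_notifier.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.alerts.arn
}
```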
The Full Alarm Set — Summary
Eight alerts — seven CloudWatch alarms plus a budget. That's all you need for a production web application on AWS. Not eighty. Not eight hundred.
Each alarm maps directly to a customer-impacting scenario: users getting errors, users experiencing slowness, users' jobs failing, the app going down from disk corruption, or costs blowing out before someone notices. Everything else is noise.