devops · Featured · Jan 5, 2026 · 10 min read

Monitoring & Alerting a Web App on AWS: The Minimal Setup That Actually Works

Stop drowning in CloudWatch noise. Here's the lean, opinionated monitoring stack — ALB, RDS, SQS, and budget alerts — that keeps your team sane and your customers happy.

Sabin Joshi
DevOps Engineer
#aws #cloudwatch #monitoring #alerting #alb #rds #sqs #devops

The Problem with Most Monitoring Setups

Most teams set up CloudWatch alarms for everything — CPU utilization, memory, network in/out, disk I/O, and every metric AWS exposes. The result? A noisy alert system that wakes engineers up at 3am for things that aren't customer-impacting, breeding alert fatigue until the alerts everyone actually needs get ignored.

The right mental model: only alert when customers are affected or when the application hits a failure it can't self-heal. Everything else can be a dashboard metric you check during working hours.

💡 You don't need to monitor EC2 instances directly for most web apps. App server failures surface as 5XX errors at the load balancer. Worker failures surface as stale SQS messages. Let the symptoms tell you something is wrong — not the cause.

Your Typical Web App Stack

This setup targets the most common AWS web application architecture: clients hitting an Application Load Balancer, which routes to EC2 app servers, backed by an RDS database and an SQS queue processed by background worker instances.

Typical Web Application Architecture on AWS
[Diagram: users → ALB (5XX, latency, and rejected-connection alarms) → EC2 app servers (no direct monitoring needed) → RDS (FreeStorageSpace alarm). Background jobs flow through an SQS queue (OldestMessage age and DLQ-length alarms) to EC2 workers (covered by the SQS alarms), with failed messages landing in a dead-letter queue. CloudWatch acts as the unified alarm layer for all monitored components.]

Monitoring the Application Load Balancer

The ALB is the front door to your entire infrastructure and the most important component to watch. It sits between your users and your backend, which makes it the single source of truth for customer-visible errors. Four alarms give you complete coverage.

5XX Errors — Load Balancer
Fires when the ALB itself fails to process a request and returns a server error. These are infrastructure-level failures, not app failures.
Metric: HTTPCode_ELB_5XX_Count
Namespace: AWS/ApplicationELB
Period: 1 minute · 5 of 5 periods
Statistic: Sum
Threshold: > 1

5XX Errors — Target
Fires when your EC2 instances return errors. This catches app-level failures — unhandled exceptions, crashes, or unhealthy deployments.
Metric: HTTPCode_Target_5XX_Count
Namespace: AWS/ApplicationELB
Period: 1 minute · 5 of 5 periods
Statistic: Sum
Threshold: > 1

Target Response Latency
High latency is a silent killer — your app is technically "up" but users are hitting spinners. Alert before they notice.
Metric: TargetResponseTime
Namespace: AWS/ApplicationELB
Period: 1 minute · 5 of 5 periods
Statistic: Average
Threshold: > 0.2 seconds

Rejected Connections
The ALB hit its connection limit and started dropping requests. This usually means your backend can't scale fast enough — users are getting immediate errors.
Metric: RejectedConnectionCount
Namespace: AWS/ApplicationELB
Period: 1 minute · 5 of 5 periods
Statistic: Sum
Threshold: > 1

All four alarms share the same dimension: LoadBalancer, set to the load balancer's ARN suffix (not its full ARN). Here's how to wire them up in Terraform:

# Terraform — ALB 5XX alarm (same pattern for all 4)
resource "aws_cloudwatch_metric_alarm" "alb_5xx_elb" {
  alarm_name          = "alb-5xx-load-balancer"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "HTTPCode_ELB_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 60   # 1 minute
  statistic           = "Sum"
  threshold           = 1
  treat_missing_data  = "notBreaching"  # no traffic = no error

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}

# Latency alarm — use p99 for high traffic (>1000 req/min)
resource "aws_cloudwatch_metric_alarm" "alb_latency" {
  alarm_name          = "alb-high-latency"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Average"   # use p99 for high-volume apps
  threshold           = 0.2         # 200ms — tune to your baseline

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
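The comment above points at p99 for high-volume apps. In Terraform, percentile statistics use the extended_statistic argument in place of statistic (the two are mutually exclusive). A sketch of the p99 variant, reusing the same aws_lb.main and SNS topic; the 0.5 s threshold is an assumption to tune against your own baseline:

```hcl
# p99 latency variant: extended_statistic replaces statistic
resource "aws_cloudwatch_metric_alarm" "alb_latency_p99" {
  alarm_name          = "alb-high-latency-p99"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  extended_statistic  = "p99"  # cannot be combined with statistic
  threshold           = 0.5    # p99 runs higher than the average; tune to baseline

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
```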
⚠️ Set treat_missing_data = "notBreaching" on all Sum-based alarms (5XX, rejected connections). If your app has zero traffic (e.g. overnight), a missing data point shouldn't trigger a 3am page.

Monitoring RDS — The One Alarm That Matters

RDS exposes dozens of metrics, but for most applications free storage space is the one that will actually take you down. Running out of disk can corrupt your database, silently lose writes, and cause cascading failures. CPU and connection-count spikes are usually survivable; disk filling up is not.

RDS Free Storage Space Alarm — Threshold Anatomy
[Diagram: as used storage grows toward the total allocated size, FreeStorageSpace shrinks; the alarm fires when it drops below 1 GB (~1,000,000,000 bytes). Metric: FreeStorageSpace · Statistic: Minimum · Threshold: < 1,000,000,000 bytes]
resource "aws_cloudwatch_metric_alarm" "rds_storage" {
  alarm_name          = "rds-low-free-storage"
  alarm_description   = "RDS free storage below 1GB — risk of data loss"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 5
  metric_name         = "FreeStorageSpace"
  namespace           = "AWS/RDS"
  period              = 60
  statistic           = "Minimum"  # worst case in the period
  threshold           = 1000000000  # 1 GB in bytes

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.main.identifier
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
💡 Enable RDS Storage Auto Scaling as a safety net alongside this alarm. Set the alarm to trigger at 1GB free — it gives you time to react before Auto Scaling kicks in or you need to manually resize. Never rely on Auto Scaling alone without an alarm.
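In Terraform, Storage Auto Scaling is a single argument on the DB instance: setting max_allocated_storage above allocated_storage enables it. A minimal sketch; the sizes here are placeholder assumptions:

```hcl
resource "aws_db_instance" "main" {
  # ... engine, instance_class, credentials, etc. ...
  allocated_storage     = 100  # starting size in GB
  max_allocated_storage = 500  # enables Storage Auto Scaling up to this ceiling
}
```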

Monitoring SQS — Two Alarms, Full Coverage

SQS queues power your background job processing. Two metrics tell you everything you need to know about queue health without needing to monitor the worker EC2 instances at all.

SQS Queue Health — Two Alarm Scenarios
[Diagram: Scenario 1 — workers are slow or down, messages pile up in the queue, and the alarm fires when the oldest message exceeds 500 s. Scenario 2 — messages fail after 3 retries and land in the dead-letter queue, and the alarm fires when DLQ length > 0. Both alarms route to an SNS topic feeding email, Slack, or PagerDuty.]
Oldest Message Age
When jobs sit in the queue too long, workers aren't keeping up — possibly slow, crashed, or under-scaled. This catches the problem before the backlog becomes unrecoverable.
Metric: ApproximateAgeOfOldestMessage
Namespace: AWS/SQS
Period: 5 minutes · 5 of 5 periods
Statistic: Maximum
Threshold: > 500 seconds

Dead-Letter Queue Length
Any message in the DLQ is a job that failed permanently — after all retries. Even one DLQ message means something in your processing pipeline is broken and needs investigation.
Metric: ApproximateNumberOfMessagesVisible
Namespace: AWS/SQS
Period: 5 minutes · 5 of 5 periods
Dimension: QueueName = my-dlq
Threshold: > 0
# SQS Oldest Message Age alarm
resource "aws_cloudwatch_metric_alarm" "sqs_oldest_msg" {
  alarm_name          = "sqs-oldest-message-age"
  alarm_description   = "Jobs sitting in queue > 500s — workers may be down"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "ApproximateAgeOfOldestMessage"
  namespace           = "AWS/SQS"
  period              = 300  # 5 minutes
  statistic           = "Maximum"
  threshold           = 500
  dimensions = { QueueName = aws_sqs_queue.jobs.name }
  alarm_actions = [aws_sns_topic.alerts.arn]
}

# DLQ length — alert on ANY message (threshold = 0)
resource "aws_cloudwatch_metric_alarm" "dlq_length" {
  alarm_name          = "dlq-has-messages"
  alarm_description   = "Dead-letter queue not empty — job processing failures"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1   # alert immediately — no waiting
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 300
  statistic           = "Maximum"
  threshold           = 0
  treat_missing_data  = "notBreaching"
  dimensions = { QueueName = aws_sqs_queue.jobs_dlq.name }
  alarm_actions = [aws_sns_topic.alerts.arn]
}
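The DLQ alarm only works if failed messages actually reach the DLQ, and that routing lives in the source queue's redrive policy. A sketch matching the three-retry behavior described above (queue names are placeholders):

```hcl
# Route messages to the DLQ after 3 failed receives
resource "aws_sqs_queue" "jobs_dlq" {
  name = "jobs-dlq"
}

resource "aws_sqs_queue" "jobs" {
  name = "jobs"
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.jobs_dlq.arn
    maxReceiveCount     = 3  # receives before a message moves to the DLQ
  })
}
```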

Why You Can Skip EC2 Monitoring

This is the counterintuitive part that trips most people up. You're running EC2 instances — shouldn't you monitor them?

For app servers behind the ALB: any failure — OOM crash, CPU spike causing slowdowns, failed deployment — will show up as a 5XX error or high latency at the load balancer. That's your signal. Monitoring EC2 CPU at 90% doesn't tell you if customers are impacted; the ALB latency alarm does.

For worker instances processing SQS jobs: if a worker crashes or runs out of memory, messages stop being consumed. The Oldest Message Age alarm catches this within 25 minutes (5 periods × 5 minutes). If workers throw errors and keep crashing, messages pile up in the DLQ. Both scenarios are covered without a single EC2 alarm.

ℹ️ This doesn't mean you ignore EC2 metrics forever. Once an alarm fires and you're investigating, pull up EC2 CPU, memory (via CloudWatch Agent), and disk metrics in the console. But these are diagnostic tools during incidents, not proactive alarms.

Budget Monitoring — Don't Skip This

Your infrastructure costs money every hour it runs. A misconfigured auto-scaling group, an accidental large instance type, or a data transfer spike can triple your bill before the end of the month. AWS Budgets lets you alert on both actual and forecasted spend so you catch runaway costs early.

AWS Budgets — Actual vs Forecasted Spend Alert Model
[Diagram: actual month-to-date spend tracks below the monthly budget line, but the forecasted end-of-month spend crosses the budget mid-month, so the forecast alert fires days before the budget is actually exceeded.]
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-infrastructure-budget"
  budget_type  = "COST"
  limit_amount = "1000"  # set your expected monthly spend
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Alert 1: actual spend hits 80% of budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["team@yourcompany.com"]
  }

  # Alert 2: forecasted spend will exceed 100%
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"  # future projection
    subscriber_email_addresses = ["team@yourcompany.com"]
  }
}

Wiring It All Together with SNS

All alarms should route to a single SNS topic that fans out to your notification channels — email, Slack, PagerDuty, or all three. This makes it trivial to add or change notification destinations without touching any alarm definitions.

# Central SNS topic for all alarms
resource "aws_sns_topic" "alerts" {
  name = "production-alerts"
}

# Email subscription
resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "oncall@yourcompany.com"
}

# Slack via Lambda (or use AWS Chatbot)
resource "aws_sns_topic_subscription" "slack" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.slack_notifier.arn
}
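One wiring detail the subscription alone doesn't cover: SNS needs explicit permission to invoke the Lambda, or deliveries will silently fail. A sketch, assuming the same slack_notifier function:

```hcl
# Allow the SNS topic to invoke the Slack notifier Lambda
resource "aws_lambda_permission" "sns_invoke_slack" {
  statement_id  = "AllowSNSInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.slack_notifier.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.alerts.arn
}
```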

The Full Alarm Set — Summary

Eight alarms. That's all you need for a production web application on AWS. Not eighty. Not eight hundred.

4 ALB alarms · 1 RDS alarm · 2 SQS alarms · 1 budget alert

Each alarm maps directly to a customer-impacting scenario: users getting errors, users experiencing slowness, users' jobs failing, the app going down from disk corruption, or costs blowing out before someone notices. Everything else is noise.

💡 Start with these eight alarms, tune the thresholds to your application's baseline over the first two weeks, then resist the temptation to add more. Every additional alarm is a pager notification that trains your team to be a little less responsive to the ones that matter.