devops · Featured · Jan 5, 2026 · 10 min read

Monitoring & Alerting a Web App on AWS: The Minimal Setup That Actually Works

Stop drowning in CloudWatch noise. Here's the lean, opinionated monitoring stack — ALB, RDS, SQS, and budget alerts — that keeps your team sane and your customers happy.

Sabin Joshi
DevOps Engineer
#aws #cloudwatch #monitoring #alerting #alb #rds #sqs #devops

The Problem with Most Monitoring Setups

Most teams set up CloudWatch alarms for everything — CPU utilization, memory, network in/out, disk I/O, and every metric AWS exposes. The result? A noisy alert system that wakes engineers up at 3am for things that aren't customer-impacting, breeding alert fatigue until the alerts everyone actually needs get ignored.

The right mental model: only alert when customers are affected or when the application hits a failure it can't self-heal. Everything else can be a dashboard metric you check during working hours.

💡 You don't need to monitor EC2 instances directly for most web apps. App server failures surface as 5XX errors at the load balancer. Worker failures surface as stale SQS messages. Let the symptoms tell you something is wrong — not the cause.

Your Typical Web App Stack

This setup targets the most common AWS web application architecture: clients hitting an Application Load Balancer, which routes to EC2 app servers, backed by an RDS database and an SQS queue processed by background worker instances.

Typical Web Application Architecture on AWS
[Diagram: users → ALB (5XX, latency, and rejected-connection alarms) → EC2 app servers (no direct monitoring needed) → RDS (FreeStorageSpace alarm). Background jobs flow through an SQS queue (OldestMessage age and DLQ-length alarms) to EC2 workers (covered by the SQS alarms), with failed messages landing in a dead-letter queue. CloudWatch acts as the unified alarm layer for all monitored components.]

Monitoring the Application Load Balancer

The ALB is the front door to your entire infrastructure and the most important component to watch. It sits between your users and your backend, which makes it the single source of truth for customer-visible errors. Four alarms give you complete coverage.

5XX Errors — Load Balancer
Fires when the ALB itself fails to process a request and returns a server error. These are infrastructure-level failures, not app failures.
Metric: HTTPCode_ELB_5XX_Count
Namespace: AWS/ApplicationELB
Period: 1 minute · 5 of 5 periods
Statistic: Sum
Threshold: > 1

5XX Errors — Target
Fires when your EC2 instances return errors. This catches app-level failures — unhandled exceptions, crashes, or unhealthy deployments.
Metric: HTTPCode_Target_5XX_Count
Namespace: AWS/ApplicationELB
Period: 1 minute · 5 of 5 periods
Statistic: Sum
Threshold: > 1

Target Response Latency
High latency is a silent killer — your app is technically "up" but users are hitting spinners. Alert before they notice.
Metric: TargetResponseTime
Namespace: AWS/ApplicationELB
Period: 1 minute · 5 of 5 periods
Statistic: Average
Threshold: > 0.2 seconds

Rejected Connections
The ALB hit its connection limit and started dropping requests. This usually means your backend can't scale fast enough — users are getting immediate errors.
Metric: RejectedConnectionCount
Namespace: AWS/ApplicationELB
Period: 1 minute · 5 of 5 periods
Statistic: Sum
Threshold: > 1

All four alarms share the same dimension: LoadBalancer, set to the load balancer's ARN suffix (not its full ARN). Here's how to wire them up in Terraform:

# Terraform — ALB 5XX alarm (same pattern for all 4)
resource "aws_cloudwatch_metric_alarm" "alb_5xx_elb" {
  alarm_name          = "alb-5xx-load-balancer"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "HTTPCode_ELB_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 60   # 1 minute
  statistic           = "Sum"
  threshold           = 1
  treat_missing_data  = "notBreaching"  # no traffic = no error

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}

# Latency alarm — use p99 for high traffic (>1000 req/min)
resource "aws_cloudwatch_metric_alarm" "alb_latency" {
  alarm_name          = "alb-high-latency"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Average"   # use p99 for high-volume apps
  threshold           = 0.2         # 200ms — tune to your baseline

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
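The comment above points at p99 for high-volume apps. In Terraform, percentile statistics use the extended_statistic argument in place of statistic (the two are mutually exclusive). A sketch of the p99 variant, reusing the same aws_lb.main and SNS topic; the 0.5 s threshold is an assumption to tune against your own baseline:

```hcl
# p99 latency variant: extended_statistic replaces statistic
resource "aws_cloudwatch_metric_alarm" "alb_latency_p99" {
  alarm_name          = "alb-high-latency-p99"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  extended_statistic  = "p99"  # cannot be combined with statistic
  threshold           = 0.5    # p99 runs higher than the average; tune to baseline

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
```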
⚠️ Set treat_missing_data = "notBreaching" on all Sum-based alarms (5XX, rejected connections). If your app has zero traffic (e.g. overnight), a missing data point shouldn't trigger a 3am page.

Monitoring RDS — The One Alarm That Matters

RDS exposes dozens of metrics, but for most applications free storage space is the one that will actually take you down. Running out of disk can corrupt your database, silently lose writes, and cause cascading failures. CPU and connection-count spikes are usually survivable; disk filling up is not.

RDS Free Storage Space Alarm — Threshold Anatomy
[Diagram: as used storage grows toward the total allocated size, FreeStorageSpace shrinks; the alarm fires when it drops below 1 GB (~1,000,000,000 bytes). Metric: FreeStorageSpace · Statistic: Minimum · Threshold: < 1,000,000,000 bytes]
resource "aws_cloudwatch_metric_alarm" "rds_storage" {
  alarm_name          = "rds-low-free-storage"
  alarm_description   = "RDS free storage below 1GB — risk of data loss"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 5
  metric_name         = "FreeStorageSpace"
  namespace           = "AWS/RDS"
  period              = 60
  statistic           = "Minimum"  # worst case in the period
  threshold           = 1000000000  # 1 GB in bytes

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.main.identifier
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
💡 Enable RDS Storage Auto Scaling as a safety net alongside this alarm. Set the alarm to trigger at 1GB free — it gives you time to react before Auto Scaling kicks in or you need to manually resize. Never rely on Auto Scaling alone without an alarm.
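In Terraform, Storage Auto Scaling is a single argument on the DB instance: setting max_allocated_storage above allocated_storage enables it. A minimal sketch; the sizes here are placeholder assumptions:

```hcl
resource "aws_db_instance" "main" {
  # ... engine, instance_class, credentials, etc. ...
  allocated_storage     = 100  # starting size in GB
  max_allocated_storage = 500  # enables Storage Auto Scaling up to this ceiling
}
```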

Monitoring SQS — Two Alarms, Full Coverage

SQS queues power your background job processing. Two metrics tell you everything you need to know about queue health without needing to monitor the worker EC2 instances at all.

SQS Queue Health — Two Alarm Scenarios
[Diagram: Scenario 1 — workers are slow or down, messages pile up in the queue, and the alarm fires when the oldest message exceeds 500 s. Scenario 2 — messages fail after 3 retries and land in the dead-letter queue, and the alarm fires when DLQ length > 0. Both alarms route to an SNS topic feeding email, Slack, or PagerDuty.]
Oldest Message Age
When jobs sit in the queue too long, workers aren't keeping up — possibly slow, crashed, or under-scaled. This catches the problem before the backlog becomes unrecoverable.
Metric: ApproximateAgeOfOldestMessage
Namespace: AWS/SQS
Period: 5 minutes · 5 of 5 periods
Statistic: Maximum
Threshold: > 500 seconds

Dead-Letter Queue Length
Any message in the DLQ is a job that failed permanently — after all retries. Even one DLQ message means something in your processing pipeline is broken and needs investigation.
Metric: ApproximateNumberOfMessagesVisible
Namespace: AWS/SQS
Period: 5 minutes · 5 of 5 periods
Dimension: QueueName = my-dlq
Threshold: > 0
# SQS Oldest Message Age alarm
resource "aws_cloudwatch_metric_alarm" "sqs_oldest_msg" {
  alarm_name          = "sqs-oldest-message-age"
  alarm_description   = "Jobs sitting in queue > 500s — workers may be down"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  metric_name         = "ApproximateAgeOfOldestMessage"
  namespace           = "AWS/SQS"
  period              = 300  # 5 minutes
  statistic           = "Maximum"
  threshold           = 500
  dimensions = { QueueName = aws_sqs_queue.jobs.name }
  alarm_actions = [aws_sns_topic.alerts.arn]
}

# DLQ length — alert on ANY message (threshold = 0)
resource "aws_cloudwatch_metric_alarm" "dlq_length" {
  alarm_name          = "dlq-has-messages"
  alarm_description   = "Dead-letter queue not empty — job processing failures"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1   # alert immediately — no waiting
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 300
  statistic           = "Maximum"
  threshold           = 0
  treat_missing_data  = "notBreaching"
  dimensions = { QueueName = aws_sqs_queue.jobs_dlq.name }
  alarm_actions = [aws_sns_topic.alerts.arn]
}
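The DLQ alarm only works if failed messages actually reach the DLQ, and that routing lives in the source queue's redrive policy. A sketch matching the three-retry behavior described above (queue names are placeholders):

```hcl
# Route messages to the DLQ after 3 failed receives
resource "aws_sqs_queue" "jobs_dlq" {
  name = "jobs-dlq"
}

resource "aws_sqs_queue" "jobs" {
  name = "jobs"
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.jobs_dlq.arn
    maxReceiveCount     = 3  # receives before a message moves to the DLQ
  })
}
```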

Why You Can Skip EC2 Monitoring

This is the counterintuitive part that trips most people up. You're running EC2 instances — shouldn't you monitor them?

For app servers behind the ALB: any failure — OOM crash, CPU spike causing slowdowns, failed deployment — will show up as a 5XX error or high latency at the load balancer. That's your signal. Monitoring EC2 CPU at 90% doesn't tell you if customers are impacted; the ALB latency alarm does.

For worker instances processing SQS jobs: if a worker crashes or runs out of memory, messages stop being consumed. The Oldest Message Age alarm catches this within 25 minutes (5 periods × 5 minutes). If workers throw errors and keep crashing, messages pile up in the DLQ. Both scenarios are covered without a single EC2 alarm.

ℹ️ This doesn't mean you ignore EC2 metrics forever. Once an alarm fires and you're investigating, pull up EC2 CPU, memory (via CloudWatch Agent), and disk metrics in the console. But these are diagnostic tools during incidents, not proactive alarms.

Budget Monitoring — Don't Skip This

Your infrastructure costs money every hour it runs. A misconfigured auto-scaling group, an accidental large instance type, or a data transfer spike can triple your bill before the end of the month. AWS Budgets lets you alert on both actual and forecasted spend so you catch runaway costs early.

AWS Budgets — Actual vs Forecasted Spend Alert Model
[Diagram: actual month-to-date spend tracks below the monthly budget line, but the forecasted end-of-month spend crosses the budget mid-month, so the forecast alert fires days before the budget is actually exceeded.]
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-infrastructure-budget"
  budget_type  = "COST"
  limit_amount = "1000"  # set your expected monthly spend
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Alert 1: actual spend hits 80% of budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["team@yourcompany.com"]
  }

  # Alert 2: forecasted spend will exceed 100%
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"  # future projection
    subscriber_email_addresses = ["team@yourcompany.com"]
  }
}

Wiring It All Together with SNS

All alarms should route to a single SNS topic that fans out to your notification channels — email, Slack, PagerDuty, or all three. This makes it trivial to add or change notification destinations without touching any alarm definitions.

# Central SNS topic for all alarms
resource "aws_sns_topic" "alerts" {
  name = "production-alerts"
}

# Email subscription
resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "oncall@yourcompany.com"
}

# Slack via Lambda (or use AWS Chatbot)
resource "aws_sns_topic_subscription" "slack" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.slack_notifier.arn
}
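One wiring detail the subscription alone doesn't cover: SNS needs explicit permission to invoke the Lambda, or deliveries will silently fail. A sketch, assuming the same slack_notifier function:

```hcl
# Allow the SNS topic to invoke the Slack notifier Lambda
resource "aws_lambda_permission" "sns_invoke_slack" {
  statement_id  = "AllowSNSInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.slack_notifier.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.alerts.arn
}
```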

The Full Alarm Set — Summary

Eight alarms. That's all you need for a production web application on AWS. Not eighty. Not eight hundred.

4 ALB alarms · 1 RDS alarm · 2 SQS alarms · 1 budget alert

Each alarm maps directly to a customer-impacting scenario: users getting errors, users experiencing slowness, users' jobs failing, the app going down from disk corruption, or costs blowing out before someone notices. Everything else is noise.

💡 Start with these eight alarms, tune the thresholds to your application's baseline over the first two weeks, then resist the temptation to add more. Every additional alarm is a pager notification that trains your team to be a little less responsive to the ones that matter.