Production Observability Stack: Logs, Metrics, Traces on AWS
Building a unified observability platform on AWS — structured logging with CloudWatch, distributed tracing with X-Ray and Jaeger, metrics with Prometheus + Grafana, and effective alerting that reduces alert fatigue.
The Three Pillars
Observability is more than monitoring. Monitoring tells you when something is broken; observability tells you why. The three pillars — logs, metrics, and traces — are most powerful when correlated: a spike in the error rate (metrics) leads you to a trace showing which service call is failing (traces), which links to the actual error message (logs).
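The last hop of that correlation is a log query keyed on the trace ID. A sketch in CloudWatch Logs Insights syntax, assuming the structured-logging convention covered later in this post (the trace ID value here is a made-up example):

```
fields @timestamp, severity, event, @message
| filter traceId = "4bf92f3577b34da6a3ce929d0e0e4736"
| sort @timestamp asc
```

One query returns every log line any service emitted while handling that request, in order.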
Observability Architecture
OpenTelemetry — Instrument Once, Export Anywhere
We use OpenTelemetry (OTel) as the single instrumentation layer. Application code emits traces, metrics, and logs via OTel SDKs. The OTel Collector receives all signals and fans them out to CloudWatch, Prometheus, and X-Ray. If we swap backends, application code doesn't change.
# OTel Collector config — fan-out to all backends
receivers:
  otlp:
    protocols: { grpc: { endpoint: "0.0.0.0:4317" } }

exporters:
  awsxray: {}
  prometheusremotewrite:
    endpoint: "https://aps.us-east-1.amazonaws.com/workspaces/ws-xxx/api/v1/remote_write"
  awscloudwatchlogs:
    log_group_name: /app/services

service:
  pipelines:
    traces: { receivers: [otlp], exporters: [awsxray] }
    metrics: { receivers: [otlp], exporters: [prometheusremotewrite] }
    logs: { receivers: [otlp], exporters: [awscloudwatchlogs] }
Structured Logging
Unstructured logs are grep-only. Structured JSON logs are queryable, filterable, and parseable by machines. Every log line should include: timestamp, severity, service name, trace ID, and the specific fields relevant to the event.
// Structured log with correlation IDs
logger.info({
  event: "order.created",
  orderId: order.id,
  userId: user.id,
  amount: order.total,
  traceId: context.traceId, // link to distributed trace
  durationMs: elapsed,
});
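A thin wrapper can enforce the required fields so no log line ships without them. A sketch — the `createLogger` factory and its shape are illustrative, not a specific library:

```javascript
// Minimal structured logger: every line is one JSON object carrying the
// mandatory fields (timestamp, severity, service, traceId) plus event data.
function createLogger(service, { write = (line) => process.stdout.write(line + "\n") } = {}) {
  const emit = (severity) => (fields) =>
    write(JSON.stringify({
      timestamp: new Date().toISOString(),
      severity,
      service,
      traceId: "unknown", // overridden below when the caller supplies one
      ...fields,
    }));
  return { info: emit("info"), warn: emit("warn"), error: emit("error") };
}
```

Usage mirrors the example above: `createLogger("orders").info({ event: "order.created", traceId: context.traceId })`. The injectable `write` function also makes the logger trivial to assert against in tests.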
Alert Design: Reduce Fatigue
Alert fatigue kills on-call culture. Every alert that fires at 3am and turns out to be nothing trains engineers to ignore alerts. Our rules: only alert on symptoms, not causes (alert on high error rate, not high CPU), use multi-window burn rate for SLO-based alerts, and require a runbook link on every alert.
# SLO-based alert using burn rate (Prometheus)
- alert: HighErrorBurnRate
  expr: >
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > 14.4 * 0.001 # 14.4× burn rate on 99.9% SLO
  for: 2m
  labels: { severity: critical }
  annotations:
    runbook: https://wiki.internal/runbooks/high-error-rate
    summary: "Error budget burning fast — check traces in Grafana"
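Where does 14.4 come from? A 99.9% SLO leaves a 0.1% error budget, and a burn rate of 14.4 means the whole 30-day budget would be gone in about two days — fast enough to justify paging. The arithmetic, sketched out:

```javascript
// Burn rate = observed error ratio / allowed error ratio (the budget).
const sloTarget = 0.999;           // 99.9% SLO
const errorBudget = 1 - sloTarget; // 0.001 allowed error ratio

// Threshold from the alert rule: page when error ratio > burnRate * budget.
const burnRate = 14.4;
const alertThreshold = burnRate * errorBudget; // 0.0144 → page at 1.44% errors

// At that rate, time to exhaust a 30-day budget:
const hoursToExhaust = (30 * 24) / burnRate; // 720 / 14.4 ≈ 50 hours ≈ 2 days
```

A full multi-window setup pairs this long window with a short one (e.g. 5m at the same burn rate) so the alert clears quickly once the bleeding stops.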