Production Observability Stack: Logs, Metrics, Traces on AWS
Building a unified observability platform on AWS — structured logging with CloudWatch, distributed tracing with X-Ray and Jaeger, metrics with Prometheus + Grafana, and effective alerting that reduces alert fatigue.
The Three Pillars
Observability is more than monitoring. Monitoring tells you when something is broken; observability tells you why. The three pillars — logs, metrics, and traces — are most powerful when correlated: a spike in the error rate (metrics) leads you to a trace showing which service call is failing (traces), which links to the actual error message (logs).
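The last hop of that correlation is a log query keyed on the trace ID. A sketch in CloudWatch Logs Insights syntax, assuming the structured-logging convention covered later in this post (the trace ID value here is a made-up example):

```
fields @timestamp, severity, event, @message
| filter traceId = "4bf92f3577b34da6a3ce929d0e0e4736"
| sort @timestamp asc
```

One query returns every log line any service emitted while handling that request, in order.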
Observability Architecture
OpenTelemetry — Instrument Once, Export Anywhere
We use OpenTelemetry (OTel) as the single instrumentation layer. Application code emits traces, metrics, and logs via OTel SDKs. The OTel Collector receives all signals and fans them out to CloudWatch, Prometheus, and X-Ray. If we swap backends, application code doesn't change.
# OTel Collector config — fan-out to all backends
receivers:
  otlp:
    protocols: { grpc: { endpoint: "0.0.0.0:4317" } }

exporters:
  awsxray: {}
  prometheusremotewrite:
    endpoint: "https://aps.us-east-1.amazonaws.com/workspaces/ws-xxx/api/v1/remote_write"
  awscloudwatchlogs:
    log_group_name: /app/services

service:
  pipelines:
    traces: { receivers: [otlp], exporters: [awsxray] }
    metrics: { receivers: [otlp], exporters: [prometheusremotewrite] }
    logs: { receivers: [otlp], exporters: [awscloudwatchlogs] }
Structured Logging
Unstructured logs are grep-only. Structured JSON logs are queryable, filterable, and parseable by machines. Every log line should include: timestamp, severity, service name, trace ID, and the specific fields relevant to the event.
// Structured log with correlation IDs
logger.info({
  event: "order.created",
  orderId: order.id,
  userId: user.id,
  amount: order.total,
  traceId: context.traceId, // link to distributed trace
  durationMs: elapsed,
});
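A thin wrapper can enforce the required fields so no log line ships without them. A sketch — the `createLogger` factory and its shape are illustrative, not a specific library:

```javascript
// Minimal structured logger: every line is one JSON object carrying the
// mandatory fields (timestamp, severity, service, traceId) plus event data.
function createLogger(service, { write = (line) => process.stdout.write(line + "\n") } = {}) {
  const emit = (severity) => (fields) =>
    write(JSON.stringify({
      timestamp: new Date().toISOString(),
      severity,
      service,
      traceId: "unknown", // overridden below when the caller supplies one
      ...fields,
    }));
  return { info: emit("info"), warn: emit("warn"), error: emit("error") };
}
```

Usage mirrors the example above: `createLogger("orders").info({ event: "order.created", traceId: context.traceId })`. The injectable `write` function also makes the logger trivial to assert against in tests.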
Alert Design: Reduce Fatigue
Alert fatigue kills on-call culture. Every alert that fires at 3am and turns out to be nothing trains engineers to ignore alerts. Our rules: only alert on symptoms, not causes (alert on high error rate, not high CPU), use multi-window burn rate for SLO-based alerts, and require a runbook link on every alert.
# SLO-based alert using burn rate (Prometheus)
- alert: HighErrorBurnRate
  expr: >
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > 14.4 * 0.001 # 14.4× burn rate on 99.9% SLO
  for: 2m
  labels: { severity: critical }
  annotations:
    runbook: https://wiki.internal/runbooks/high-error-rate
    summary: "Error budget burning fast — check traces in Grafana"
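Where does 14.4 come from? A 99.9% SLO leaves a 0.1% error budget, and a burn rate of 14.4 means the whole 30-day budget would be gone in about two days — fast enough to justify paging. The arithmetic, sketched out:

```javascript
// Burn rate = observed error ratio / allowed error ratio (the budget).
const sloTarget = 0.999;           // 99.9% SLO
const errorBudget = 1 - sloTarget; // 0.001 allowed error ratio

// Threshold from the alert rule: page when error ratio > burnRate * budget.
const burnRate = 14.4;
const alertThreshold = burnRate * errorBudget; // 0.0144 → page at 1.44% errors

// At that rate, time to exhaust a 30-day budget:
const hoursToExhaust = (30 * 24) / burnRate; // 720 / 14.4 ≈ 50 hours ≈ 2 days
```

A full multi-window setup pairs this long window with a short one (e.g. 5m at the same burn rate) so the alert clears quickly once the bleeding stops.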