AWS DevOps Agent is Generally Available — and It's a Big Deal
AWS just shipped an autonomous SRE that investigates incidents, learns your runbooks, indexes your code, and integrates with your entire observability stack — let's break down what actually changed from Preview to GA.
The Short Version
On March 31, 2026, AWS announced the general availability of AWS DevOps Agent — an autonomous operations agent that investigates incidents, prevents future ones, and handles on-demand SRE tasks across AWS, Azure, and on-premises environments. If you were in the preview, the GA release adds multicloud support, a Triage Agent, PagerDuty + Grafana integrations, private MCP, and six-region availability.
For those of us who've spent our share of 2AM hours staring at CloudWatch dashboards, trying to correlate five different tools, this one hits different.
The Problem It Actually Solves
Let me paint the picture every ops engineer knows: 3AM alert fires, you open five browser tabs — CloudWatch, your APM tool, the deployment pipeline, GitHub, and a Slack thread from a week ago. You're mentally joining data across all of them trying to figure out if this correlates with last Tuesday's deploy. That's operational toil. It's slow, error-prone, and it burns out good engineers.
AWS DevOps Agent tackles this by acting as a senior DevOps engineer who already knows your system. It learns your application topology, your runbooks, your deployment patterns, and your historical incident data — then investigates autonomously the moment an alert fires.
How It Actually Works
AWS DevOps Agent covers the full incident lifecycle. Here's the breakdown:
1. Autonomous Incident Response
The moment an alert fires — PagerDuty, CloudWatch, Datadog, whatever — the agent starts investigating without waiting for a human. It correlates telemetry, deployment history, code changes, and runbook knowledge to identify root cause and generate a mitigation plan. The output is an investigation journal: a timestamped, step-by-step record of what the agent found and why.
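AWS hasn't published the journal format here, but conceptually an entry reads something like the following. This excerpt is purely illustrative (the alert, service names, and findings are all made up):

```
[02:14:07] Alert received: p99 latency breach on checkout-api (PagerDuty #4812)
[02:14:22] Correlated CloudWatch metrics: Lambda throttles on checkout-worker up 6x
[02:15:40] Deployment history: checkout-worker deployed 42 minutes before alert
[02:16:05] Hypothesis: reserved concurrency lowered in latest deploy; checking config diff
[02:17:30] Confirmed: reservedConcurrentExecutions reduced 100 -> 10 in that deploy
[02:17:55] Proposed mitigation: restore reserved concurrency; rollback plan attached
```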
WGU's SRE team saw a real production investigation shrink from an estimated two hours to 28 minutes using this. The agent pinpointed a Lambda function configuration issue and surfaced operational knowledge buried in internal documentation that the team hadn't found manually.
2. Proactive Incident Prevention
After enough incident data, the agent switches from reactive to proactive. It analyzes patterns across your historical incidents and surfaces targeted recommendations — things like "this service fails every time memory utilization crosses 78%" or "IAM permission gaps in this role have caused 3 of your last 5 incidents."
This is the feature I'm most excited about long-term. Firefighting is expensive. Prevention compounds.
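You can approximate the simplest version of this pattern analysis yourself today. Here's a rough sketch, using hypothetical incident records and a plain frequency count, of the kind of signal the agent is presumably mining at far greater depth:

```python
from collections import Counter

# Hypothetical incident history; the agent builds this from your real tickets.
incidents = [
    {"service": "payments", "cause": "iam-permission-gap"},
    {"service": "payments", "cause": "memory-pressure"},
    {"service": "search",   "cause": "iam-permission-gap"},
    {"service": "payments", "cause": "iam-permission-gap"},
    {"service": "checkout", "cause": "connection-pool-exhaustion"},
]

# Count recurring root causes; anything that repeats is a prevention candidate.
by_cause = Counter(i["cause"] for i in incidents)
for cause, n in by_cause.most_common():
    if n >= 2:
        print(f"{cause}: {n} incidents -> worth a proactive fix")
```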
3. On-Demand SRE Tasks
Think of this as a conversational interface to your entire infrastructure. Query deployment history, alarm status, resource configurations, and incident patterns in plain English. You can build and save custom charts, share reports, and ask things like:
```
# Real queries you can run against your environment
"What changed in the payment service in the last 24 hours?"
"Show me all ECS services with error rates above 1% this week"
"Which Lambda functions had cold start times over 500ms yesterday?"
"What was the root cause of last Tuesday's API timeout incident?"
```
What's New in GA (vs Preview)
The preview was already impressive, but GA brings some meaningful upgrades that make this production-ready at scale.
Triage Agent
This is immediately useful if you're dealing with noisy alert environments. The Triage Agent automatically assesses incident severity and detects duplicate tickets — when it finds duplicates, it links them to the main investigation with a LINKED status so they won't spawn parallel investigations. If your on-call team gets 40 alerts during an outage and 35 of them are the same downstream symptom, this saves significant cognitive load.
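AWS hasn't documented how the Triage Agent fingerprints duplicates, but the core idea is familiar from alert-dedup tooling. Here's a toy version, assuming (my assumption, not the agent's actual logic) that duplicates share a service and symptom within a time window:

```python
from datetime import datetime, timedelta

# Hypothetical alert records; field names are illustrative, not the agent's schema.
alerts = [
    {"id": "A1", "service": "checkout", "symptom": "5xx-spike", "at": datetime(2026, 3, 31, 2, 14)},
    {"id": "A2", "service": "checkout", "symptom": "5xx-spike", "at": datetime(2026, 3, 31, 2, 16)},
    {"id": "A3", "service": "search",   "symptom": "latency",   "at": datetime(2026, 3, 31, 2, 17)},
]

WINDOW = timedelta(minutes=15)
primary = {}  # fingerprint -> first alert seen

for alert in alerts:
    key = (alert["service"], alert["symptom"])
    first = primary.get(key)
    if first and alert["at"] - first["at"] <= WINDOW:
        # Duplicate: link to the main investigation instead of opening a new one.
        print(f'{alert["id"]} -> LINKED to {first["id"]}')
    else:
        primary[key] = alert
        print(f'{alert["id"]} -> new investigation')
```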
Code Indexing
The agent now indexes your application code repositories. This is a big deal for root cause accuracy — instead of stopping at "elevated error rate in Lambda function X," it can now say "elevated error rate in Lambda function X, likely caused by this unhandled exception on line 47 of handler.py, introduced in commit abc123 two hours ago."
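You can already do a crude, manual version of that last correlation step with git. This sketch just lists recent commits touching a file so a human can eyeball the likely culprit; the agent presumably goes much further, parsing stack traces against the indexed code. The path is hypothetical:

```python
import subprocess

def recent_commits(path: str, hours: int = 24) -> list[str]:
    """List commits touching `path` in the last `hours` hours (requires git)."""
    out = subprocess.run(
        ["git", "log", f"--since={hours} hours ago", "--oneline", "--", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

# Hypothetical path: the handler your error-rate alarm points at.
for line in recent_commits("src/handler.py"):
    print(line)
```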
Learned Skills & Custom Skills
Over time, the agent builds skills from how your team resolves incidents — your unique patterns, tool preferences, and topology. You can also inject custom skills: runbooks, investigation procedures, and institutional knowledge specific to your systems.
```yaml
# Example: Custom skill for database incidents
# Create once, applied automatically across all relevant investigations
skill: "postgres-connection-pool-exhaustion"
trigger: ["FATAL: remaining connection slots", "max_connections"]
applies_to: ["Incident RCA", "Incident Mitigation"]
steps:
  - check_pg_stat_activity    # who's holding connections
  - check_rds_metrics         # DatabaseConnections CW metric
  - review_pgbouncer_config   # pool sizing
  - check_recent_deploys      # connection leak in new code?
```
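However skills are actually encoded under the hood, the trigger concept boils down to matching patterns against incident context. A toy sketch of that idea, with an invented log excerpt:

```python
# Toy skill matcher: substring triggers over log lines (illustrative only).
skill = {
    "name": "postgres-connection-pool-exhaustion",
    "triggers": ["FATAL: remaining connection slots", "max_connections"],
}

log_excerpt = "FATAL: remaining connection slots are reserved for superuser"

if any(t in log_excerpt for t in skill["triggers"]):
    print(f'Applying skill: {skill["name"]}')
```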
New Integrations Worth Knowing
The integration list was already solid in preview (Datadog, Dynatrace, New Relic, Splunk, GitHub, GitLab, ServiceNow). GA adds:
| Integration | What it unlocks | Status |
|---|---|---|
| PagerDuty | Native alert routing — PD fires, agent auto-investigates | New GA |
| Grafana | Connects to any Grafana instance (self-managed, Cloud, Amazon Managed). Pulls all configured datasources — Prometheus, Loki, OpenSearch | New GA |
| Azure DevOps | Tracks deployments and code changes in Azure Pipelines — cross-cloud correlation | New GA |
| Amazon EventBridge | Investigation events published to EventBridge for custom automation workflows | New GA |
| Private MCP | Connect your internal tools securely — no confidential data over the internet | New GA |
| On-Premises (MCP) | Extend incident investigation to on-prem workloads via Model Context Protocol | New GA |
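The EventBridge integration is the one I'd wire up first, because it turns investigations into automation triggers. A minimal Lambda consumer might look like this; note that the event detail fields (status, severity, summary) are my guesses, so inspect a real event payload before depending on any of them:

```python
import json

def handler(event, context):
    """Lambda target for an EventBridge rule matching DevOps Agent events.

    The detail fields below are assumptions, not a documented schema;
    check an actual event payload before relying on them.
    """
    detail = event.get("detail", {})

    if detail.get("status") == "INVESTIGATION_COMPLETE":
        # Example automation: page a channel, open a ticket, kick off a runbook.
        print(json.dumps({
            "severity": detail.get("severity"),
            "summary": detail.get("summary"),
        }))

    return {"statusCode": 200}
```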
Multicloud & On-Premises Support
This is probably the most strategically significant addition in GA. AWS DevOps Agent now investigates incidents in Azure workloads and correlates data across multicloud deployments. For organizations running a hybrid cloud strategy, this means unified incident response regardless of where the workload lives.
On-premises coverage works through MCP — the agent discovers on-prem resources by analyzing metrics, logs, and code to build a comprehensive topology map. T-Mobile's team, whose application logs live in an on-premises Splunk deployment, called this out specifically as a key capability during their pilot.
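If you want a feel for what the on-prem MCP path involves, the official Model Context Protocol Python SDK makes a minimal server almost trivial. This sketch exposes a single hypothetical log-search tool; the function body is a stub where you'd call your internal Splunk or log API:

```python
# pip install mcp  (official Model Context Protocol Python SDK)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("onprem-logs")

@mcp.tool()
def search_logs(query: str, hours: int = 1) -> str:
    """Search on-prem application logs (stub: wire this to your real log backend)."""
    # Hypothetical placeholder result; replace with an actual Splunk/API call.
    return f"results for {query!r} over the last {hours}h"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```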
Regional Availability
GA launches across six AWS regions:
- US East (N. Virginia) — us-east-1
- US West (Oregon) — us-west-2
- EU (Frankfurt) — eu-central-1
- EU (Ireland) — eu-west-1
- AP (Sydney) — ap-southeast-2
- AP (Tokyo) — ap-northeast-1
No AP South (Mumbai) at launch — if that's a data residency requirement for you, worth checking with your AWS account team on the roadmap.
Getting Started
AWS DevOps Agent is live in the console now. Here's the practical path to first value:
```
# Step 1 — Create your Agent Space
#   Navigate to: AWS Console → AWS DevOps Agent
#   URL: https://console.aws.amazon.com/aidevops/

# Step 2 — Connect your observability tools
#   Start with whatever you already use:
#   Datadog, Grafana, CloudWatch, Dynatrace, New Relic, Splunk

# Step 3 — Connect your code repo
#   GitHub or GitLab — this powers Code Indexing

# Step 4 — Reinvestigate a recent incident
#   Pick something from the last 30 days your team already solved
#   Run it through the agent — compare findings + time
#   This is the fastest way to prove ROI internally
```
Pricing Model
You pay per second of agent activity — no upfront commitments, no minimum usage. AWS Support customers get monthly credits toward usage based on a percentage of their gross AWS Support spend (percentage varies by support tier).
For detailed numbers, check the AWS DevOps Agent pricing page — I won't quote figures here since they'll evolve as the service matures.
My Take
I've been following this since the preview announcement, and the GA feature set makes this production-viable for most engineering teams. The numbers from early adopters are hard to ignore — WGU's 2-hour investigation collapsing to 28 minutes, Zenchef resolving a production issue without pulling engineers off a hackathon. These are real operational wins, not marketing metrics.
The features I think will matter most in practice:
- Code Indexing — closes the gap between "symptom detected" and "here's the specific line causing it"
- Triage Agent — alert fatigue is real; deduplication is underrated
- Learned Skills — the compounding value as the agent understands your patterns better over time
- Private MCP — enterprise teams blocked by data egress concerns now have a path forward
The thing that would push this from "impressive" to "essential" for me is deeper cost attribution and a way to measure how many incidents were prevented (not just resolved faster). That's a hard metric to surface, but it's the one that would justify this at budget review time.
Worth spinning up in a dev environment this week and reinvestigating a past incident. The bar for demonstrating value to your team is low — just show the side-by-side comparison.