Follow this guide to learn how to automate CloudWatch alerts, Kubernetes remediation, and incident reporting using multi-agent AI workflows with the AWS Strands Agents SDK.

The SRE Incident Response Agent is a multi-agent sample that ships with the AWS Strands Agents SDK. It automatically discovers active CloudWatch alarms, performs AI-powered root cause analysis using Claude Sonnet 4 on Amazon Bedrock, proposes Kubernetes or Helm remediations, and posts a structured incident report to Slack. This guide covers everything you need to clone the repo and run it yourself.

## Prerequisites

Before you begin, make sure the following are in place:

- Python 3.11+ installed on your machine
- AWS credentials configured (`aws configure` or an active IAM role)
- Amazon Bedrock access enabled for Claude Sonnet 4 in your target region
- `kubectl` and `helm` v3 installed (only required if you plan to run live remediations; dry-run mode works without them)

## Step 1: Clone the Repository

The sample lives inside the strands-agents/samples open source repository.
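Before cloning, you can sanity-check the prerequisites above with a short script. This is an illustrative helper, not part of the sample itself:

```python
import shutil
import sys

def check_prereqs() -> list[str]:
    """Return a list of problems; an empty list means you are ready to go."""
    problems = []
    if sys.version_info < (3, 11):
        problems.append("Python 3.11+ is required")
    # kubectl and helm are only needed for live remediations (DRY_RUN=false);
    # dry-run mode works without them, so these are warnings, not errors.
    for tool in ("kubectl", "helm"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found (dry-run mode still works)")
    return problems

if __name__ == "__main__":
    for p in check_prereqs():
        print("WARNING:", p)
```

Checking AWS credentials and Bedrock model access is easiest done from the AWS CLI or console, since it depends on your account configuration.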
Clone it and navigate to the SRE agent directory:

```bash
git clone https://github.com/strands-agents/samples.git
cd samples/02-samples/sre-incident-response-agent
```

The directory contains the following files:

```
sre-incident-response-agent/
├── sre_agent.py        # Main agent: 4 agents + 8 tools
├── test_sre_agent.py   # Pytest unit tests (12 tests, mocked AWS)
├── requirements.txt
├── .env.example
└── README.md
```

## Step 2: Create a Virtual Environment and Install Dependencies

```bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

The `requirements.txt` pins the core dependencies:

```
strands-agents>=0.1.0
strands-agents-tools>=0.1.0
boto3>=1.38.0
botocore>=1.38.0
```

## Step 3: Configure Environment Variables

Copy `.env.example` to `.env` and fill in your values:

```bash
cp .env.example .env
```

Open `.env` and set the following:

```bash
# AWS region where your CloudWatch alarms live
AWS_REGION=us-east-1

# Amazon Bedrock model ID (Claude Sonnet 4 is the default)
BEDROCK_MODEL_ID=us.anthropic.claude-sonnet-4-20250514-v1:0

# DRY_RUN=true means kubectl/helm commands are printed, not executed.
# Set to false only when you are ready for live remediations.
DRY_RUN=true

# Optional: post the incident report to Slack.
# Leave blank to print to stdout instead.
SLACK_WEBHOOK_URL=
```

## Step 4: Grant IAM Permissions

The agent needs read-only access to CloudWatch alarms, metric statistics, and log events; no write permissions to CloudWatch are required. Attach the following policy to the IAM role or user running the agent:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "cloudwatch:DescribeAlarms",
      "cloudwatch:GetMetricStatistics",
      "logs:FilterLogEvents",
      "logs:DescribeLogGroups"
    ],
    "Resource": "*"
  }]
}
```

## Step 5: Run the Agent

There are two ways to trigger the agent.

### Option A: Automatic Alarm Discovery

Let the agent discover all active CloudWatch alarms on its own.
This is the recommended mode for a real on-call scenario:

```bash
python sre_agent.py
```

### Option B: Targeted Investigation

Pass a natural-language description of the triggering event. The agent will focus its investigation on the service and symptom you describe:

```bash
python sre_agent.py "High CPU alarm fired on ECS service my-api in prod namespace"
```

## Example Output

Running the targeted trigger above produces output similar to the following:

```
Starting SRE Incident Response
Trigger: High CPU alarm fired on ECS service my-api in prod namespace

[cloudwatch_agent] Fetching active alarms...
  Found alarm: my-api-HighCPU (CPUUtilization > 85% for 5m)
  Metric stats: avg 91.3%, max 97.8% over last 30 min
  Log events: 14 OOMKilled events in /ecs/my-api

[rca_agent] Performing root cause analysis...
  Root cause: Memory leak causing CPU spike as GC thrashes
  Severity: P2 - single service,
```
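As noted in Step 3, leaving `SLACK_WEBHOOK_URL` blank sends the final report to stdout; with a webhook set, delivery is a simple JSON POST to Slack. A minimal standard-library sketch of that behavior (illustrative only, not the agent's actual implementation):

```python
import json
import os
import urllib.request

def post_report(report: str) -> None:
    """Send the incident report to Slack, or print it if no webhook is set."""
    webhook = os.environ.get("SLACK_WEBHOOK_URL", "")
    if not webhook:
        print(report)  # fall back to stdout, matching the .env default
        return
    # Slack incoming webhooks accept a JSON payload with a "text" field
    body = json.dumps({"text": report}).encode("utf-8")
    req = urllib.request.Request(
        webhook, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)
```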