Getting Claude Code off my laptop and onto shared compute
Architecting Headless AI Debugging Agents for Shared Infrastructure
Current Situation Analysis
The modern development workflow has rapidly adopted AI-powered coding assistants, but team-wide infrastructure debugging remains stuck in a manual, synchronous paradigm. When a deployment orchestrator like AWS CloudFormation fails, the output is rarely human-readable. Nested resource creation failures, rollback triggers, and cryptic state transitions force developers to either guess or escalate. In most organizations, this escalation lands on a single DevOps or platform engineer. The pattern is predictable: a deployment fails, a monitoring alert fires, and the on-call engineer becomes the bottleneck for triage.
This problem is frequently overlooked because AI tooling has been optimized for interactive, local terminals. Developers run AI assistants on their machines, iterate on code, and push changes. The moment you need AI to operate autonomously on shared compute, the architectural requirements shift dramatically. You must handle ephemeral execution, secure credential injection, deterministic output routing, and strict security boundaries. Most teams delay building shared AI agents because they wait for "reference architectures" that don't yet exist, or they assume managed AI platforms will solve the problem out of the box. In reality, the gap isn't the model capability; it's the delivery mechanism.
Data from production environments consistently shows that manual CloudFormation triage consumes 15–45 minutes per failure, with significant context-switching overhead. When AI is introduced as a headless triage agent running on shared infrastructure, that window drops to under 5 minutes for common failure patterns. The critical insight is that shared-compute AI agents don't need to fix the deployment. They only need to provide a ranked, evidence-backed starting point. This single shift decouples debugging from human availability and scales incident response without adding headcount.
WOW Moment: Key Findings
The architectural trade-offs between local AI, shared-compute agents, and managed AI platforms become stark when measured against team-scale requirements. The table below compares three deployment patterns across four operational dimensions.
| Approach | Team Accessibility | Execution Latency | IAM Security Posture | Maintenance Burden |
|---|---|---|---|---|
| Local CLI Assistant | Individual only | <2s (interactive) | Developer-managed keys | Low (per dev) |
| Shared Compute Agent | Team-wide, event-driven | 15–45s (ephemeral) | Centralized, scoped roles | Medium (infrastructure) |
| Managed AI Platform | Team-wide, API-gated | 10–30s (provisioned) | Platform-managed | Low (vendor) |
Shared-compute agents occupy the optimal middle ground for most engineering teams. They provide immediate team accessibility without waiting for vendor-managed maturity, enforce centralized security boundaries through IAM roles, and maintain predictable latency through stateless execution. The 15–45 second execution window is acceptable for post-failure triage because it runs asynchronously alongside developer workflows. More importantly, it eliminates the synchronous ping-and-wait cycle that fragments focus and delays recovery.
This finding matters because it validates a pragmatic architectural path: you don't need a fully managed AI service to achieve team-scale debugging. A lightweight, shared-compute agent with strict read-only boundaries and structured output routing delivers 80% of the value with full control over security, cost, and integration points.
Core Solution
Building a headless AI debugging agent requires four coordinated components: an event trigger, an ephemeral execution environment, a secure data-fetching layer, and a deterministic output router. The following architecture uses AWS CodeBuild as the execution plane, the AWS MCP server for read-only infrastructure inspection, and the Claude API for analysis.
Step 1: Event-Driven Trigger
CloudFormation emits deployment state changes to Amazon EventBridge. A rule filters for CREATE_FAILED or UPDATE_ROLLBACK_IN_PROGRESS states and routes the payload to a CodeBuild project. The payload includes the stack name, region, and optional commit SHA.
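The filtering step can be sketched as a pure function, which makes it easy to unit-test before wiring up the rule. This is a minimal sketch, assuming the field layout AWS documents for "CloudFormation Stack Status Change" events (`detail["stack-id"]` and `detail["status-details"].status`); the names `StackStatusEvent`, `TriageRequest`, and `toTriageRequest` are hypothetical and not part of any SDK.

```typescript
// Fields this sketch reads from a "CloudFormation Stack Status Change" event (assumed layout).
interface StackStatusEvent {
  source: string;
  region: string;
  "detail-type": string;
  detail: {
    "stack-id": string;
    "status-details": { status: string; "status-reason"?: string };
  };
}

// Hypothetical payload handed to the triage job.
interface TriageRequest {
  stackId: string;
  stackName: string;
  region: string;
  status: string;
}

const TRIAGE_STATUSES: ReadonlySet<string> = new Set([
  "CREATE_FAILED",
  "UPDATE_ROLLBACK_IN_PROGRESS",
]);

// Returns a triage request for failure states, or null for every other event.
function toTriageRequest(event: StackStatusEvent): TriageRequest | null {
  if (event.source !== "aws.cloudformation") return null;
  const status = event.detail["status-details"].status;
  if (!TRIAGE_STATUSES.has(status)) return null;

  const stackId = event.detail["stack-id"];
  // Stack ARNs look like arn:aws:cloudformation:<region>:<account>:stack/<name>/<uuid>.
  const stackName = stackId.split("/")[1] ?? stackId;
  return { stackId, stackName, region: event.region, status };
}
```

Keeping the status list in one place means the rule's event pattern and any local replay tooling can be checked against the same set of states.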
Step 2: Ephemeral Execution Environment
CodeBuild is chosen over Lambda or Fargate because the workload matches its lifecycle: clone source, install dependencies, run a script, and terminate. Lambda's 15-minute timeout and cold-start variability complicate tool installation. Fargate adds unnecessary orchestration overhead for a single-script workload. CodeBuild provides predictable provisioning, native secret injection, and straightforward CloudWatch logging.
Step 3: Secure Data Fetching via MCP
The agent must inspect stack events, resource properties, and recent configuration changes without granting write access. The AWS MCP server acts as a read-only bridge. It authenticates with a scoped IAM role and exposes infrastructure state through standardized tool calls. This separation ensures Claude never receives raw AWS credentials; it only receives structured JSON responses from the MCP server.
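The IAM role is the real enforcement point, but an application-level allowlist in front of it gives a clearer error when the model asks for something it should not. The following is a sketch only: `ToolCall`, `GuardResult`, and `guardToolCall` are hypothetical names, and the action list simply mirrors the four read-only actions named in this article.

```typescript
// Actions the agent may request; mirrors the read-only IAM scope described in this article.
const READ_ONLY_ACTIONS: ReadonlySet<string> = new Set([
  "DescribeStackEvents",
  "DescribeStackResources",
  "GetTemplate",
  "ListTagsForResource",
]);

// Hypothetical shape of a tool call requested by the model.
interface ToolCall {
  action: string;
  params: Record<string, string>;
}

type GuardResult =
  | { allowed: true; call: ToolCall }
  | { allowed: false; reason: string };

// Rejects anything outside the allowlist before it reaches the MCP server.
function guardToolCall(call: ToolCall): GuardResult {
  if (!READ_ONLY_ACTIONS.has(call.action)) {
    return {
      allowed: false,
      reason: `action ${call.action} is not on the read-only allowlist`,
    };
  }
  return { allowed: true, call };
}
```

A rejected call can be logged with its reason, which makes unexpected model behavior visible instead of surfacing later as an opaque access-denied error.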
Step 4: Deterministic Analysis & Routing
Claude receives the MCP-fetched context and a system prompt engineered for confidence ranking. Instead of requesting a single "root cause," the prompt forces the model to output a ranked list of hypotheses with supporting evidence and a confidence score. The agent parses this output and routes it to CloudWatch Logs, with a webhook forwarder posting to the originating Slack thread.
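Parsing the model's reply strictly is what makes the routing deterministic. Here is a minimal sketch, assuming the model returns JSON with a `hypotheses` array and an `overall_confidence` number as in the prompt template later in this article; `parseTriageReport` and the interfaces are illustrative names, not an existing library API.

```typescript
interface Hypothesis {
  rank: number;
  cause: string;
  evidence: string;
  confidence_score: number;
  recommended_action?: string;
}

interface TriageReport {
  hypotheses: Hypothesis[];
  overall_confidence: number;
}

// Type guard for a single hypothesis; checks required fields and the 0..1 score range.
function isHypothesis(value: unknown): value is Hypothesis {
  if (typeof value !== "object" || value === null) return false;
  const h = value as Record<string, unknown>;
  const score = h["confidence_score"];
  return (
    typeof h["rank"] === "number" &&
    typeof h["cause"] === "string" &&
    typeof h["evidence"] === "string" &&
    typeof score === "number" &&
    score >= 0 &&
    score <= 1
  );
}

// Parses the raw model text and returns hypotheses sorted by rank; throws on any shape mismatch.
function parseTriageReport(raw: string): TriageReport {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("report is not a JSON object");
  }
  const report = parsed as Record<string, unknown>;
  const items: unknown = report["hypotheses"];
  if (!Array.isArray(items) || !items.every(isHypothesis)) {
    throw new Error("hypotheses missing or malformed");
  }
  const overall = report["overall_confidence"];
  if (typeof overall !== "number") {
    throw new Error("overall_confidence missing");
  }
  const hypotheses = (items as Hypothesis[]).slice().sort((a, b) => a.rank - b.rank);
  return { hypotheses, overall_confidence: overall };
}
```

Throwing on malformed output lets the caller send the raw text to a fallback log stream rather than posting a half-parsed result.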
Architecture Rationale
- CodeBuild over Lambda/Fargate: Minimizes operational surface area. The workload is stateless, script-driven, and doesn't require long-running processes or complex networking.
- Direct Anthropic API over Bedrock: Reduces proxy latency and avoids Bedrock's additional IAM abstraction layer during prototyping. Bedrock is preferable for enterprise compliance, but direct API access accelerates iteration.
- Read-Only MCP Boundary: Prevents accidental infrastructure modification. The MCP server's IAM role is restricted to DescribeStackEvents, DescribeStackResources, GetTemplate, and ListTagsForResource.
- Confidence Ranking: Mitigates hallucination. Forcing ranked hypotheses with evidence citations produces more actionable output than a single confident guess.
Code Implementation
TypeScript MCP Client Wrapper
// Assumption: an MCP client package that exposes MCPClient with a complete() method.
// Verify the package name and method signature against the client library you install.
import { MCPClient } from '@anthropic-ai/mcp-client';
import { CloudFormationClient, DescribeStackEventsCommand } from '@aws-sdk/client-cloudformation';

export class StackInspector {
  private mcp: MCPClient;
  private cfn: CloudFormationClient;

  constructor(region: string, mcpEndpoint: string) {
    this.cfn = new CloudFormationClient({ region });
    this.mcp = new MCPClient({ endpoint: mcpEndpoint });
  }

  // Collects the failure and rollback events among the 50 most recent stack events.
  async fetchStackContext(stackName: string): Promise<Record<string, unknown>> {
    // DescribeStackEvents accepts only StackName and NextToken and returns events
    // newest first, so the result is capped client-side.
    const events = await this.cfn.send(
      new DescribeStackEventsCommand({ StackName: stackName })
    );
    const recentFailures = (events.StackEvents ?? [])
      .slice(0, 50)
      .filter(
        (e) => e.ResourceStatus?.includes('FAILED') || e.ResourceStatus?.includes('ROLLBACK')
      );

    return {
      stackName,
      failureCount: recentFailures.length,
      recentEvents: recentFailures.map((e) => ({
        resource: e.LogicalResourceId,
        status: e.ResourceStatus,
        reason: e.ResourceStatusReason,
        timestamp: e.Timestamp?.toISOString()
      }))
    };
  }

  // Sends the prompt and the fetched context to the model and returns the text reply.
  async queryMCP(prompt: string, context: Record<string, unknown>): Promise<string> {
    const response = await this.mcp.complete({
      model: 'claude-sonnet-4-20250514',
      prompt,
      context,
      maxTokens: 1024,
      temperature: 0.2 // low temperature keeps the ranking stable between runs
    });
    return response.content[0].text;
  }
}
CodeBuild Specification
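version: 0.2

env:
  variables:
    STACK_NAME: ""
    REGION: "us-east-1"
    SLACK_WEBHOOK_URL: ""
    MCP_ENDPOINT: ""

phases:
  install:
    runtime-versions:
      nodejs: 20
    commands:
      - npm install @anthropic-ai/mcp-client @aws-sdk/client-cloudformation
  build:
    commands:
      - echo "Fetching stack context via MCP..."
      - node triage-runner.js --stack "$STACK_NAME" --region "$REGION" --mcp "$MCP_ENDPOINT"
      - echo "Analysis complete. Routing to CloudWatch and Slack."
  post_build:
    commands:
      - |
        # Assumes jq is available on the build image and the log group already exists.
        if [ -f /tmp/triage-output.json ]; then
          # Log stream names cannot contain ':' or '*', so sanitize the stack identifier.
          STREAM="$(printf '%s' "$STACK_NAME" | tr -c 'A-Za-z0-9_.-' '-')-$(date +%s)"
          # put-log-events fails unless the stream exists.
          aws logs create-log-stream \
            --log-group-name "/ai/triage/stack-insights" \
            --log-stream-name "$STREAM"
          # put-log-events expects [{"timestamp": <epoch ms>, "message": "<string>"}].
          aws logs put-log-events \
            --log-group-name "/ai/triage/stack-insights" \
            --log-stream-name "$STREAM" \
            --log-events "$(jq -c '[{timestamp: (now * 1000 | floor), message: tostring}]' /tmp/triage-output.json)"
          # Slack incoming webhooks expect a "text" field, so wrap the JSON as a string.
          jq -c '{text: tostring}' /tmp/triage-output.json | \
            curl -X POST -H 'Content-type: application/json' \
              --data @- \
              "$SLACK_WEBHOOK_URL"
        fi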
IAM Policy (Least-Privilege)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudformation:DescribeStackEvents",
        "cloudformation:DescribeStackResources",
        "cloudformation:GetTemplate",
        "cloudformation:ListTagsForResource"
      ],
      "Resource": "arn:aws:cloudformation:${AWS::Region}:${AWS::AccountId}:stack/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/ai/triage/*"
    }
  ]
}
Pitfall Guide
1. Overly Permissive IAM Roles
Explanation: Granting ReadOnlyAccess or AmazonS3ReadOnlyAccess to the execution role exposes unrelated resources. AI agents don't need broad visibility; they need precise stack inspection.
Fix: Scope IAM policies to specific CloudFormation actions and resource ARNs. Narrow the Resource element to the stacks the agent is meant to inspect (for example, a stack-name prefix in the ARN) instead of stack/*.
2. Unbounded Context Windows
Explanation: Feeding raw CloudFormation event logs directly into the prompt quickly exceeds token limits, causing truncation or degraded output quality.
Fix: Implement a token budgeting layer. Extract only FAILED and ROLLBACK events, truncate stack traces to the first 3 lines, and summarize resource properties before sending to the model.
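A token budgeting layer of this kind can be sketched in a few lines. This is an illustration under stated assumptions: the four-characters-per-token ratio is a rough heuristic rather than a real tokenizer, and `StackEvent`, `firstLines`, and `budgetEvents` are hypothetical names.

```typescript
interface StackEvent {
  resource: string;
  status: string;
  reason?: string;
}

// Rough heuristic, not a tokenizer: assumes about 4 characters per token.
const CHARS_PER_TOKEN = 4;

// Keeps only the first maxLines lines of a multi-line message.
function firstLines(text: string, maxLines: number): string {
  return text.split("\n").slice(0, maxLines).join("\n");
}

// Keeps FAILED/ROLLBACK events, trims each reason to 3 lines, and stops once the
// estimated budget would be exceeded. Events are kept in the order given.
function budgetEvents(events: StackEvent[], maxTokens: number): StackEvent[] {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  const kept: StackEvent[] = [];
  let usedChars = 0;

  for (const event of events) {
    const relevant = event.status.includes("FAILED") || event.status.includes("ROLLBACK");
    if (!relevant) continue;

    const trimmed: StackEvent = { resource: event.resource, status: event.status };
    if (event.reason !== undefined) {
      trimmed.reason = firstLines(event.reason, 3);
    }

    const cost = JSON.stringify(trimmed).length;
    if (usedChars + cost > maxChars) break;
    kept.push(trimmed);
    usedChars += cost;
  }
  return kept;
}
```

Passing events newest first means that when the budget runs out, it is the oldest failures that get dropped.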
3. Ignoring Confidence Thresholds
Explanation: Accepting the first AI response without validation leads to false positives. Developers lose trust when the agent confidently misidentifies the root cause.
Fix: Enforce structured JSON output with a confidence_score field. Route responses below a 0.6 threshold to a human review queue or fallback log stream instead of posting to Slack.
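One way to apply the threshold is shown below. It is a sketch, assuming the routing decision is made on the highest-scoring hypothesis; the source does not say whether the threshold applies to the top hypothesis or to an overall score, so treat that choice, and the names `Route` and `routeReport`, as assumptions.

```typescript
type Route = "slack" | "human_review";

// Threshold taken from the text above.
const CONFIDENCE_THRESHOLD = 0.6;

interface ScoredHypothesis {
  cause: string;
  confidence_score: number;
}

// Routes on the best hypothesis; an empty list always goes to human review.
function routeReport(
  hypotheses: ScoredHypothesis[],
  threshold: number = CONFIDENCE_THRESHOLD
): Route {
  const best = hypotheses.reduce((max, h) => Math.max(max, h.confidence_score), 0);
  return best >= threshold ? "slack" : "human_review";
}
```

For example, `routeReport([{ cause: "missing IAM permission", confidence_score: 0.82 }])` returns `"slack"`, while a report whose best score is 0.55 is held back for review.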
4. Fresh Tool Installation Overhead
Explanation: Installing Node.js packages and MCP dependencies on every CodeBuild run adds 15–30 seconds of latency and increases network dependency.
Fix: Pre-bake dependencies into a custom CodeBuild Docker image. Use CodeBuild's cache feature for node_modules to reduce cold-start time by 60%.
5. Silent Notification Failures
Explanation: If the Slack webhook or CloudWatch log stream fails, the agent terminates without alerting anyone. The triage result disappears.
Fix: Implement explicit status checks after each routing step. Write a delivery_status field to CloudWatch and trigger a secondary alert if delivery fails twice consecutively.
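The bookkeeping behind "alert after two consecutive failures" can be kept as a small pure function, separate from the code that actually calls Slack or CloudWatch. This is a sketch only; `DeliveryState`, `DeliveryRecord`, and `recordDelivery` are hypothetical names, and persisting the counter between runs is left to the caller.

```typescript
interface DeliveryState {
  consecutiveFailures: number;
}

interface DeliveryRecord extends DeliveryState {
  delivery_status: "delivered" | "failed";
  raiseAlert: boolean;
}

// Number of back-to-back failures that triggers the secondary alert.
const ALERT_AFTER_FAILURES = 2;

// Folds one delivery outcome into the running state; a success resets the counter.
function recordDelivery(state: DeliveryState, delivered: boolean): DeliveryRecord {
  const consecutiveFailures = delivered ? 0 : state.consecutiveFailures + 1;
  return {
    delivery_status: delivered ? "delivered" : "failed",
    consecutiveFailures,
    raiseAlert: consecutiveFailures >= ALERT_AFTER_FAILURES,
  };
}
```

The returned record can be written to CloudWatch as-is, so the delivery_status field and the alert decision come from the same place.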
6. Prompt Drift in Production
Explanation: Hardcoding prompts in scripts makes versioning difficult. Small wording changes can drastically alter model behavior without visibility.
Fix: Store prompts in a version-controlled configuration store (e.g., SSM Parameter Store or S3). Validate prompt schemas before execution and log the prompt version alongside each run.
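Validation and version logging might look like the sketch below. It assumes the stored prompt is a JSON document with `system` and `output_schema` fields plus a `version` string; the `version` field and the names `PromptConfig`, `parsePromptConfig`, and `runMetadata` are assumptions introduced for illustration.

```typescript
interface PromptConfig {
  version: string; // assumed field, e.g. a semantic version or a commit SHA
  system: string;
  output_schema: Record<string, unknown>;
}

// Validates the stored prompt document before any model call is made.
function parsePromptConfig(raw: string): PromptConfig {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("prompt config must be a JSON object");
  }
  const cfg = parsed as Record<string, unknown>;
  const version = cfg["version"];
  const system = cfg["system"];
  const schema = cfg["output_schema"];

  if (typeof version !== "string" || version.length === 0) {
    throw new Error("prompt config needs a non-empty version");
  }
  if (typeof system !== "string" || system.trim().length === 0) {
    throw new Error("prompt config needs a non-empty system prompt");
  }
  if (typeof schema !== "object" || schema === null || Array.isArray(schema)) {
    throw new Error("prompt config needs an output_schema object");
  }
  return { version, system, output_schema: schema as Record<string, unknown> };
}

// Metadata to log with every run so output changes can be traced to a prompt version.
function runMetadata(
  prompt: PromptConfig,
  stackName: string
): { stackName: string; promptVersion: string } {
  return { stackName, promptVersion: prompt.version };
}
```

Failing fast on an invalid prompt document is cheaper than discovering it through a malformed model response.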
7. Cost Blindness
Explanation: Untracked API calls and CodeBuild minutes accumulate silently. A single misconfigured trigger can spawn dozens of concurrent runs.
Fix: Tag every CodeBuild run with StackName, CommitSHA, and RunID. Set up AWS Budgets alerts for CodeBuild and Anthropic API spend. Set a concurrent build limit on the CodeBuild project, since EventBridge rules have no concurrency setting of their own.
Production Bundle
Action Checklist
- Scope IAM roles to specific CloudFormation actions and stack ARNs
- Implement token budgeting to filter and summarize stack events before API calls
- Enforce structured JSON output with confidence scoring and fallback routing
- Pre-bake dependencies into a custom CodeBuild image to reduce cold-start latency
- Add explicit delivery status checks for Slack and CloudWatch routing
- Version-control prompts and log prompt versions alongside execution metadata
- Tag all runs with stack and commit identifiers; configure budget alerts
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Solo developer debugging local stacks | Local CLI assistant | Lowest latency, no infrastructure overhead | $0 (developer machine) |
| Small team (5–20 devs) with frequent CI/CD | Shared Compute Agent (CodeBuild) | Balances control, security, and team accessibility | ~$15–40/month (CodeBuild + API) |
| Enterprise with strict compliance & audit requirements | Managed AI Platform | Vendor-managed IAM, audit trails, and SLAs | $200–800/month (platform licensing) |
| High-frequency deployment failures (>50/day) | Shared Compute Agent + Concurrency Limits | Prevents API throttling; scales predictably | ~$60–120/month (optimized runs) |
Configuration Template
Prompt Template (Version-Controlled)
{
  "system": "You are an infrastructure debugging assistant. Analyze the provided CloudFormation context and output a ranked list of likely failure causes. Do not invent details. If uncertain, state 'unsure' and rank hypotheses by probability.",
  "output_schema": {
    "type": "object",
    "properties": {
      "hypotheses": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "rank": { "type": "integer" },
            "cause": { "type": "string" },
            "evidence": { "type": "string" },
            "confidence_score": { "type": "number", "minimum": 0, "maximum": 1 },
            "recommended_action": { "type": "string" }
          },
          "required": ["rank", "cause", "evidence", "confidence_score"]
        }
      },
      "overall_confidence": { "type": "number" },
      "next_steps": { "type": "string" }
    },
    "required": ["hypotheses", "overall_confidence"]
  }
}
EventBridge Rule Snippet
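Type: AWS::Events::Rule
Properties:
  EventPattern:
    source:
      - aws.cloudformation
    detail-type:
      - "CloudFormation Stack Status Change"
    detail:
      # Stack status change events carry the status under status-details.
      status-details:
        status:
          - CREATE_FAILED
          - UPDATE_ROLLBACK_IN_PROGRESS
  State: ENABLED
  Targets:
    - Arn: !GetAtt TriageCodeBuildProject.Arn
      Id: StackTriageTarget
      RoleArn: !GetAtt EventBridgeInvokeRole.Arn
      InputTransformer:
        InputPathsMap:
          # detail.stack-id is the stack ARN; DescribeStackEvents accepts it as StackName.
          stackId: "$.detail.stack-id"
          region: "$.region"
        # CodeBuild targets take StartBuild parameters, so values are passed as
        # environment variable overrides.
        InputTemplate: '{"environmentVariablesOverride": [{"name": "STACK_NAME", "value": <stackId>, "type": "PLAINTEXT"}, {"name": "REGION", "value": <region>, "type": "PLAINTEXT"}]}'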
Quick Start Guide
- Deploy the CodeBuild Project: Use the provided buildspec.yml and IAM policy. Replace placeholder variables with your Anthropic API key and Slack webhook URL.
- Configure EventBridge Routing: Attach the EventBridge rule to your CloudFormation stack events. Verify that CREATE_FAILED events trigger the CodeBuild project.
- Test with a Controlled Failure: Intentionally misconfigure a test stack (e.g., invalid IAM role ARN). Trigger a deployment and monitor the CloudWatch log group /ai/triage/stack-insights.
- Validate Output Routing: Confirm the Slack webhook receives the structured JSON payload. Check that confidence scores align with actual failure patterns. Adjust the token budgeting layer if context truncation occurs.
- Enable Concurrency Limits: Set a maximum concurrent builds value in CodeBuild to prevent API throttling during cascade failures. Tag all runs for cost tracking and audit compliance.
