onment. Start by defining a factory that instantiates agents with consistent sandbox and context settings.
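```typescript
import { AgentRuntime, SandboxConfig, ContextIndex } from '@cursor/sdk';

interface AgentBlueprint {
  modelId: string;
  sandbox?: Partial<SandboxConfig>; // overrides merged over the secure defaults below
  context?: Partial<ContextIndex>;  // overrides merged over the default index settings
  tools: string[];
}

export class AgentFactory {
  static async create(blueprint: AgentBlueprint) {
    const runtime = new AgentRuntime({
      model: blueprint.modelId,
      sandbox: {
        // Secure defaults: throwaway VM, repo mounted, network isolation, read-only credentials
        ephemeral: true,
        mountRepo: true,
        networkIsolation: true,
        credentialScoping: 'readonly',
        ...blueprint.sandbox // caller-supplied overrides
      },
      context: {
        indexingStrategy: 'hybrid', // semantic embeddings + exact-match symbol search
        maxTokens: 128000,
        pruningThreshold: 0.75,
        ...blueprint.context // caller-supplied overrides
      },
      tools: blueprint.tools
    });
    // Resolves to the initialized agent; callers should await create()
    return runtime.initialize();
  }
}
```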
Why this structure? Ephemeral sandboxes are discarded after each session, which limits credential leakage and blast radius. Hybrid indexing combines semantic embeddings for conceptual search with exact-match grep for symbol resolution, reducing wasted tokens on irrelevant context. The `credentialScoping: 'readonly'` default enforces least-privilege access until explicit write permissions are granted via approval hooks.
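The factory hands `indexingStrategy: 'hybrid'`, `maxTokens`, and `pruningThreshold` to the runtime without showing how they interact. Below is a minimal, self-contained sketch of one plausible interpretation. It assumes each candidate file has already been scored by an embedding search (`semanticScore` in [0, 1]) and flagged by an exact-match symbol lookup (`exactMatch`). The `ContextCandidate` type, the `selectContext` function, and the rule that exact matches bypass the threshold are illustrative assumptions, not documented SDK behavior.

```typescript
// Hypothetical candidate record produced by the two halves of a hybrid index.
interface ContextCandidate {
  path: string;
  semanticScore: number; // assumed embedding similarity, normalized to [0, 1]
  exactMatch: boolean;   // true if a grep-style lookup found a requested symbol here
  tokens: number;        // token count of the file's contents
}

// Keep exact symbol matches regardless of score, keep semantic-only hits at or above
// the pruning threshold, then pack greedily by priority under the token budget.
function selectContext(
  candidates: ContextCandidate[],
  maxTokens: number,
  pruningThreshold: number
): ContextCandidate[] {
  const eligible = candidates.filter(
    (c) => c.exactMatch || c.semanticScore >= pruningThreshold
  );
  // Exact matches first, then descending semantic score.
  eligible.sort(
    (a, b) =>
      Number(b.exactMatch) - Number(a.exactMatch) || b.semanticScore - a.semanticScore
  );
  const selected: ContextCandidate[] = [];
  let used = 0;
  for (const c of eligible) {
    if (used + c.tokens > maxTokens) continue; // would overflow the budget; try smaller files
    selected.push(c);
    used += c.tokens;
  }
  return selected;
}

const picked = selectContext(
  [
    { path: 'src/user.service.ts', semanticScore: 0.62, exactMatch: true, tokens: 4000 },
    { path: 'src/user.repo.ts', semanticScore: 0.81, exactMatch: false, tokens: 3000 },
    { path: 'docs/changelog.md', semanticScore: 0.4, exactMatch: false, tokens: 9000 },
    { path: 'src/legacy/big.ts', semanticScore: 0.9, exactMatch: false, tokens: 125000 }
  ],
  128000,
  0.75
);
console.log(picked.map((c) => c.path)); // [ 'src/user.service.ts', 'src/user.repo.ts' ]
```

Skipping an oversized file instead of stopping at the first overflow lets smaller relevant files still fit in the budget; the low-scoring changelog never enters the prompt at all.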
Step 2: Implement Dynamic Model Routing
Hardcoding a single model leads to inefficient token consumption. A production system routes tasks based on complexity, required reasoning depth, and cost constraints.
```typescript
// AgentFactory is the class defined in the previous step.
export class TaskRouter {
  private static readonly SPECIALIZED = 'composer-2';
  private static readonly FRONTIER = 'claude-opus-4';
  private static readonly LIGHTWEIGHT = 'gpt-4o-mini';

  // Returns a model ID. Patterns are checked from most to least demanding, so a task
  // such as "refactor ... for the migration" is not downgraded to the lightweight tier.
  static resolveModel(task: string): string {
    if (/architect|design|system|tradeoff|migration/i.test(task)) return TaskRouter.FRONTIER;
    if (/debug|trace|root[\s-]?cause|performance/i.test(task)) return TaskRouter.SPECIALIZED;
    if (/refactor|rename|format|lint/i.test(task)) return TaskRouter.LIGHTWEIGHT;
    return TaskRouter.SPECIALIZED; // default route for tasks that match no pattern
  }

  static async execute(task: string, context: string) {
    const model = TaskRouter.resolveModel(task);
    // Await the initialized agent before calling run()
    const agent = await AgentFactory.create({
      modelId: model,
      sandbox: { ephemeral: true, mountRepo: true },
      context: { indexingStrategy: 'hybrid' },
      tools: ['file-read', 'terminal-exec', 'git-branch']
    });
    return agent.run(`${task}\n\nContext:\n${context}`);
  }
}
```
Why this approach? Routing logic separates business intent from execution. By mapping task patterns to model capabilities, you avoid burning frontier tokens on formatting or simple renames: those go to the lightweight model, Composer 2 handles debugging and performance tracing at a fraction of frontier cost, and frontier models are reserved for architectural synthesis. Tasks that match no pattern fall back to Composer 2, so routing ambiguity never leaves a task without a model.
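The pattern-to-model mapping can also be written as a pure function that is easy to unit-test without the SDK. The standalone sketch below checks the most demanding patterns first and adds an assumed per-tier fallback chain (`FALLBACKS`) driven by a caller-supplied `isAvailable` health check, which is one way to keep tasks running during a model outage. The model IDs are the ones used above; `routeModel`, `pickAvailable`, and the chain order are hypothetical choices for illustration.

```typescript
type ModelId = 'gpt-4o-mini' | 'composer-2' | 'claude-opus-4';

// Checked in order, most demanding first, so a task that mentions both
// "refactor" and "migration" is routed to the frontier tier.
const ROUTES: Array<[RegExp, ModelId]> = [
  [/architect|design|system|tradeoff|migration/i, 'claude-opus-4'],
  [/debug|trace|root[\s-]?cause|performance/i, 'composer-2'],
  [/refactor|rename|format|lint/i, 'gpt-4o-mini']
];

// Hypothetical fallback chains: preferred model first, then substitutes.
const FALLBACKS: Record<ModelId, ModelId[]> = {
  'claude-opus-4': ['claude-opus-4', 'composer-2'],
  'composer-2': ['composer-2', 'claude-opus-4'],
  'gpt-4o-mini': ['gpt-4o-mini', 'composer-2']
};

function routeModel(task: string): ModelId {
  for (const [pattern, model] of ROUTES) {
    if (pattern.test(task)) return model;
  }
  return 'composer-2'; // default route when no pattern matches
}

// Return the first model in the task's chain that the health check reports as available.
function pickAvailable(task: string, isAvailable: (model: ModelId) => boolean): ModelId {
  const chain = FALLBACKS[routeModel(task)];
  const model = chain.find(isAvailable);
  if (model === undefined) throw new Error(`No model available for task: ${task}`);
  return model;
}

console.log(routeModel('Refactor auth module for the Postgres migration')); // claude-opus-4
console.log(routeModel('Fix lint errors in utils')); // gpt-4o-mini
console.log(pickAvailable('Fix lint errors in utils', (m) => m !== 'gpt-4o-mini')); // composer-2
```

Keeping the routes in a data table rather than in branching code makes it straightforward to adjust patterns or chain order as cost-per-task metrics come in.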
The Model Context Protocol (MCP) standardizes how agents interact with external systems. Instead of embedding tool logic inside the agent, register tools as discrete, versioned services.
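```typescript
import { McpToolRegistry, ToolDefinition } from '@cursor/sdk/mcp';

const ciPipelineTool: ToolDefinition = {
  name: 'ci-trigger',
  description: 'Triggers a specific CI pipeline and returns status',
  parameters: {
    pipeline: { type: 'string', required: true },
    branch: { type: 'string', required: true }
  },
  handler: async (params: any) => {
    const token = process.env.CI_TOKEN;
    // Fail fast instead of sending "Bearer undefined"
    if (!token) throw new Error('CI_TOKEN is not set');
    const response = await fetch('https://ci.internal/api/run', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${token}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify(params)
    });
    // fetch only rejects on network errors, so surface HTTP failures explicitly
    if (!response.ok) throw new Error(`CI trigger failed: ${response.status} ${response.statusText}`);
    return response.json();
  }
};

export const toolRegistry = new McpToolRegistry();
toolRegistry.register(ciPipelineTool);
```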
Why MCP? Decoupling tools from the agent runtime enables independent testing, versioning, and security auditing. The registry pattern allows hot-swapping implementations without modifying agent logic. Pre-execution hooks can validate parameters and enforce dry-run modes before touching production systems.
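Here is a minimal sketch of such a pre-execution layer, written as a generic wrapper around any async tool handler. `withGuards`, `GuardOptions`, and the `Guarded` result type are hypothetical names; the sketch assumes the caller supplies destructive-action detection and the approval decision (for example, from a human approval prompt), and it does not use any SDK hook API.

```typescript
// Hypothetical handler shape: an async function of a params object.
type ToolHandler<P, R> = (params: P) => Promise<R>;

interface GuardOptions<P> {
  dryRun: boolean;                       // describe the call instead of running it
  isDestructive: (params: P) => boolean; // e.g. git push, rm -rf, production deploys
  approved: (params: P) => boolean;      // outcome of an explicit human approval step
  log: (entry: string) => void;          // audit-trail sink
}

type Guarded<R> =
  | { status: 'dry-run'; wouldRun: string }
  | { status: 'blocked'; reason: string }
  | { status: 'executed'; result: R };

// Wrap a tool handler so every call is logged, dry-run mode never executes anything,
// and destructive calls run only after approval.
function withGuards<P, R>(
  name: string,
  handler: ToolHandler<P, R>,
  opts: GuardOptions<P>
): ToolHandler<P, Guarded<R>> {
  return async (params: P): Promise<Guarded<R>> => {
    const call = `${name}(${JSON.stringify(params)})`;
    opts.log(`requested ${call}`);
    if (opts.dryRun) {
      return { status: 'dry-run', wouldRun: call };
    }
    if (opts.isDestructive(params) && !opts.approved(params)) {
      opts.log(`blocked ${call}`);
      return { status: 'blocked', reason: 'destructive action requires approval' };
    }
    const result = await handler(params);
    opts.log(`executed ${call}`);
    return { status: 'executed', result };
  };
}

// Usage: a CI trigger that previews the call instead of executing it.
const previewCi = withGuards(
  'ci-trigger',
  async (p: { pipeline: string; branch: string }) => ({ queued: true, ...p }),
  {
    dryRun: true,
    isDestructive: (p) => p.branch === 'main',
    approved: () => false,
    log: (entry) => console.log(`[audit] ${entry}`)
  }
);
previewCi({ pipeline: 'deploy', branch: 'main' }).then((r) => console.log(r.status)); // dry-run
```

Dry-run is checked before approval on purpose: previewing a destructive call is safe, and it gives the approver something concrete to review.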
Long-running refactors or multi-step debugging sessions require state recovery. The SDK handles baseline persistence, but production systems benefit from explicit checkpoint strategies.
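```typescript
// Assumed handle shape: these are the methods the checkpoint logic calls on an agent.
interface CheckpointableAgent {
  saveState(state: Record<string, unknown>): Promise<void>;
  loadState(): Promise<Record<string, unknown> | null>;
  resume(state: Record<string, unknown>): Promise<unknown>;
}

export class SessionManager {
  private static readonly agents = new Map<string, CheckpointableAgent>(); // live handles by session ID

  static register(agentId: string, agent: CheckpointableAgent) {
    SessionManager.agents.set(agentId, agent);
  }

  // A session ID is just a string, so resolve it to the agent handle before calling state methods
  private static lookup(agentId: string): CheckpointableAgent {
    const agent = SessionManager.agents.get(agentId);
    if (!agent) throw new Error(`Unknown agent session: ${agentId}`);
    return agent;
  }

  static async saveCheckpoint(agentId: string, metadata: Record<string, any>) {
    await SessionManager.lookup(agentId).saveState({
      timestamp: Date.now(),
      branch: metadata.branch,
      pendingChanges: metadata.diff,
      contextSnapshot: metadata.contextHash // enables diff-aware resumption
    });
  }

  static async restore(agentId: string) {
    const agent = SessionManager.lookup(agentId);
    const state = await agent.loadState();
    if (!state) throw new Error(`No checkpoint found for session ${agentId}`);
    return agent.resume(state);
  }
}
```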
Why explicit checkpoints? Network interruptions, laptop sleep cycles, and CI feedback loops frequently interrupt agent sessions. Automatic recovery prevents token waste from re-exploring already-resolved paths. Storing context hashes enables diff-aware resumption, ensuring the agent only re-processes changed files.
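Diff-aware resumption needs more than a single hash: to know *which* files changed, the checkpoint has to keep a digest per file. The sketch below assumes a snapshot maps each file path to the SHA-256 of its contents (computed with Node's built-in `node:crypto`), plus an order-independent combined digest that could fill the `contextSnapshot` field above. `snapshot`, `diffSnapshot`, and `snapshotDigest` are illustrative names, not SDK functions.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical per-file snapshot: path -> SHA-256 of the file's contents at checkpoint time.
type ContextSnapshot = Record<string, string>;

const sha256 = (text: string): string => createHash('sha256').update(text).digest('hex');

function snapshot(files: Record<string, string>): ContextSnapshot {
  const out: ContextSnapshot = {};
  for (const [path, content] of Object.entries(files)) out[path] = sha256(content);
  return out;
}

// On resume, only new or modified files need re-processing; removed files
// are reported so their stale context can be dropped.
function diffSnapshot(saved: ContextSnapshot, current: ContextSnapshot) {
  const changed = Object.keys(current).filter((p) => saved[p] !== current[p]);
  const removed = Object.keys(saved).filter(
    (p) => !Object.prototype.hasOwnProperty.call(current, p)
  );
  return { changed, removed };
}

// One order-independent digest of the whole snapshot, e.g. for a contextHash field.
function snapshotDigest(s: ContextSnapshot): string {
  const canonical = Object.keys(s).sort().map((p) => `${p}:${s[p]}`).join('\n');
  return sha256(canonical);
}

const before = snapshot({ 'a.ts': 'export const a = 1;', 'b.ts': 'export const b = 2;' });
const after = snapshot({ 'a.ts': 'export const a = 1;', 'b.ts': 'export const b = 3;', 'c.ts': '' });
console.log(diffSnapshot(before, after)); // { changed: [ 'b.ts', 'c.ts' ], removed: [] }
```

Comparing the combined digest first is a cheap way to detect the common "nothing changed" case before walking the per-file map.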
Pitfall Guide
1. Unbounded Context Injection
Explanation: Feeding entire repositories or large dependency trees into the prompt causes token bloat, degrades reasoning quality, and inflates costs.
Fix: Implement context pruning with semantic relevance scoring. Use hybrid search to fetch only files directly referenced by symbols, imports, or recent git diffs. Set a hard token budget per session and enforce truncation strategies.
2. Sandbox Credential Leakage
Explanation: Running agents with broad environment access exposes production keys, cloud tokens, and internal secrets to untrusted execution paths.
Fix: Enforce ephemeral VMs with scoped credentials. Use read-only mounts by default and require explicit approval hooks for write operations. Rotate secrets per session and audit all terminal commands via pre-execution validation.
3. Static Model Assignment
Explanation: Binding all tasks to a single model ignores cost/quality trade-offs. Frontier models waste tokens on trivial tasks; lightweight models fail on complex reasoning.
Fix: Implement dynamic routing based on task classification. Track cost-per-task metrics and adjust routing thresholds quarterly. Maintain a fallback chain to prevent task failure during model outages.
4. Ignoring Session State Drift
Explanation: Agents operating across multiple branches or repositories can lose track of which environment they're modifying, leading to cross-contamination or broken builds.
Fix: Bind each session to a strict worktree or branch lock. Use checkpoint metadata to verify environment consistency before resuming. Implement branch isolation at the sandbox level to prevent accidental cross-repo mutations.
5. Unvalidated Tool Execution
Explanation: Agents executing terminal commands or API calls without validation can trigger destructive operations, exhaust rate limits, or cause unintended deployments.
Fix: Wrap all tool handlers in a validation layer that supports --dry-run mode. Require explicit approval for destructive actions (e.g., git push, rm -rf, production deployments). Log all tool invocations for audit trails.
6. Over-Optimizing for Benchmarks
Explanation: Chasing SWE-bench Pro or Terminal-Bench 2.0 scores without measuring real-world developer impact leads to misaligned investments.
Fix: Track operational metrics: time-to-merge, review cycle reduction, token cost per resolved issue, and developer satisfaction scores. Benchmarks measure capability; production metrics measure value.
7. Neglecting Token Budgeting
Explanation: Treating token consumption as an afterthought results in unpredictable costs and inefficient resource allocation.
Fix: Implement per-sprint token budgets with automated alerts at 70% and 90% thresholds. Route low-complexity tasks to lightweight models and reserve frontier capacity for architectural decisions. Publish cost dashboards to align engineering and finance.
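The budgeting advice in pitfall 7 comes down to a small amount of bookkeeping. Below is a hypothetical, in-memory `TokenBudget` class (not an SDK feature) that records usage and fires each configured threshold alert once per sprint. In practice, `onAlert` would post to a chat channel or dashboard, and the recorded usage would come from your model provider's usage reporting.

```typescript
// Hypothetical alert payload for a per-sprint token budget.
interface BudgetAlert {
  threshold: number; // fraction of the budget, e.g. 0.7
  usedTokens: number;
  budget: number;
}

class TokenBudget {
  private used = 0;
  private readonly fired = new Set<number>();

  constructor(
    private readonly budget: number,
    private readonly thresholds: number[] = [0.7, 0.9],
    private readonly onAlert: (alert: BudgetAlert) => void = () => {}
  ) {}

  // Record consumption and fire each threshold alert at most once per sprint.
  record(tokens: number): void {
    this.used += tokens;
    for (const t of this.thresholds) {
      if (!this.fired.has(t) && this.used >= t * this.budget) {
        this.fired.add(t);
        this.onAlert({ threshold: t, usedTokens: this.used, budget: this.budget });
      }
    }
  }

  get remaining(): number {
    return Math.max(0, this.budget - this.used);
  }

  // Start a new sprint: clear usage and re-arm every alert.
  reset(): void {
    this.used = 0;
    this.fired.clear();
  }
}

const alerts: number[] = [];
const sprint = new TokenBudget(1_000_000, [0.7, 0.9], (a) => alerts.push(a.threshold));
sprint.record(500_000); // 50% used: no alert
sprint.record(250_000); // 75% used: fires the 70% alert
sprint.record(200_000); // 95% used: fires the 90% alert
console.log(alerts, sprint.remaining); // [ 0.7, 0.9 ] 50000
```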
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Routine refactoring, formatting, or symbol renaming | Lightweight model + Composer 2 fallback | High accuracy on syntactic tasks; minimal reasoning required | ↓ 85% vs frontier |
| Complex architecture design or system migration | Frontier generalist (Claude Opus / GPT-4o) | Requires multi-artifact synthesis and trade-off analysis | ↑ 3x per task, but reduces rework |
| CI/CD pipeline automation or test generation | Composer 2 + MCP tool registry | Specialized training on terminal execution and test frameworks | ↓ 70% vs generalist |
| Multi-repository sync or cross-service debugging | Harness SDK with branch isolation + checkpointing | Prevents context drift; enables parallel session management | Neutral (infrastructure cost) |
| Emergency hotfix with strict SLA | Frontier model + dry-run validation + manual approval | Prioritizes speed and accuracy over cost; safety gates prevent regression | ↑ 4x, but mitigates downtime cost |
Configuration Template
```typescript
// agent.config.ts
import { McpToolRegistry } from '@cursor/sdk/mcp';
import { TaskRouter, SessionManager } from './core';

export const defaultConfig = {
  sandbox: {
    ephemeral: true,
    mountRepo: true,
    networkIsolation: true,
    credentialScoping: 'readonly',
    maxExecutionTime: 300 // seconds
  },
  context: {
    indexingStrategy: 'hybrid',
    maxTokens: 128000,
    pruningThreshold: 0.75,
    includeGitDiffs: true
  },
  routing: {
    lightweight: 'gpt-4o-mini',
    specialized: 'composer-2',
    frontier: 'claude-opus-4',
    fallback: 'composer-2'
  },
  hooks: {
    preExecution: 'validate-command', // validate terminal commands before they run
    postCommit: 'run-tests',          // run the test suite after each agent commit
    onDisconnect: 'auto-checkpoint'   // save session state when the connection drops
  }
} as const; // keeps literals such as 'readonly' and 'hybrid' from widening to string

export const tools = new McpToolRegistry();
tools.register({
  name: 'git-branch',
  handler: async (params) => { /* implementation */ }
});
tools.register({
  name: 'ci-trigger',
  handler: async (params) => { /* implementation */ }
});

export const agentOrchestrator = {
  create: (task: string, context = '') => TaskRouter.execute(task, context),
  resume: (sessionId: string) => SessionManager.restore(sessionId),
  save: (sessionId: string, meta: Record<string, unknown>) =>
    SessionManager.saveCheckpoint(sessionId, meta)
};
```
Quick Start Guide
- Install SDK & Dependencies: Run `npm install @cursor/sdk` (the MCP helpers are imported from its `@cursor/sdk/mcp` subpath, not installed separately) and initialize your project with TypeScript strict mode enabled.
- Configure Sandbox & Context: Copy the `defaultConfig` template (`agent.config.ts`) into your project root. Adjust `maxTokens` and `pruningThreshold` based on your repository size.
- Register Tools: Define your MCP tools in a dedicated registry file. Implement dry-run validation and credential scoping before production use.
- Initialize & Test: Call `agentOrchestrator.create('Refactor UserService to use repository pattern')` in a test script. Verify sandbox isolation, context retrieval, and model routing before scaling to team workflows.