knowledgment, background execution, and explicit status resolution.
Core Solution
Implementing the asynchronous job handle pattern requires restructuring how MCP tools are designed and how agents consume them. The solution separates tool registration into two distinct operations: task initiation and status resolution.
Step 1: Implement Task Initiation
The first tool accepts input parameters, generates a unique reference identifier, queues the work, and returns immediately. It never blocks on external I/O.
import { v4 as uuidv4 } from 'uuid';
import { z } from 'zod';
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';

// Shape of a tracked task. The Map below is a development stand-in;
// production deployments should back it with a persistent store.
interface TaskRecord {
  id: string;
  status: 'pending' | 'processing' | 'completed' | 'failed';
  payload: Record<string, unknown>;
  result?: unknown;
  error?: string;
  createdAt: number; // epoch milliseconds
}

const taskRegistry = new Map<string, TaskRecord>();

const server = new McpServer({
  name: 'async-orchestrator',
  version: '1.0.0'
});

server.tool(
  'initiate_background_task',
  'Submits a long-running operation and returns a tracking reference immediately.',
  {
    // server.tool() expects a Zod shape here, not JSON Schema literals
    operation_type: z.string().describe('Target external service or pipeline'),
    parameters: z.record(z.string(), z.unknown()).describe('Payload for the external call')
  },
  async ({ operation_type, parameters }) => {
    // 8 hex characters = 32 bits; use the full UUID when the registry may hold many tasks
    const taskId = uuidv4().slice(0, 8);
    const record: TaskRecord = {
      id: taskId,
      status: 'pending',
      payload: { operation_type, parameters },
      createdAt: Date.now()
    };
    taskRegistry.set(taskId, record);

    // Offload to background execution; deliberately not awaited
    void processTaskAsync(taskId, operation_type, parameters);

    return {
      content: [{ type: 'text', text: JSON.stringify({ task_ref: taskId, status: 'queued' }) }]
    };
  }
);
Step 2: Implement Background Execution
The heavy lifting runs outside the MCP request lifecycle. This ensures the tool call returns within milliseconds, well below the 7-second implicit threshold.
async function processTaskAsync(taskId: string, operationType: string, params: Record<string, unknown>) {
  const record = taskRegistry.get(taskId);
  if (!record) return;

  record.status = 'processing';
  taskRegistry.set(taskId, record);

  try {
    // Simulate external API call with variable latency
    const result = await executeExternalDependency(operationType, params);
    record.status = 'completed';
    record.result = result;
  } catch (err) {
    record.status = 'failed';
    record.error = err instanceof Error ? err.message : 'Unknown execution error';
  } finally {
    // Write the terminal state back to the registry
    taskRegistry.set(taskId, record);
  }
}

async function executeExternalDependency(type: string, params: Record<string, unknown>) {
  // Replace with actual HTTP client, SDK, or queue worker
  await new Promise(res => setTimeout(res, 12000)); // Simulates 12s latency
  return { data: `Processed ${type} with ${JSON.stringify(params)}` };
}
Step 3: Implement Status Resolution
Agents poll this tool until the task reaches a terminal state. The tool returns structured metadata, allowing the agent to decide whether to continue, retry, or surface an error.
// Requires: import { z } from 'zod' (alongside the other imports)
server.tool(
  'resolve_task_status',
  'Retrieves the current state and result of a previously submitted task.',
  // server.tool() expects a Zod shape here, not JSON Schema literals
  { task_ref: z.string().describe('Reference ID from initiate_background_task') },
  async ({ task_ref }) => {
    const record = taskRegistry.get(task_ref);
    if (!record) {
      // isError marks the result as a tool-level failure for the client
      return {
        isError: true,
        content: [{ type: 'text', text: JSON.stringify({ error: 'Task reference not found' }) }]
      };
    }
    const response = {
      task_ref: record.id,
      status: record.status,
      created_at: new Date(record.createdAt).toISOString(),
      result: record.result ?? null,
      error: record.error ?? null
    };
    return { content: [{ type: 'text', text: JSON.stringify(response) }] };
  }
);
Architecture Decisions & Rationale
- Separation of Initiation and Resolution: Each MCP tool should do one thing. A tool that both submits work and waits for it holds the request open for the full duration of the external call, so any call slower than the client's timeout fails.
- Immediate Return Guarantee: The initiation tool must complete in <500ms. Any I/O, validation, or queuing must be non-blocking or deferred.
- Explicit Terminal States: completed and failed are distinct. Agents need to know when to stop polling. Ambiguous states like running or active cause indefinite loops.
- Stateless Tool Design: The tool handlers themselves hold no memory. State lives in a registry outside them, which in production should be an external store. This enables horizontal scaling and prevents node-specific failures.
- Polling Over Webhooks: While webhooks reduce polling overhead, they require external routing, TLS termination, and firewall configuration. Polling is simpler, more reliable for internal agent loops, and aligns with MCP's request-response model.
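The "explicit terminal states" decision can be made mechanical with a small transition table. The following is a minimal sketch, assuming the four statuses used by TaskRecord; the names ALLOWED_TRANSITIONS, isTerminal, and transition are illustrative and not part of the MCP SDK.

```typescript
type TaskStatus = 'pending' | 'processing' | 'completed' | 'failed';

// Allowed next states for each status. Terminal states allow none.
const ALLOWED_TRANSITIONS: Record<TaskStatus, readonly TaskStatus[]> = {
  pending: ['processing', 'failed'],
  processing: ['completed', 'failed'],
  completed: [],
  failed: [],
};

// A status is terminal when nothing may follow it, so polling can stop.
function isTerminal(status: TaskStatus): boolean {
  return ALLOWED_TRANSITIONS[status].length === 0;
}

// Returns the next status, or throws if the move is not in the table.
function transition(current: TaskStatus, next: TaskStatus): TaskStatus {
  if (!ALLOWED_TRANSITIONS[current].includes(next)) {
    throw new Error(`Illegal transition: ${current} -> ${next}`);
  }
  return next;
}
```

Routing every status change through a guard like this keeps a late-finishing worker from moving a task out of failed after a sweeper has already closed it.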
Pitfall Guide
1. In-Memory State Volatility
Explanation: Storing task records in a local Map or dictionary works for development but fails in production. Server restarts, scaling events, or crashes erase all pending tasks, leaving agents with orphaned references.
Fix: Persist task state to a distributed store (Redis, DynamoDB, PostgreSQL). Implement TTL policies to auto-expire stale records after 24-48 hours.
2. Polling Storms
Explanation: Agents that poll resolve_task_status every 100ms generate excessive network traffic and can overload the MCP server or trip its rate limits. This degrades performance for all concurrent workflows.
Fix: Implement exponential backoff on the agent side. Start at a 1s interval, double it after each poll, and cap it at 30s. Most external APIs complete within 15-60 seconds; aggressive polling yields diminishing returns.
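The backoff schedule is easy to get subtly wrong, so here is a minimal sketch of it, assuming a 1s start, a doubling multiplier, and a 30s cap. The function and option names are illustrative.

```typescript
interface BackoffOptions {
  initialMs: number;   // delay before the second poll
  multiplier: number;  // growth factor applied after each poll
  maxMs: number;       // upper bound on any single delay
}

// Delay to wait after the given zero-based attempt.
function backoffDelayMs(attempt: number, opts: BackoffOptions): number {
  const raw = opts.initialMs * Math.pow(opts.multiplier, attempt);
  return Math.min(raw, opts.maxMs);
}

// Full schedule for a fixed number of attempts.
function backoffSchedule(attempts: number, opts: BackoffOptions): number[] {
  return Array.from({ length: attempts }, (_, i) => backoffDelayMs(i, opts));
}

const schedule = backoffSchedule(7, { initialMs: 1000, multiplier: 2, maxMs: 30000 });
console.log(schedule); // [1000, 2000, 4000, 8000, 16000, 30000, 30000]
```

With these values the cap is reached on the sixth delay, so a task that takes one minute costs roughly seven status calls instead of several hundred at a fixed 100ms interval.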
3. Orphaned Background Jobs
Explanation: If the background execution function crashes or the process terminates, tasks remain stuck in processing indefinitely. Agents poll forever, consuming tokens and blocking user sessions.
Fix: Add a heartbeat mechanism or maximum execution timeout. A background sweeper should mark tasks as failed if they exceed their SLA (e.g., 5 minutes). Log alerts for operational visibility.
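One way to implement the sweeper is a function that scans the registry and fails anything that has been active for too long. This is a sketch under the assumption that task age is measured from createdAt; a heartbeat-based variant would compare against a last-updated timestamp instead. SweepableTask and sweepStaleTasks are hypothetical names.

```typescript
interface SweepableTask {
  status: 'pending' | 'processing' | 'completed' | 'failed';
  createdAt: number; // epoch milliseconds
  error?: string;
}

// Marks tasks that are still active after maxAgeMs as failed.
// Returns the IDs that were swept so the caller can log or alert on them.
function sweepStaleTasks(
  registry: Map<string, SweepableTask>,
  maxAgeMs: number,
  now: number = Date.now()
): string[] {
  const swept: string[] = [];
  registry.forEach((task, id) => {
    const isActive = task.status === 'pending' || task.status === 'processing';
    if (isActive && now - task.createdAt > maxAgeMs) {
      task.status = 'failed';
      task.error = `Exceeded maximum execution time of ${maxAgeMs} ms`;
      swept.push(id);
    }
  });
  return swept;
}

// Possible wiring (not executed here): run every 30s with a 5-minute SLA.
// setInterval(() => sweepStaleTasks(taskRegistry, 5 * 60 * 1000), 30 * 1000);
```

Passing now as a parameter keeps the function deterministic in tests. Tasks already in a terminal state are left untouched regardless of age.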
4. Ignoring Terminal Failure States
Explanation: Returning null or empty strings on failure forces agents to guess whether the task is still running or has errored. This breaks deterministic workflow logic.
Fix: Always return a structured payload with explicit status and error fields. Agents should branch logic based on status === 'failed' rather than parsing text responses.
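On the consuming side, branching on the status field might look like the sketch below. It assumes the JSON payload shape returned by resolve_task_status above, including the status-less payload returned for an unknown reference; NextAction and decideNextAction are illustrative names.

```typescript
type NextAction =
  | { kind: 'keep_polling' }
  | { kind: 'done'; result: unknown }
  | { kind: 'abort'; reason: string };

// Maps the raw text of a status response to a deterministic next step.
function decideNextAction(rawText: string): NextAction {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawText);
  } catch {
    return { kind: 'abort', reason: 'Status response was not valid JSON' };
  }
  if (typeof parsed !== 'object' || parsed === null) {
    return { kind: 'abort', reason: 'Status response was not a JSON object' };
  }
  const payload = parsed as { status?: unknown; result?: unknown; error?: unknown };
  const errorText = typeof payload.error === 'string' ? payload.error : undefined;

  switch (payload.status) {
    case 'pending':
    case 'processing':
      return { kind: 'keep_polling' };
    case 'completed':
      return { kind: 'done', result: payload.result ?? null };
    case 'failed':
      return { kind: 'abort', reason: errorText ?? 'Task failed without an error message' };
    default:
      // Covers the "Task reference not found" payload, which has no status field
      return { kind: 'abort', reason: errorText ?? 'Unrecognized status payload' };
  }
}
```

Anything the function does not recognize results in an abort rather than another poll, which avoids the indefinite loops that ambiguous responses cause.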
5. Hardcoded Timeout Assumptions
Explanation: Assuming all external calls will finish within 10 seconds leads to brittle designs. Batch jobs, ML inference pipelines, and third-party rate limits vary wildly.
Fix: Make SLAs configurable per operation type. Pass max_wait_seconds in the initiation payload and enforce it in the background worker. Return structured timeout errors when exceeded.
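A sketch of per-operation timeouts follows. The SLA table, its operation names, and the 120-second default are assumptions made for illustration, and clamping the caller's max_wait_seconds to the per-operation ceiling is one possible policy, not the only one.

```typescript
// Hypothetical per-operation ceilings, in seconds.
const OPERATION_SLA_SECONDS: Record<string, number> = {
  quick_lookup: 10,
  batch_export: 600,
};
const DEFAULT_SLA_SECONDS = 120;

// Picks the effective timeout: the caller's request, clamped to the ceiling.
function resolveTimeoutSeconds(operationType: string, maxWaitSeconds?: number): number {
  const ceiling = OPERATION_SLA_SECONDS[operationType] ?? DEFAULT_SLA_SECONDS;
  if (maxWaitSeconds === undefined || !Number.isFinite(maxWaitSeconds) || maxWaitSeconds <= 0) {
    return ceiling;
  }
  return Math.min(maxWaitSeconds, ceiling);
}

class TaskTimeoutError extends Error {
  constructor(public readonly timeoutSeconds: number) {
    super(`Task exceeded its timeout of ${timeoutSeconds}s`);
    this.name = 'TaskTimeoutError';
  }
}

// Rejects with TaskTimeoutError if the work does not settle in time.
// The underlying work is not cancelled; only the wait is abandoned.
function withTimeout<T>(work: Promise<T>, timeoutSeconds: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TaskTimeoutError(timeoutSeconds)), timeoutSeconds * 1000);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

In the background worker, the external call could be wrapped as withTimeout(executeExternalDependency(type, params), resolveTimeoutSeconds(type, maxWait)), with the existing catch block recording the timeout message in the task's error field.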
6. Lack of Idempotency
Explanation: Network retries or agent loops may call initiate_background_task multiple times for the same logical operation, spawning duplicate jobs and wasting compute.
Fix: Accept an idempotency_key in the initiation payload. Check the registry before creating a new record. If the key exists, return the existing task_ref instead of spawning a duplicate.
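The check-before-create step can be sketched as follows. The idempotencyIndex map and initiateOnce helper are hypothetical, and the ID generator is a stand-in for whatever the initiation tool uses.

```typescript
interface InitiatedTask {
  taskRef: string;
  created: boolean; // false when an existing task was reused
}

// Maps idempotency keys to the task reference created for them.
const idempotencyIndex = new Map<string, string>();

// Creates a task only if the key has not been seen before.
function initiateOnce(
  idempotencyKey: string | undefined,
  createTask: () => string
): InitiatedTask {
  if (idempotencyKey !== undefined) {
    const existing = idempotencyIndex.get(idempotencyKey);
    if (existing !== undefined) {
      return { taskRef: existing, created: false };
    }
  }
  const taskRef = createTask();
  if (idempotencyKey !== undefined) {
    idempotencyIndex.set(idempotencyKey, taskRef);
  }
  return { taskRef, created: true };
}

// Stand-in ID generator for the example.
let taskCounter = 0;
const nextTaskRef = () => `task-${++taskCounter}`;

const first = initiateOnce('order-42-export', nextTaskRef);
const retry = initiateOnce('order-42-export', nextTaskRef);
console.log(first.taskRef === retry.taskRef, retry.created); // true false
```

With a shared store, the lookup and the write need to happen atomically (for example, a set-if-absent operation), otherwise two concurrent retries can both pass the check.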
7. Overloading the LLM with Polling Logic
Explanation: Prompting the LLM to manually write polling loops wastes context tokens and introduces non-deterministic behavior. The model may forget to poll, poll too aggressively, or misinterpret status strings.
Fix: Keep the polling loop in orchestrator code rather than in the prompt, built on the tool-calling primitives your agent framework provides (e.g., Strands MCPClient, LangGraph ToolNode). The tool should only return status; the orchestrator manages the retry loop.
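An orchestrator-side loop could look like the sketch below. It is framework-agnostic: fetchStatus is a hypothetical callback that wraps a call to resolve_task_status and parses the JSON, and sleep is injectable so the loop can be tested without real delays.

```typescript
type PollStatus = 'pending' | 'processing' | 'completed' | 'failed';

interface StatusSnapshot {
  status: PollStatus;
  result: unknown;
  error: string | null;
}

interface PollOptions {
  initialIntervalMs: number;
  maxIntervalMs: number;
  backoffMultiplier: number;
  maxAttempts: number;
  sleep?: (ms: number) => Promise<void>; // override in tests
}

const defaultSleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until a terminal status arrives or maxAttempts is exhausted.
async function pollUntilTerminal(
  fetchStatus: () => Promise<StatusSnapshot>,
  opts: PollOptions
): Promise<StatusSnapshot> {
  const sleep = opts.sleep ?? defaultSleep;
  let delay = opts.initialIntervalMs;

  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    const snapshot = await fetchStatus();
    if (snapshot.status === 'completed' || snapshot.status === 'failed') {
      return snapshot;
    }
    if (attempt < opts.maxAttempts) {
      await sleep(delay);
      delay = Math.min(delay * opts.backoffMultiplier, opts.maxIntervalMs);
    }
  }
  // Giving up is reported as a failure so callers branch on one field.
  return { status: 'failed', result: null, error: `Gave up after ${opts.maxAttempts} polls` };
}
```

The model sees only the final snapshot, so intermediate pending responses never enter its context window.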
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Fast lookup (<2s) | Synchronous direct call | Minimal latency, simpler code, no state management overhead | Lowest (no queue/registry) |
| Unpredictable external API (2-30s) | Async HandleId | Prevents agent freeze, enables graceful degradation, scales horizontally | Medium (state store + background workers) |
| Batch processing / ML inference (>30s) | Async HandleId + Webhook callback | Polling becomes inefficient; webhooks push results when ready | Higher (infrastructure + routing) |
| Critical path with strict SLA | Async HandleId + Circuit Breaker | Fails fast on degradation, preserves agent stability | Medium (monitoring + fallback logic) |
Configuration Template
// mcp-async-config.ts
export const MCP_ASYNC_CONFIG = {
server: {
name: 'production-async-tools',
version: '2.1.0',
transport: 'stdio' // or 'sse' for remote
},
state: {
provider: 'redis',
ttlSeconds: 86400,
keyPrefix: 'mcp_task:'
},
execution: {
maxConcurrency: 50,
defaultTimeoutSeconds: 120,
retryAttempts: 0 // Handled by agent orchestrator
},
polling: {
initialIntervalMs: 1000,
maxIntervalMs: 30000,
backoffMultiplier: 2.0,
maxAttempts: 20
},
observability: {
metricsPrefix: 'mcp_async',
logLevel: 'info',
alertOnFailureRate: 0.05 // 5%
}
};
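The polling and execution settings interact: if the agent stops polling before the execution timeout elapses, slow but healthy tasks are abandoned. The following sketch checks that relationship, assuming the agent sleeps between polls and never after the last one. PollingConfig, pollingBudgetMs, and validateAsyncConfig are illustrative names.

```typescript
interface PollingConfig {
  initialIntervalMs: number;
  maxIntervalMs: number;
  backoffMultiplier: number;
  maxAttempts: number;
}

// Total time spent waiting if every poll returns a non-terminal status.
// maxAttempts polls imply maxAttempts - 1 sleeps between them.
function pollingBudgetMs(p: PollingConfig): number {
  let total = 0;
  let delay = p.initialIntervalMs;
  for (let i = 0; i < p.maxAttempts - 1; i++) {
    total += delay;
    delay = Math.min(delay * p.backoffMultiplier, p.maxIntervalMs);
  }
  return total;
}

// Returns a list of human-readable problems; empty means the config is consistent.
function validateAsyncConfig(polling: PollingConfig, defaultTimeoutSeconds: number): string[] {
  const problems: string[] = [];
  if (polling.initialIntervalMs <= 0) problems.push('initialIntervalMs must be positive');
  if (polling.maxIntervalMs < polling.initialIntervalMs) problems.push('maxIntervalMs must be >= initialIntervalMs');
  if (polling.backoffMultiplier < 1) problems.push('backoffMultiplier must be >= 1');
  if (polling.maxAttempts < 1) problems.push('maxAttempts must be >= 1');
  if (pollingBudgetMs(polling) < defaultTimeoutSeconds * 1000) {
    problems.push('polling gives up before the default execution timeout elapses');
  }
  return problems;
}

const templatePolling: PollingConfig = {
  initialIntervalMs: 1000,
  maxIntervalMs: 30000,
  backoffMultiplier: 2.0,
  maxAttempts: 20,
};
console.log(pollingBudgetMs(templatePolling)); // 451000
console.log(validateAsyncConfig(templatePolling, 120)); // []
```

With the template values the agent waits up to about 7.5 minutes in total, comfortably longer than the 120-second default execution timeout.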
Quick Start Guide
- Initialize the MCP Server: Install @modelcontextprotocol/sdk, zod, uuid, and redis. Configure the state provider in MCP_ASYNC_CONFIG. Start the server process locally or deploy to a container.
- Register Tools: Register initiate_background_task and resolve_task_status with server.tool(). Ensure the initiation tool returns within the 500ms budget.
- Configure Agent Polling: In your agent framework (Strands, LangGraph, etc.), set the initial polling interval to 1s with exponential backoff. Map status === 'completed' to workflow continuation and status === 'failed' to error handling.
- Validate with Synthetic Load: Run a load test simulating 50 concurrent tasks with 10-15s latency. Verify no 424 errors, state persistence across restarts, and correct backoff behavior.
- Deploy with Observability: Enable metrics collection for queue depth, task duration, and failure rates. Set alerts for failure rate >5% or average duration >SLA. Monitor agent token consumption to confirm context window efficiency gains.