Decoupling Agent CLIs from Inference Engines: A Responses API Gateway Pattern

Current Situation Analysis

The modern developer toolchain is rapidly converging on a single protocol: OpenAI's Responses API. Originally introduced as an extension to Chat Completions, it has become the de facto standard for agentic workflows, offering native support for tool execution, reasoning traces, and structured output streaming. However, this standardization comes with a hidden cost: protocol-level vendor lock-in.

OpenAI's Codex CLI exemplifies this shift. As of early 2026, the tool dropped Chat Completion support entirely, routing all interactions through the Responses API. The WireApi enum in the underlying codebase now contains a single variant: Responses. This architectural decision means that any developer wishing to use the Codex UX with alternative models—whether self-hosted Llama instances, Anthropic's Claude, or Google's Gemini—faces a hard boundary. The CLI does not speak the older protocol, and it does not natively route to non-OpenAI inference endpoints.

This problem is frequently misunderstood. Many teams assume that because a CLI is branded by a specific vendor, it is cryptographically or architecturally bound to that vendor's models. In reality, the Responses API is a transparent HTTP/SSE specification. The lock-in is purely contractual and implementation-level, not cryptographic. Developers overlook the fact that the protocol is just a structured stream of JSON events. If you can stand up a lightweight gateway that accepts Responses API payloads, translates them to a vendor-neutral chat abstraction, and streams the results back in the exact SSE format the CLI expects, the inference engine becomes completely interchangeable.

The industry pain point is clear: powerful agent UXs are trapped behind proprietary wire formats. The solution is not to rewrite the CLI, but to build a protocol translation layer that sits between the tool and the model provider.

WOW Moment: Key Findings

The critical insight is that the Responses API is a stateless streaming contract, not a proprietary black box. By implementing a thin gateway that maps POST /v1/responses to a vendor-agnostic IChatClient interface, you decouple the execution environment from the inference provider.

Approach	Model Flexibility	Protocol Compliance	Setup Overhead	Cost Efficiency
Native OpenAI Endpoint	OpenAI only	Full	Zero	High (vendor pricing)
Chat Completions Wrapper	High	Broken (Codex dropped support)	Medium	Variable
Responses API Gateway	Unlimited (Claude, Gemini, Llama, etc.)	Full	Low	Optimized (provider-agnostic)

This finding matters because it transforms a vendor-specific CLI into a generic agent orchestrator. You retain the sophisticated tool loop (shell, apply_patch, plan tracking) while routing inference to the most cost-effective or capable model available. The gateway acts as a protocol adapter, preserving the UX while eliminating inference lock-in.

Core Solution

Building a Responses API gateway requires three architectural components: an HTTP router that accepts the Responses payload, a state manager for conversation chaining, and an SSE emitter that reconstructs the exact event sequence the CLI expects. We will use Microsoft.Extensions.AI as the vendor-neutral abstraction layer, allowing any backend to plug into the same routing logic.

Step 1: Define the Gateway Router

The gateway exposes a single endpoint: POST /v1/responses. It must also serve a model catalog to prevent metadata fallback warnings. We'll use ASP.NET Core Minimal APIs for lightweight routing.

using Microsoft.Extensions.AI;
using System.Text.Json;
using System.Text.Json.Serialization;

var builder = WebApplication.CreateBuilder();
builder.Services.AddLogging();

var app = builder.Build();

// In-memory state for turn chaining
var conversationStore = new Dictionary<string, IList<ChatMessage>>();

app.MapPost("/v1/responses", async (HttpRequest request, IChatClient chatClient) =>
{
    var payload = await JsonSerializer.DeserializeAsync<ResponsesPayload>(request.Body);
    if (payload is null) return Results.BadRequest("Invalid payload");

    // Reconstruct conversation history if previous_response_id is provided
    var history = new List<ChatMessage>();
    if (!string.IsNullOrEmpty(payload.PreviousResponseId) && 
        conversationStore.TryGetValue(payload.PreviousResponseId, out var storedHistory))
    {
        history.AddRange(storedHistory);
    }

    // Map incoming messages to IChatMessage format
    foreach (var msg in payload.Input)
    {
        history.Add(new ChatMessage((ChatRole)msg.Role, msg.Content));
    }

    // Register Codex tools as passthrough schemas
    var toolDefinitions = payload.Tools.Select(t => 
        new AITool(t.Type, t.Function.Name, t.Function.Description, t.Function.Parameters)).ToList();

    // Generate unique response ID for chaining
    var responseId = Guid.NewGuid().ToString("N");
    conversationStore[responseId] = history;

    // Stream response back to client
    await StreamResponsesAsync(chatClient, history, toolDefinitions, responseId, request.HttpContext.Response);
    
    return Results.Ok();
});

app.MapGet("/v1/models", () => Results.Json(ModelCatalogRegistry.GetCatalog()));

app.Run("http://localhost:8080");

Step 2: Implement SSE Streaming Translation

The Responses API expects a strict sequence of Server-Sent Events. The gateway must translate ChatResponseUpdate objects from IChatClient into the exact event types Codex parses.

async Task StreamResponsesAsync(IChatClient client, IList<ChatMessage> history, IList<AITool> tools, string responseId, HttpResponse response)
{
    response.ContentType = "text/event-stream";
    response.Headers.CacheControl = "no-cache";

    // 1. Emit creation event
    await EmitSse(response, "response.created", new { id = responseId, status = "created" });

    // 2. Emit in-progress event
    await EmitSse(response, "response.in_progress", new { id = responseId });

    // 3. Stream output items
    var outputItemId = Guid.NewGuid().ToString("N");
    await EmitSse(response, "response.output_item.added", new { output_index = 0, item = new { id = outputItemId, type = "message" } });

    var fullResponse = new List<string>();
    
    await foreach (var update in client.GetStreamingResponseAsync(history, new() { Tools = tools }))
    {
        if (update.Contents.OfType<TextContent>().FirstOrDefault() is { } text)
        {
            await EmitSse(response, "response.output_text.delta", new { delta = text.Text });
            fullResponse.Add(text.Text);
        }
        
        if (update.Contents.OfType<FunctionCallContent>().FirstOrDefault() is { } funcCall)
        {
            await EmitSse(response, "response.function_call_arguments.delta", new { 
                call_id = funcCall.CallId, 
                arguments = funcCall.Arguments?.ToString() 
            });
        }
    }

    // 4. Emit completion events
    await EmitSse(response, "response.output_item.done", new { output_index = 0, item = new { id = outputItemId, type = "message" } });
    await EmitSse(response, "response.completed", new { id = responseId, status = "completed" });
    await EmitSse(response, "[DONE]", new { });
}

async Task EmitSse(HttpResponse response, string eventType, object data)
{
    var json = JsonSerializer.Serialize(data);
    await response.WriteAsync($"event: {eventType}\ndata: {json}\n\n");
    await response.Body.FlushAsync();
}

Step 3: Configure the Vendor-Neutral Client

The gateway remains agnostic to the underlying model. You inject any IChatClient implementation at startup. For OpenRouter, Anthropic, or local Ollama instances, the routing logic remains identical.

// Example: OpenRouter configuration
var openRouterOptions = new ChatClientOptions 
{ 
    Endpoint = new Uri("https://openrouter.ai/api/v1"),
    ModelId = Environment.GetEnvironmentVariable("INFERENCE_MODEL") ?? "anthropic/claude-3.5-sonnet"
};

var chatClient = new OpenAI.Chat.ChatClient(
    openRouterOptions.ModelId, 
    new System.ClientModel.ApiKeyCredential(Environment.GetEnvironmentVariable("INFERENCE_API_KEY")),
    new OpenAI.OpenAIClientOptions { Endpoint = openRouterOptions.Endpoint })
    .AsIChatClient();

builder.Services.AddSingleton<IChatClient>(chatClient);

Architecture Rationale

IChatClient Abstraction: Decouples protocol translation from inference logic. Switching providers requires zero changes to the SSE streaming or routing code.
Passthrough Tool Schemas: Codex expects to execute tools locally. The gateway forwards tool definitions as raw JSON schemas without implementing handlers, allowing the CLI to manage execution while the model generates the calls.
Bounded State Dictionary: previous_response_id chaining is handled via an in-memory dictionary. This avoids database overhead while maintaining turn continuity. In production, this should be swapped for a distributed cache (Redis) if scaling horizontally.
Strict SSE State Machine: The event sequence (created → in_progress → output_item.added → delta → completed) mirrors the exact parsing expectations of Rust-based CLI clients. Deviating from this order causes silent deserialization failures.

Pitfall Guide

1. UTF-8 BOM Corruption in JSON Catalogs

Explanation: .NET's default Encoding.UTF8 emits a Byte Order Mark (EF BB BF). Strict JSON parsers (like Rust's serde_json, which Codex uses) reject BOM-prefixed files per RFC 8259. Fix: Always serialize with new UTF8Encoding(encoderShouldEmitUTF8Identifier: false). Validate generated files with jq . catalog.json before deployment.

2. Context Window Mismatch & Silent Truncation

Explanation: Declaring a context_window larger than the actual model supports causes the CLI to send oversized prompts. The inference provider silently truncates or rejects them, degrading output quality without explicit errors. Fix: Dynamically fetch model metadata from the provider's API or maintain a versioned registry. Never hardcode limits. Implement a pre-flight validation that compares requested token counts against the declared window.

3. Ignoring `previous_response_id` State Management

Explanation: The Responses API relies on previous_response_id to chain turns without resending full history. Dropping this ID breaks conversation continuity, forcing the model to lose context. Fix: Implement a bounded LRU cache for conversation history. Map previous_response_id to the accumulated IList<ChatMessage>. Evict entries older than a configurable TTL to prevent memory leaks.

4. Tool Schema Serialization Drift

Explanation: Codex expects strict JSON Schema drafts (typically draft-07). Minor deviations in type definitions, required fields, or additionalProperties flags cause the CLI to reject tool calls. Fix: Use a schema validation library to verify tool definitions before transmission. Log schema diffs during development. Never manually construct JSON schema strings; use strongly-typed builders.

5. SSE Event Ordering Violations

Explanation: The CLI parses events sequentially. Emitting response.completed before response.output_item.done, or interleaving deltas incorrectly, breaks the state machine. Fix: Implement a strict event emitter wrapper that enforces state transitions. Use a queue-based approach to guarantee FIFO delivery. Add integration tests that replay exact event sequences against a mock parser.

6. Catalog File Replacement vs. Extension

Explanation: The model_catalog_json configuration key replaces the CLI's bundled catalog entirely. Omitting built-in models (like gpt-5-codex) breaks fallback behavior for users who expect them. Fix: Merge your custom entries with the official catalog JSON before writing to disk. Maintain a base template and inject custom slugs programmatically. Document this behavior clearly for team adoption.

7. Environment Variable Scope Leakage

Explanation: Storing API keys in global shell environments risks accidental exposure to child processes or CI pipelines. It also complicates multi-tenant setups. Fix: Use scoped process environments or .env files loaded at gateway startup. Implement a configuration validator that fails fast if required keys are missing. Rotate keys via secret management tools (HashiCorp Vault, AWS Secrets Manager) in production.

Production Bundle

Action Checklist

Validate JSON catalog files with strict parsers before deployment
Implement bounded LRU cache for previous_response_id state management
Enforce strict SSE event ordering via a state machine wrapper
Dynamically resolve context windows from provider metadata APIs
Merge custom model entries with official catalog to preserve fallbacks
Scope API keys to process environments; avoid global shell exports
Add schema validation middleware for all tool definitions
Configure connection pooling and retry logic for streaming endpoints

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Local Development / Prototyping	In-memory dictionary + file-based catalog	Zero infrastructure overhead; fast iteration	None
Multi-Model Routing (Claude/Gemini/Llama)	OpenRouter or unified aggregator gateway	Single API key; automatic fallback routing	Moderate (aggregator markup)
High-Throughput CI/CD Pipelines	Distributed cache (Redis) + stateless gateway	Horizontal scaling; shared conversation state	Low (infrastructure cost)
Strict Compliance / Air-Gapped	Self-hosted Ollama + local catalog	No external network calls; full data control	High (hardware/ops)

Configuration Template

gateway-config.toml

model = "custom-aggregator-slug"
model_provider = "local-gateway"
model_catalog_json = "./catalog.json"

[model_providers.local-gateway]
name = "Vendor-Neutral Responses Gateway"
base_url = "http://localhost:8080/v1"
wire_api = "responses"
env_key = "GATEWAY_AUTH_TOKEN"
stream_idle_timeout_ms = 300000

catalog.json

{
  "models": [
    {
      "slug": "custom-aggregator-slug",
      "display_name": "Aggregator (Claude/Gemini/Llama)",
      "description": "Vendor-neutral inference gateway",
      "supported_reasoning_levels": [],
      "shell_type": "default",
      "visibility": "list",
      "supported_in_api": true,
      "priority": 50,
      "availability_nux": null,
      "upgrade": null,
      "base_instructions": "",
      "supports_reasoning_summaries": false,
      "support_verbosity": false,
      "default_verbosity": null,
      "apply_patch_tool_type": "freeform",
      "truncation_policy": { "mode": "tokens", "limit": 8192 },
      "supports_parallel_tool_calls": true,
      "context_window": 200000,
      "max_context_window": 200000,
      "auto_compact_token_limit": 180000,
      "effective_context_window_percent": 95,
      "experimental_supported_tools": []
    }
  ]
}

Quick Start Guide

Initialize the Gateway: Create a new ASP.NET Core minimal API project. Install Microsoft.Extensions.AI and your preferred provider SDK (e.g., OpenAI, Anthropic, or Ollama).
Configure Environment Variables: Set INFERENCE_API_KEY and INFERENCE_MODEL in your shell or .env file. Ensure the model slug matches your provider's routing format.
Generate Catalog & Config: Run the gateway startup routine to emit catalog.json and gateway-config.toml into a dedicated directory. Verify JSON validity with jq . catalog.json.
Launch & Connect: Start the gateway (dotnet run). In a separate terminal, set CODEX_HOME to your config directory and GATEWAY_AUTH_TOKEN to any non-empty string. Execute codex to begin routing through the gateway.

Run OpenAI Codex CLI on Claude, Gemini, or Llama — in 50 lines of C#