` tag convention. The model wraps generated HTML inside this tag, and the client's streaming parser intercepts it to render live previews. This prevents raw code from polluting the conversation stream and enables tabbed file management.
3. Server-Side Tool Loop: Local models typically lack native tool execution. The proxy intercepts tool call requests, executes them server-side (file reads, web search, subtask delegation), and feeds results back into the context. This transforms a stateless completion model into a stateful agent without modifying the inference engine.
4. Containerized Deployment: The client runs in Docker to isolate dependencies and guarantee consistent rendering environments. The proxy runs as a Node.js service to leverage its native streaming and event-loop capabilities. This separation ensures that GPU memory pressure from inference doesn't impact the UI thread or proxy routing logic.
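The server-side tool loop from item 3 can be sketched roughly as follows. This is a minimal illustration, not the proxy's actual code: the `ToolCall`, `ModelTurn`, and `Message` shapes, the `runToolLoop` name, and the `maxSteps` cap are all assumptions made for the example.

```typescript
// Hypothetical shapes; a real proxy would map these onto its provider's wire format.
type ToolCall = { name: string; args: Record<string, unknown> };
type ModelTurn = { text: string; toolCalls: ToolCall[] };
type Message = { role: 'user' | 'assistant' | 'tool'; content: string };
type Tool = (args: Record<string, unknown>) => Promise<string>;

async function runToolLoop(
  generate: (history: Message[]) => Promise<ModelTurn>,
  tools: Record<string, Tool>,
  history: Message[],
  maxSteps = 8, // assumed cap so a confused model cannot loop forever
): Promise<string> {
  for (let step = 0; step < maxSteps; step++) {
    const turn = await generate(history);
    history.push({ role: 'assistant', content: turn.text });
    if (turn.toolCalls.length === 0) return turn.text; // no tool calls: final answer

    for (const call of turn.toolCalls) {
      const tool = tools[call.name];
      // Unknown tools are reported back to the model instead of throwing.
      const result = tool ? await tool(call.args) : `error: unknown tool "${call.name}"`;
      history.push({ role: 'tool', content: `${call.name}: ${result}` });
    }
  }
  throw new Error(`tool loop exceeded ${maxSteps} steps`);
}
```

The model stays stateless; the loop holds the state by appending each tool result to the history before the next call.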
Implementation Walkthrough
Step 1: Configure the Intelligent Routing Proxy
The proxy translates incoming requests into the target model's expected format, manages conversation history, and enforces routing rules. Below is a TypeScript configuration that demonstrates how to wire up routing logic, token budgeting, and artifact-aware streaming.
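import { createServer, IncomingMessage } from 'http';
import { Router } from './routing-engine';
import { ContextManager } from './context-compressor';
import { ArtifactStream } from './stream-parser';

const PROXY_PORT = 8081;
const MAX_CONTEXT_TOKENS = 128000;
const COMPRESSION_THRESHOLD = 0.85;

const router = new Router({
  providers: {
    local: { endpoint: 'http://localhost:11434', type: 'ollama' },
    fallback: { endpoint: 'https://api.anthropic.com', type: 'anthropic' }
  },
  routingRules: {
    simple: 'local',
    complex: 'fallback',
    agentic: 'local' // proxy handles tool loop
  }
});

const contextMgr = new ContextManager(MAX_CONTEXT_TOKENS, COMPRESSION_THRESHOLD);

// Collect the request body and parse it as JSON.
async function parseRequestBody(req: IncomingMessage): Promise<any> {
  const chunks: Buffer[] = [];
  for await (const chunk of req) chunks.push(chunk as Buffer);
  return JSON.parse(Buffer.concat(chunks).toString('utf8'));
}

const server = createServer(async (req, res) => {
  // Health probe for connectivity checks against /health.
  if (req.url === '/health') {
    res.writeHead(200).end('ok');
    return;
  }
  if (req.url !== '/v1/messages' || req.method !== 'POST') {
    // Answer unmatched requests so they do not hang until the client times out.
    res.writeHead(404).end();
    return;
  }
  try {
    const payload = await parseRequestBody(req);
    // Compress history before overflow
    const optimizedContext = contextMgr.compress(payload.messages);
    // Route based on complexity analysis
    const targetProvider = router.analyzeAndRoute(optimizedContext);
    // Stream response with artifact interception
    const stream = new ArtifactStream(res);
    await targetProvider.generate(optimizedContext, stream);
    res.end();
  } catch (err) {
    // Malformed JSON or a provider failure must not leave the socket open.
    if (!res.headersSent) res.writeHead(500);
    res.end();
  }
});

server.listen(PROXY_PORT, () => {
  console.log(`Routing proxy active on port ${PROXY_PORT}`);
});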
Why this structure? Separating routing logic from context management prevents monolithic request handlers. The compression threshold triggers automatic summarization of older turns, preserving recent design iterations while staying within window limits. The artifact stream wrapper ensures HTML blocks are cleanly extracted before reaching the client.
Step 2: Deploy the Design Client
The client handles project workspaces, design system binding, and live artifact rendering. It expects the proxy to return Anthropic-formatted SSE streams.
# docker-compose.design.yml
version: '3.9'
services:
  design-studio:
    image: nexuio/open-design:latest
    container_name: design-pipeline
    ports:
      - "7456:7456"
    environment:
      - PROXY_ENDPOINT=http://host.docker.internal:8081
      - ARTIFACT_PARSER_MODE=strict
      - DESIGN_SYSTEM_PATH=/workspace/design-tokens.md
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - ./projects:/app/projects
      - ./design-systems:/app/design-systems
    restart: unless-stopped
Why Docker + extra_hosts? The client requires network access to the host machine where the proxy runs. Docker Desktop defines host.docker.internal automatically, but on Linux the name resolves only if it is mapped explicitly, which is what the extra_hosts entry with host-gateway does. Volume mounts persist project state and design system definitions across container restarts.
Step 3: Wire the Artifact Contract
The client's streaming parser watches for the <artifact> tag. When detected, it creates a file entry, streams inner HTML to an iframe, and saves the artifact upon closure.
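interface Artifact { id: string; type: string; title: string; content: string; status: 'complete' }
interface ParsedOutput { chat: string; artifacts: Artifact[] }

class ArtifactParser {
  private buffer = '';

  processChunk(rawChunk: string): ParsedOutput {
    this.buffer += rawChunk;
    const outputs: ParsedOutput = { chat: '', artifacts: [] };
    const artifactRegex = /<artifact\s+identifier="([^"]+)"\s+type="([^"]+)"\s+title="([^"]+)">([\s\S]*?)<\/artifact>/g;

    // Extract completed artifacts in one pass. Using replace() avoids the stale
    // lastIndex that results from mutating the buffer inside an exec() loop.
    this.buffer = this.buffer.replace(artifactRegex, (_m, id, type, title, content) => {
      outputs.artifacts.push({ id, type, title, content, status: 'complete' });
      return '';
    });

    // Hold back an unfinished artifact so raw markup never reaches the chat pane.
    // chat is cumulative: it contains all conversation text seen so far.
    const openIdx = this.buffer.indexOf('<artifact');
    outputs.chat = (openIdx === -1 ? this.buffer : this.buffer.slice(0, openIdx)).trim();
    return outputs;
  }
}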
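Why regex-based streaming? A DOM parser is safer for final output, but regex extraction is cheap enough to run on every incoming chunk. The parser strips artifact blocks from the chat stream, keeping the conversation clean while routing HTML to the preview pane. Note that this simplified version emits an artifact only once its closing tag has arrived, and that the pattern requires the identifier, type, and title attributes in exactly that order. Strict mode ensures malformed tags don't crash the renderer.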
Pitfall Guide
1. Docker Host Resolution Failure
Explanation: Inside a container, localhost refers to the container itself, so the client cannot use it to reach a proxy running on the host machine. This breaks the API mode configuration.
Fix: Always use host.docker.internal in Docker Compose or --add-host=host.docker.internal:host-gateway in docker run. Verify connectivity with curl http://host.docker.internal:8081/health from inside the container.
2. ANSI Escape Code Leakage in HTML Streams
Explanation: Local models sometimes emit terminal formatting codes (e.g., \x1b[32m) into generated HTML, corrupting CSS and breaking iframe rendering.
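Fix: Implement a sanitization middleware in the proxy that strips escape sequences before streaming chunks to the client. Remove complete ANSI sequences first, for example with /\x1b\[[0-?]*[ -\/]*[@-~]/g, then remove leftover control characters with /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/g. The broader range /[\x00-\x1F\x7F-\x9F]/g is not a safe substitute on its own: it deletes only the ESC byte, leaving fragments such as [32m in the markup, and it also removes the tabs and newlines that HTML formatting relies on.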
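A minimal sketch of such a sanitizer, assuming chunks arrive as decoded strings. The names `ANSI_CSI`, `CONTROL_CHARS`, and `sanitizeChunk` are illustrative, not part of any existing proxy API.

```typescript
// CSI sequences: ESC [ parameters, intermediates, final byte (covers colors and cursor moves).
const ANSI_CSI = /\x1b\[[0-?]*[ -\/]*[@-~]/g;
// Remaining C0/C1 control characters, except tab (\x09), LF (\x0A), and CR (\x0D).
const CONTROL_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/g;

function sanitizeChunk(chunk: string): string {
  // Order matters: strip whole sequences before stray control bytes,
  // otherwise the ESC is removed and "[32m" is left behind.
  return chunk.replace(ANSI_CSI, '').replace(CONTROL_CHARS, '');
}
```

One limitation: an escape sequence split across two chunks is not recognized as a whole. A production version would probably need to hold back a trailing partial sequence until the next chunk arrives.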
3. Context Window Overflow Without Compression
Explanation: Multi-turn design sessions accumulate tokens rapidly. Without proactive compression, the proxy will hit model limits and truncate critical recent instructions.
Fix: Set a compression threshold at 80-85% of the model's context window. Use semantic summarization for older turns while preserving exact text for the last 3-5 design iterations. Monitor token usage via proxy telemetry.
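The threshold check can be sketched as below. This is an assumption-laden illustration: `estimateTokens` uses a rough four-characters-per-token heuristic rather than a real tokenizer, and `planCompression` only decides which turns to summarize; the summarization itself is out of scope.

```typescript
type Turn = { role: 'user' | 'assistant'; content: string };

// Rough heuristic, assumed here: about 4 characters per token for English text.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function planCompression(
  turns: Turn[],
  maxWindow: number,
  trigger = 0.85,      // fraction of the window that triggers compression
  preserveRecent = 4,  // most recent turns kept verbatim
): { summarize: Turn[]; keep: Turn[] } {
  const used = turns.reduce((sum, t) => sum + estimateTokens(t.content), 0);
  if (used < maxWindow * trigger || turns.length <= preserveRecent) {
    return { summarize: [], keep: turns };
  }
  const cut = turns.length - preserveRecent;
  return { summarize: turns.slice(0, cut), keep: turns.slice(cut) };
}
```

The older turns returned in `summarize` would be replaced by a single summary message, while `keep` is passed through unchanged so recent design iterations retain their exact wording.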
4. GPU Memory Thrashing
Explanation: Running large models alongside proxy routing and client rendering can exhaust VRAM, causing OOM kills or severe latency spikes.
Fix: Deploy a headroom monitor that tracks GPU memory usage. Configure the proxy to shed load by routing complex tasks to the cloud fallback when local VRAM usage exceeds 90%. Poll nvidia-smi or rocm-smi and expose the result through the proxy's health endpoint.
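As a sketch of the decision logic only, the functions below parse the CSV that `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` prints (one "used, total" line in MiB per GPU). Running the command and wiring the result into the health endpoint are left out; `parseGpuMemory` and `shouldShedToCloud` are hypothetical names.

```typescript
function parseGpuMemory(csv: string): { usedMiB: number; totalMiB: number }[] {
  return csv
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => {
      const parts = line.split(',').map((v) => Number(v.trim()));
      // Unparseable fields become NaN, which never exceeds the limit below.
      return { usedMiB: parts[0] ?? NaN, totalMiB: parts[1] ?? NaN };
    });
}

function shouldShedToCloud(csv: string, limit = 0.9): boolean {
  // Shed load if any GPU is above the limit.
  return parseGpuMemory(csv).some((g) => g.usedMiB / g.totalMiB > limit);
}
```

For example, a 24576 MiB card reporting 22500 MiB used is at roughly 91.6% and would trigger the fallback route.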
5. Rigid Routing Rules
Explanation: Hardcoding all requests to local models ignores task complexity. Simple CSS tweaks don't require heavy reasoning models, while multi-component layouts do.
Fix: Implement dynamic routing based on prompt analysis. Classify requests as simple (styling, text changes), complex (layout restructuring, component generation), or agentic (file system operations, multi-step workflows). Route accordingly to balance cost and capability.
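One way to approximate this classification is with keyword heuristics, as in the sketch below. The keyword lists, the 2000-character cutoff (roughly 500 tokens), and the names `classifyRequest` and `ROUTES` are illustrative assumptions and would need tuning against real prompts.

```typescript
type TaskClass = 'simple' | 'complex' | 'agentic';

// Illustrative keyword lists, not an exhaustive taxonomy.
const AGENTIC_HINTS = /\b(read|write|save|open|delete)\s+(the\s+)?file\b|\bsearch the web\b/i;
const COMPLEX_HINTS = /\b(layout|dashboard|component|restructure|redesign|multi-page)\b/i;

function classifyRequest(prompt: string, hasTools: boolean): TaskClass {
  if (hasTools || AGENTIC_HINTS.test(prompt)) return 'agentic';
  if (COMPLEX_HINTS.test(prompt) || prompt.length > 2000) return 'complex';
  return 'simple';
}

// Mirrors the routing rules from Step 1.
const ROUTES: Record<TaskClass, 'local' | 'fallback'> = {
  simple: 'local',
  complex: 'fallback',
  agentic: 'local', // the proxy runs the tool loop itself
};
```

A heuristic like this is cheap but brittle; a small classifier model could replace it without changing the `ROUTES` lookup.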
6. Artifact Tag Mismatch
Explanation: The client parser expects exact <artifact> formatting. Missing attributes or malformed closing tags cause silent failures where designs don't appear in the file panel.
Fix: Enforce strict artifact schema validation in the proxy. If the model outputs malformed tags, retry the request with a system prompt reminder such as: Always output designs using <artifact identifier="file.html" type="text/html" title="Name">...</artifact>. Log parsing failures for model fine-tuning.
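A validation pass along these lines could run on the complete model output before it is forwarded. This is a sketch under simple assumptions (attribute values contain no `>` or `"` characters); `validateArtifacts` and `ArtifactCheck` are hypothetical names.

```typescript
type ArtifactCheck =
  | { ok: true; count: number }
  | { ok: false; reason: string };

const REQUIRED_ATTRS = ['identifier', 'type', 'title'];

function validateArtifacts(output: string): ArtifactCheck {
  const opens = output.match(/<artifact\b[^>]*>/g) ?? [];
  const closes = output.match(/<\/artifact>/g) ?? [];
  if (opens.length !== closes.length) {
    return { ok: false, reason: `unbalanced tags: ${opens.length} open, ${closes.length} close` };
  }
  for (const tag of opens) {
    for (const attr of REQUIRED_ATTRS) {
      // Each required attribute must be present with a non-empty quoted value.
      if (!new RegExp(`\\s${attr}="[^"]+"`).test(tag)) {
        return { ok: false, reason: `missing attribute "${attr}" in ${tag}` };
      }
    }
  }
  return { ok: true, count: opens.length };
}
```

The `reason` string is suitable both for the failure log and for inclusion in the retry prompt.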
7. SSE Timeout and Buffer Bloat
Explanation: Long generation sessions can trigger HTTP timeout limits or cause client-side buffer overflow, resulting in dropped frames or frozen previews.
Fix: Configure proxy keep-alive headers and chunk flush intervals. Set Connection: keep-alive and flush every 256 bytes. On the client, implement a streaming reader with backpressure handling to prevent memory leaks during extended sessions.
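To keep long streams alive, the proxy can interleave keep-alive frames with content frames. The helpers below sketch the framing only; the event shapes follow Anthropic's streaming Messages format, while the function names and the 15-second heartbeat interval are assumptions for illustration.

```typescript
// Serializes one Server-Sent Events frame: an "event:" line, a "data:" line, a blank line.
// JSON.stringify escapes newlines, so the payload always stays on a single data line.
function sseFrame(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Text delta in the shape used by Anthropic-style streaming responses.
function textDelta(text: string, index = 0): string {
  return sseFrame('content_block_delta', {
    type: 'content_block_delta',
    index,
    delta: { type: 'text_delta', text },
  });
}

// Keep-alive frame to send while the model is silent.
const PING = sseFrame('ping', { type: 'ping' });

// Decides whether a heartbeat is due; intervalMs is an assumed default.
function heartbeatDue(lastWriteMs: number, nowMs: number, intervalMs = 15000): boolean {
  return nowMs - lastWriteMs >= intervalMs;
}
```

A proxy would call `heartbeatDue` from a timer and write `PING` whenever no content frame has been sent within the interval, which prevents idle-connection timeouts in intermediaries.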
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Rapid prototyping with brand guidelines | Proxy-routed local (Ollama) | Zero data exfiltration, instant iteration, design system binding | $0.00/session |
| Complex multi-component dashboard | Hybrid routing (local + cloud fallback) | Heavy reasoning requires frontier models; proxy handles tool loop | $0.15-$0.40/session |
| Enterprise compliance (HIPAA/FedRAMP) | Fully on-premise proxy stack | No external endpoints, full audit trail, token budgeting | $0.00/session + infra |
| High-frequency UI testing | Direct local API (bypass proxy) | Lower overhead for single-turn generation, simpler stack | $0.00/session |
Configuration Template
# proxy-config.yml
server:
  port: 8081
  health_check: /health
  dashboard: /dashboard
routing:
  default_provider: ollama
  fallback_provider: anthropic
  complexity_thresholds:
    simple: { max_tokens: 500, route: ollama }
    complex: { min_tokens: 501, route: anthropic }
    agentic: { tools: true, route: ollama }
context:
  max_window: 128000
  compression_trigger: 0.85
  preserve_recent_turns: 4
streaming:
  chunk_size: 256
  flush_interval_ms: 100
  timeout_sec: 60
monitoring:
  gpu_headroom: true
  token_budget: true
  telemetry_endpoint: /metrics
# Dockerfile.proxy
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY src/ ./src/
# The CMD below reads proxy-config.yml, so it must be present in the image.
COPY proxy-config.yml ./
EXPOSE 8081
CMD ["node", "src/server.js", "--config", "proxy-config.yml"]
Quick Start Guide
- Pull local models: Run ollama pull minimax-m2.5:cloud for visual reasoning or ollama pull qwen2.5-coder:7b for lightweight HTML generation.
- Launch the proxy: Execute npm install && node src/server.js --port 8081 to start the routing layer. Verify with curl http://localhost:8081/health.
- Start the design client: Run docker compose -f docker-compose.design.yml up -d to deploy the containerized UI. Access at http://localhost:7456.
- Configure API mode: In the client settings, select BYOK/Anthropic protocol, set base URL to http://host.docker.internal:8081, and use any placeholder API key.
- Generate first artifact: Create a new project, attach a DESIGN.md file, and prompt the assistant. Watch the <artifact> block stream into the preview pane in real time.