cost center into a predictable operational metric.
Core Solution
Implementing a dual-stream optimization pipeline requires two distinct components: a terminal output sanitizer that operates as a proxy layer, and a response constraint engine that modifies system prompt behavior. Both components must integrate without disrupting existing command execution or model routing.
Step 1: Terminal Output Sanitization Proxy
The sanitizer intercepts stdout and stderr before they reach the LLM client. It parses the raw byte stream, identifies non-essential formatting, and reconstructs a clean payload. The architecture uses a streaming filter pattern to avoid buffering large outputs in memory.
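```typescript
import { Transform, TransformCallback } from 'stream';
import { StringDecoder } from 'string_decoder';

export interface SanitizerOptions {
  preserveExitCodes: boolean;
  stripAnsi: boolean;
  collapseProgress: boolean;
}

// CSI sequence: ESC [ + parameter bytes + intermediate bytes + final byte.
// Unlike /\x1b\[[0-9;]*[a-zA-Z]/, this also matches private-mode
// sequences such as ESC[?25l (hide cursor).
const CSI_SEQUENCE = /\x1b\[[0-?]*[ -\/]*[@-~]/g;

export class TerminalStreamSanitizer extends Transform {
  private buffer = '';
  private readonly options: SanitizerOptions;
  // Holds back incomplete multi-byte UTF-8 characters split across chunks.
  private readonly decoder = new StringDecoder('utf8');

  constructor(options: SanitizerOptions) {
    super({ objectMode: true });
    this.options = options;
  }

  _transform(chunk: Buffer | string, _encoding: BufferEncoding, callback: TransformCallback): void {
    this.buffer += typeof chunk === 'string' ? chunk : this.decoder.write(chunk);

    // Process complete lines only; the trailing partial line stays buffered
    // so an escape sequence is never cut in half.
    const lines = this.buffer.split('\n');
    this.buffer = lines.pop() ?? '';

    const cleaned = lines
      .map(line => this.processLine(line))
      .filter(line => line.length > 0)
      .join('\n');

    if (cleaned.length > 0) {
      this.push(cleaned + '\n');
    }
    callback();
  }

  _flush(callback: TransformCallback): void {
    // Emit whatever remains after the final newline, if it survives cleaning.
    const last = this.processLine(this.buffer + this.decoder.end());
    if (last.length > 0) {
      this.push(last);
    }
    callback();
  }

  private processLine(line: string): string {
    let processed = line;
    if (this.options.stripAnsi) {
      processed = processed.replace(CSI_SEQUENCE, '');
    }
    if (this.options.collapseProgress) {
      // A carriage return redraws the line, so keep only the final redraw.
      const redraws = processed.split('\r').filter(part => part.trim().length > 0);
      processed = redraws.pop() ?? '';
    }
    if (this.options.preserveExitCodes) {
      const exitMatch = processed.match(/exit\s+code\s+(\d+)/i);
      if (exitMatch) {
        // The whole line is replaced by a standardized tag.
        return `[EXIT:${exitMatch[1]}]`;
      }
    }
    return processed.trim();
  }
}
```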
Architecture Rationale:
- Streaming transform over buffering: Prevents memory spikes when commands produce megabytes of output.
- Line-based processing: Progress indicators redraw within a single line, and ANSI escape sequences rarely span a newline. Processing complete lines avoids cutting an escape sequence in half at a chunk boundary.
- Configurable flags: Allows teams to toggle sanitization depth based on command type. Debugging sessions may require more raw output than routine status checks.
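The line-scoped rules above can be illustrated on a single string, outside any stream. This is a minimal sketch rather than the sanitizer itself: `cleanLine` is a hypothetical helper, and it assumes a carriage return means the tool redrew the line, so only the final redraw is worth keeping.

```typescript
// CSI sequence: ESC [ + parameter bytes (0x30-0x3F) + intermediate bytes
// (0x20-0x2F) + one final byte (0x40-0x7E).
const CSI_PATTERN = /\x1b\[[0-?]*[ -\/]*[@-~]/g;

// Hypothetical helper: cleans one complete line of terminal output.
function cleanLine(line: string): string {
  const withoutAnsi = line.replace(CSI_PATTERN, '');
  // Split on carriage returns and keep the last non-empty redraw.
  const redraws = withoutAnsi.split('\r').filter(part => part.trim().length > 0);
  const lastRedraw = redraws.pop() ?? '';
  return lastRedraw.trim();
}

// Color codes are removed, and a progress bar collapses to its final state.
console.log(cleanLine('\x1b[32mPASS\x1b[0m src/app.test.ts'));
console.log(cleanLine('Downloading 10%\rDownloading 55%\rDownloading 100%'));
```

Keeping the rules in a pure function like this makes them easy to unit test before wiring them into a stream.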
Step 2: Response Brevity Constraint Engine
Output optimization operates at the prompt layer. Instead of relying on model self-regulation, the system injects a structural constraint that forces terse, schema-aligned responses. This is implemented as a middleware that wraps the system prompt before transmission.
```typescript
export interface PromptConstraintConfig {
  maxTokens: number;
  enforceSchema: boolean;
  suppressPreamble: boolean;
}

export class ResponseBrevityMiddleware {
  private readonly config: PromptConstraintConfig;

  constructor(config: PromptConstraintConfig) {
    this.config = config;
  }

  apply(systemPrompt: string): string {
    const constraints: string[] = [];

    if (this.config.suppressPreamble) {
      constraints.push('Omit all conversational preambles, greetings, and explanatory transitions.');
    }
    if (this.config.enforceSchema) {
      constraints.push('Structure all responses using only the following fields: status, summary, action_items. Exclude markdown formatting unless explicitly requested.');
    }
    if (this.config.maxTokens > 0) {
      constraints.push(`Keep response length strictly under ${this.config.maxTokens} tokens.`);
    }

    // With every constraint disabled, return the prompt untouched instead of
    // appending an empty "Output Constraints" section.
    if (constraints.length === 0) {
      return systemPrompt;
    }

    const constraintBlock = constraints.map(constraint => `- ${constraint}`).join('\n');
    return `${systemPrompt}\n\n## Output Constraints\n${constraintBlock}`;
  }
}
```
Architecture Rationale:
- Prompt-level enforcement over fine-tuning: Zero infrastructure overhead, reversible, and compatible with any model provider.
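- Schema alignment: Steers the model toward a predictable output structure, which simplifies downstream parsing. Because the constraint lives in the prompt, the parser should still validate each response.
- Token budget anchoring: An explicit token limit in the prompt discourages runaway verbosity. It is a soft limit, so pair it with the provider's hard output-token cap when responses must never exceed the budget.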
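To make the schema-alignment point concrete, here is a hedged sketch of the downstream check. It assumes the model is asked to answer in JSON with exactly the three fields named in the constraint text; the source does not specify a wire format, so JSON and the `parseConstrainedResponse` name are illustrative assumptions.

```typescript
// Hypothetical shape matching the three fields named in the constraint text.
interface ConstrainedResponse {
  status: string;
  summary: string;
  action_items: string[];
}

const ALLOWED_FIELDS = ['status', 'summary', 'action_items'];

// Returns the parsed response, or null when the model drifted from the schema.
function parseConstrainedResponse(raw: string): ConstrainedResponse | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not JSON at all, e.g. a conversational preamble
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return null;
  }
  const record = parsed as Record<string, unknown>;
  const extraKeys = Object.keys(record).filter(key => ALLOWED_FIELDS.indexOf(key) === -1);
  const status = record['status'];
  const summary = record['summary'];
  const actionItems = record['action_items'];
  if (
    extraKeys.length > 0 ||
    typeof status !== 'string' ||
    typeof summary !== 'string' ||
    !Array.isArray(actionItems) ||
    !actionItems.every(item => typeof item === 'string')
  ) {
    return null;
  }
  return { status, summary, action_items: actionItems as string[] };
}
```

A `null` result is a useful signal in its own right: it can trigger a retry or be logged as a constraint violation.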
Step 3: Integration Pipeline
The two components connect through a unified execution wrapper. The proxy sanitizes command output, the middleware constrains the system prompt, and the LLM client routes the optimized payload.
```typescript
import { Readable } from 'stream';

export class AgenticTokenOptimizer {
  private readonly promptMiddleware: ResponseBrevityMiddleware;

  constructor() {
    this.promptMiddleware = new ResponseBrevityMiddleware({
      maxTokens: 150,
      enforceSchema: true,
      suppressPreamble: true
    });
  }

  async executeCommand(command: string, baseSystemPrompt: string): Promise<string> {
    const optimizedPrompt = this.promptMiddleware.apply(baseSystemPrompt);
    const rawOutput = await this.runShellCommand(command);
    const sanitizedOutput = await this.sanitizeStream(rawOutput);
    return this.queryModel(optimizedPrompt, sanitizedOutput);
  }

  // A Transform stream cannot be reused once it has ended, so every command
  // gets a fresh sanitizer instead of sharing one instance.
  private createSanitizer(): TerminalStreamSanitizer {
    return new TerminalStreamSanitizer({
      preserveExitCodes: true,
      stripAnsi: true,
      collapseProgress: true
    });
  }

  private async runShellCommand(cmd: string): Promise<Readable> {
    // Shell execution logic returning a Node.js readable stream
    throw new Error('Shell execution stub');
  }

  private sanitizeStream(stream: Readable): Promise<string> {
    return new Promise((resolve, reject) => {
      const chunks: string[] = [];
      // pipe() does not forward errors from the source, so listen on both ends.
      stream.on('error', reject);
      stream.pipe(this.createSanitizer())
        .on('data', (chunk: string) => chunks.push(chunk))
        .on('end', () => resolve(chunks.join('')))
        .on('error', reject);
    });
  }

  private async queryModel(prompt: string, context: string): Promise<string> {
    // LLM API call with optimized prompt and sanitized context
    throw new Error('Model query stub');
  }
}
```
Why This Architecture Works:
- Separation of concerns: I/O filtering and prompt engineering operate independently, allowing parallel optimization.
- Predictable parsing: Schema-constrained outputs replace most regex-heavy post-processing with a simple field check.
- Provider-agnostic: The pipeline wraps the LLM client, making it compatible with Claude, GPT, or open-weight models without vendor lock-in.
Pitfall Guide
1. Over-Aggressive ANSI Stripping
Explanation: Naive regex replacement of escape sequences can corrupt terminal-dependent tools that rely on cursor positioning or color-coded output for parsing.
Fix: Use a state-machine parser that tracks escape sequence boundaries. Preserve sequences that map to known terminal control functions while stripping decorative formatting.
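A state-machine scanner along these lines might look as follows. It is a sketch, not a full terminal emulator: it recognizes only CSI (ESC [) and OSC (ESC ]) sequences, and the `keepFinalBytes` parameter is a hypothetical way to let selected control functions pass through.

```typescript
type ScanState = 'text' | 'escape' | 'csi' | 'osc';

// Removes CSI and OSC sequences. CSI sequences whose final byte appears in
// `keepFinalBytes` (for example 'H' for cursor position) pass through unchanged.
function stripEscapes(input: string, keepFinalBytes: string = ''): string {
  let state: ScanState = 'text';
  let output = '';
  let pending = ''; // characters of the CSI sequence currently being scanned

  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    const code = input.charCodeAt(i);
    if (state === 'text') {
      if (ch === '\x1b') {
        state = 'escape';
        pending = ch;
      } else {
        output += ch;
      }
    } else if (state === 'escape') {
      pending += ch;
      if (ch === '[') {
        state = 'csi';
      } else if (ch === ']') {
        state = 'osc';
      } else {
        // Any other escape is treated as a two-character sequence and dropped.
        state = 'text';
        pending = '';
      }
    } else if (state === 'csi') {
      pending += ch;
      if (code >= 0x40 && code <= 0x7e) {
        // A final byte closes the sequence; keep it only if requested.
        if (keepFinalBytes.indexOf(ch) !== -1) {
          output += pending;
        }
        state = 'text';
        pending = '';
      }
    } else {
      // OSC runs until BEL or the two-character terminator ESC \.
      if (ch === '\x07' || (ch === '\\' && input.charAt(i - 1) === '\x1b')) {
        state = 'text';
        pending = '';
      }
    }
  }
  return output; // an unterminated trailing sequence is discarded
}
```

Calling `stripEscapes(text)` removes everything it recognizes, while `stripEscapes(text, 'H')` would keep cursor-position sequences for a tool that depends on them.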
2. Exit Code Loss During Compression
Explanation: Stripping repetitive lines may accidentally remove the final line containing the process exit status, breaking error-handling logic.
Fix: Implement a lookahead buffer that guarantees the last three lines of output are preserved regardless of compression rules. Explicitly tag exit codes with a standardized prefix.
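One way to sketch the tail guarantee: compress only the head of the output and copy the last few lines through verbatim. The function name and the `[xN]` repeat marker are illustrative choices, not part of the original pipeline.

```typescript
// Collapses runs of identical consecutive lines, but never touches the last
// `tailSize` lines, where exit status and final errors usually appear.
function compressWithTail(lines: string[], tailSize: number = 3): string[] {
  const splitAt = Math.max(0, lines.length - tailSize);
  const head = lines.slice(0, splitAt);
  const tail = lines.slice(splitAt);

  const compressed: string[] = [];
  let runLength = 0;
  for (let i = 0; i < head.length; i++) {
    runLength++;
    const isRunEnd = i === head.length - 1 || head[i + 1] !== head[i];
    if (isRunEnd) {
      const line = head[i] ?? '';
      // Annotate collapsed runs so the repeat count is not lost.
      compressed.push(runLength > 1 ? `${line} [x${runLength}]` : line);
      runLength = 0;
    }
  }
  return compressed.concat(tail);
}
```

Because the tail is sliced off before any rule runs, no compression rule added later can remove the exit-status line by accident.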
3. Prompt Constraint Drift
Explanation: Brevity instructions can cause the model to omit critical technical details, especially when handling complex stack traces or multi-step debugging tasks.
Fix: Anchor constraints to task-specific schemas rather than generic brevity rules. Use conditional prompting that relaxes constraints when error severity exceeds a defined threshold.
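Conditional relaxation could be expressed as a small profile selector. Everything here is a hypothetical sketch: the severity levels, the 600-token diagnostic budget, and the extra schema fields are placeholders to tune, with only the 150-token terse budget taken from the configuration shown earlier.

```typescript
type Severity = 'info' | 'warning' | 'error' | 'fatal';

const SEVERITY_RANK: Record<Severity, number> = { info: 0, warning: 1, error: 2, fatal: 3 };

interface ConstraintProfile {
  maxTokens: number;
  schemaFields: string[];
}

// Terse profile for routine output.
const TERSE: ConstraintProfile = {
  maxTokens: 150,
  schemaFields: ['status', 'summary', 'action_items'],
};

// Roomier profile (assumed values) so stack traces and root causes survive.
const DIAGNOSTIC: ConstraintProfile = {
  maxTokens: 600,
  schemaFields: ['status', 'summary', 'root_cause', 'stack_frames', 'action_items'],
};

// Relaxes constraints once severity reaches the `relaxAt` threshold.
function selectProfile(severity: Severity, relaxAt: Severity = 'error'): ConstraintProfile {
  return SEVERITY_RANK[severity] >= SEVERITY_RANK[relaxAt] ? DIAGNOSTIC : TERSE;
}
```

The selected profile would then feed the middleware configuration for that single request, leaving routine commands on the terse path.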
4. Streaming Tokenization Overhead
Explanation: Processing output line-by-line without batching can increase tokenization API calls, negating some of the input savings.
Fix: Accumulate sanitized chunks into fixed-size blocks before tokenization. Use streaming-aware token counters that estimate payload size without full API round-trips.
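The batching idea can be sketched with a local estimator in place of a tokenizer round-trip. The four-characters-per-token ratio is a rough assumption that varies by model and content, so treat the numbers as budgeting hints only; `ChunkBatcher` is a hypothetical name.

```typescript
// Rough heuristic (assumption): about four characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Groups sanitized chunks into blocks of roughly `blockTokens` estimated tokens.
class ChunkBatcher {
  private pending: string[] = [];
  private pendingTokens = 0;
  private readonly blockTokens: number;

  constructor(blockTokens: number) {
    this.blockTokens = blockTokens;
  }

  // Returns a completed block once the budget is reached, otherwise null.
  add(chunk: string): string | null {
    this.pending.push(chunk);
    this.pendingTokens += estimateTokens(chunk);
    return this.pendingTokens >= this.blockTokens ? this.flush() : null;
  }

  // Returns whatever is buffered (possibly an empty string) and resets state.
  flush(): string {
    const block = this.pending.join('');
    this.pending = [];
    this.pendingTokens = 0;
    return block;
  }
}
```

Call `flush()` once the command finishes so the last partial block is not lost.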
5. Assuming Output Reduction Hurts Reasoning
Explanation: Teams often fear that terse responses degrade model accuracy. In practice, verbosity correlates with conversational padding, not reasoning depth.
Fix: A/B test response quality with and without constraints. Reserve verbose prompting for complex architectural decisions; apply strict brevity for routine status checks and diff summaries.
6. Proxy Routing Misconfiguration
Explanation: Applying sanitization to all commands indiscriminately can break interactive CLI tools that expect raw terminal behavior.
Fix: Maintain an allowlist of non-interactive commands (e.g., git status, cargo test, find, grep) and bypass the proxy for interactive sessions or TUI applications.
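A minimal routing check, assuming commands arrive as plain strings, might look like this. The allowlist entries come from the examples above; `shouldSanitize` is a hypothetical name.

```typescript
// Example allowlist of non-interactive commands; extend it for your toolchain.
const NON_INTERACTIVE_ALLOWLIST = ['git status', 'cargo test', 'find', 'grep'];

// True when the command equals an allowlisted entry or starts with that entry
// followed by a space, so "grep -r foo" matches "grep" but "grepx" does not.
function shouldSanitize(command: string, allowlist: string[] = NON_INTERACTIVE_ALLOWLIST): boolean {
  const normalized = command.trim().replace(/\s+/g, ' ');
  return allowlist.some(entry => normalized === entry || normalized.indexOf(entry + ' ') === 0);
}
```

Prefix matching is deliberately simple. A command such as `find . | less` would still match `find`, so pipelines that end in an interactive pager need an extra check.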
7. Context Window Fragmentation
Explanation: Sanitized output may still exceed context limits if multiple commands are concatenated without chunking.
Fix: Implement a sliding window manager that prioritizes recent command outputs and discards older, resolved steps. Tag each chunk with a command identifier for traceability.
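A sliding window over command history can be sketched as a newest-first selection under a token budget. The `CommandChunk` shape, the `resolved` flag, and the four-characters-per-token estimate are assumptions made for illustration.

```typescript
interface CommandChunk {
  commandId: string; // identifier used for traceability
  output: string;    // sanitized command output
  resolved: boolean; // true once the step no longer needs attention
}

// Rough heuristic (assumption): about four characters per token.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Walks history newest-first, skips resolved steps, stops at the first chunk
// that would exceed the budget, and returns the selection in chronological order.
function buildContextWindow(history: CommandChunk[], budgetTokens: number): CommandChunk[] {
  const selected: CommandChunk[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const chunk = history[i];
    if (chunk === undefined || chunk.resolved) continue;
    const cost = approxTokens(chunk.output);
    if (used + cost > budgetTokens) break;
    selected.push(chunk);
    used += cost;
  }
  return selected.reverse();
}

// Prefixes each chunk with its command identifier before concatenation.
function renderWindow(chunks: CommandChunk[]): string {
  return chunks.map(chunk => `[cmd:${chunk.commandId}]\n${chunk.output}`).join('\n');
}
```

Stopping at the first chunk that does not fit keeps the window contiguous in time, which tends to be easier for the model to follow than a window with gaps.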
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-frequency status checks (git status, ls) | Full sanitization + strict brevity | Output is highly repetitive; model only needs state changes | -85% token cost |
| Complex debugging sessions | Partial sanitization + schema-constrained verbose | Stack traces require preservation; reasoning depth matters | -40% token cost |
| Interactive CLI workflows | Bypass proxy + default prompting | Terminal state management requires raw I/O | Baseline cost |
| Batch processing pipelines | Aggressive compression + token budget anchoring | Deterministic output enables automated parsing | -90% token cost |
Configuration Template
```yaml
# agentic-token-optimizer.config.yaml
pipeline:
  sanitizer:
    strip_ansi: true
    collapse_progress: true
    preserve_exit_codes: true
    max_line_length: 120
    allowed_commands:
      - git status
      - cargo test
      - find
      - grep
      - npm run lint
      - docker compose ps
  prompt_constraints:
    max_output_tokens: 150
    enforce_schema: true
    suppress_preamble: true
    schema_fields:
      - status
      - summary
      - action_items
  routing:
    bypass_interactive: true
    fallback_on_error: true
    token_tracking: true
```
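As a hedged sketch of how the template might map onto the classes from the Core Solution: the interface below assumes the three groups (sanitizer, prompt_constraints, routing) are nested under `pipeline`, and that the YAML has already been parsed into a plain object by whatever loader the project uses. The function names are hypothetical.

```typescript
// Assumed shape of the parsed configuration template.
interface OptimizerConfig {
  pipeline: {
    sanitizer: {
      strip_ansi: boolean;
      collapse_progress: boolean;
      preserve_exit_codes: boolean;
      max_line_length: number;
      allowed_commands: string[];
    };
    prompt_constraints: {
      max_output_tokens: number;
      enforce_schema: boolean;
      suppress_preamble: boolean;
      schema_fields: string[];
    };
    routing: {
      bypass_interactive: boolean;
      fallback_on_error: boolean;
      token_tracking: boolean;
    };
  };
}

// Maps snake_case config keys to the camelCase sanitizer options.
function toSanitizerOptions(config: OptimizerConfig) {
  const sanitizer = config.pipeline.sanitizer;
  return {
    stripAnsi: sanitizer.strip_ansi,
    collapseProgress: sanitizer.collapse_progress,
    preserveExitCodes: sanitizer.preserve_exit_codes,
  };
}

// Maps the prompt constraint section to the middleware configuration.
function toPromptConstraintConfig(config: OptimizerConfig) {
  const constraints = config.pipeline.prompt_constraints;
  return {
    maxTokens: constraints.max_output_tokens,
    enforceSchema: constraints.enforce_schema,
    suppressPreamble: constraints.suppress_preamble,
  };
}
```

Note that `max_line_length`, `allowed_commands`, `schema_fields`, and the `routing` flags have no counterpart in the classes shown earlier, so they would need their own handling.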
Quick Start Guide
- Install the proxy wrapper: Add the TerminalStreamSanitizer and ResponseBrevityMiddleware classes to your agent's execution module. Configure the YAML template to match your command set.
- Hook into the command runner: Replace direct shell execution with the AgenticTokenOptimizer.executeCommand() method. Ensure stdout/stderr streams are piped through the sanitizer before reaching the LLM client.
- Apply prompt constraints: Wrap your base system prompt with the middleware before each API call. Test with a routine command like git status to verify output structure and token reduction.
- Instrument token tracking: Log pre- and post-sanitization token counts for the first 50 executions. Validate that exit codes and error messages are preserved while conversational padding is eliminated.
- Scale to production: Roll out to all non-interactive commands. Monitor latency and task completion metrics. Adjust filtering flags if specific tools require raw terminal behavior.
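For the token-tracking step, a small aggregation helper may be enough to start with. This sketch assumes token counts are already available from your tokenizer or provider usage data; `TokenSample` and both function names are hypothetical.

```typescript
interface TokenSample {
  command: string;
  rawTokens: number;       // tokens before sanitization
  sanitizedTokens: number; // tokens after sanitization
}

// Percentage of tokens removed for one execution, rounded to one decimal.
function reductionPercent(sample: TokenSample): number {
  if (sample.rawTokens === 0) return 0;
  return Math.round((1 - sample.sanitizedTokens / sample.rawTokens) * 1000) / 10;
}

// Mean per-command reduction across all recorded samples.
function summarize(samples: TokenSample[]): { count: number; meanReductionPercent: number } {
  if (samples.length === 0) {
    return { count: 0, meanReductionPercent: 0 };
  }
  const total = samples.reduce((sum, sample) => sum + reductionPercent(sample), 0);
  return {
    count: samples.length,
    meanReductionPercent: Math.round((total / samples.length) * 10) / 10,
  };
}
```

Recording one sample per execution for the first 50 runs gives a baseline to compare against the estimates in the decision matrix.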