Everyone is squeezing context. We stopped putting everything in one context.
Current Situation Analysis
Multi-agent AI pipelines suffer from a structural inefficiency that prompt compression and cheaper models cannot fix: context delivery bloat. When a feature requires coordination across multiple specialized agents (architect, developer, QA, security, etc.), the default pattern is to inject the entire project state into every agent's context window. This includes full architecture documents, implementation plans, and append-only decision logs.
The industry treats this as a prompt engineering problem. Teams truncate text, add "Be concise." directives, or enable response caching. These fixes treat symptoms. The actual bottleneck is information architecture: agents receive library-scale documentation when they only need a single reference card.
The consequence is predictable: token consumption scales linearly with project age, not task complexity. A six-month-old project with 50 artifacts and 50 historical decisions forces every new agent run to process tens of thousands of irrelevant tokens. At standard Sonnet pricing ($3/1M input tokens), this translates to $0.35+ in wasted spend per feature pipeline, or roughly 117,000 excess input tokens across the run. Latency compounds as models parse noise they are explicitly instructed to ignore. The fundamental flaw is assuming LLMs can filter irrelevant context at no cost. They often can filter it, but doing so burns tokens, increases variance, and degrades instruction following.
WOW Moment: Key Findings
Restructuring context delivery from bulk injection to task-aware routing yields disproportionate efficiency gains. The following comparison isolates the impact of replacing full-document injection with structured summaries and filtered memory logs.
| Approach | Tokens/Agent | Cost/Feature Run | Scalability Behavior |
|---|---|---|---|
| Bulk Injection (Legacy) | 21,459 | $0.40+ | Linear growth with project size |
| Context-Aware Routing | 2,216 | $0.05 | Flat curve; noise isolated |
Why this matters: The routing approach decouples context size from repository maturity. Summaries cap per-agent reads at ~250 tokens. Memory filtering extracts only the top-K relevant entries from append-only logs. The full documentation remains intact and accessible, but agents only fetch depth when explicitly required. This transforms context management from a storage problem into a delivery problem, enabling predictable costs and consistent latency regardless of project age.
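As a rough check on the figures in the table, the per-agent input cost and the relative reduction can be computed directly. This is a minimal sketch: the function names are illustrative, the price is the $3/1M input-token figure quoted earlier, and per-pipeline totals depend on how many agents a run involves, which the table does not state.

```typescript
// Assumed price, taken from the article: $3 per 1M input tokens.
const USD_PER_MILLION_INPUT_TOKENS = 3;

/** Input-token cost in USD for a single agent read. */
function inputCostUsd(tokens: number, usdPerMillion: number = USD_PER_MILLION_INPUT_TOKENS): number {
  return (tokens / 1e6) * usdPerMillion;
}

/** Fractional reduction in tokens when moving from `before` to `after`. */
function reduction(before: number, after: number): number {
  return (before - after) / before;
}

const bulkTokens = 21459;   // tokens/agent under bulk injection (from the table)
const routedTokens = 2216;  // tokens/agent under context-aware routing (from the table)

console.log(inputCostUsd(bulkTokens).toFixed(4));                        // per-agent cost, bulk
console.log(inputCostUsd(routedTokens).toFixed(4));                      // per-agent cost, routed
console.log((reduction(bulkTokens, routedTokens) * 100).toFixed(1) + '%'); // relative reduction
```

Under these assumptions a bulk read costs about $0.064 per agent and a routed read about $0.007, a reduction of roughly 89.7% in input tokens per agent.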
Core Solution
The architecture replaces blind context injection with a two-stage delivery pipeline: artifact summarization and task-scoped memory filtering. Both stages operate as producer-consumer contracts, ensuring agents receive structured, actionable context without losing access to source truth.
Stage 1: Artifact Summarization Pipeline
Instead of injecting full markdown documents, the system generates paired summary files on artifact creation or modification. Summaries follow a strict schema: decision, stack, risks, and a reference path to the full document. Generation uses a fallback chain to balance cost and reliability:
- Anthropic Haiku (API-driven, ~$0.0005/call, ~200ms latency)
- OpenRouter Kimi K2 (fallback API)
- Deterministic keyword extraction (zero-cost, offline, ~50ms latency)
import { readFileSync, writeFileSync } from 'fs';
import { join } from 'path';

interface ArtifactSummary {
  title: string;
  decision: string;
  stack: string[];
  risks: string[];
  sourcePath: string; // path to the full source document
}

export class ContextSummarizer {
  private readonly MAX_TOKENS = 250;
  private readonly FALLBACK_THRESHOLD = 0.8; // not referenced in this excerpt

  constructor(private readonly outputDir: string) {}

  async generateFromArtifact(artifactPath: string): Promise<ArtifactSummary> {
    const raw = readFileSync(artifactPath, 'utf-8');

    // Attempt API summarization first
    const apiResult = await this.tryAPISummary(raw);
    if (apiResult && this.validateSchema(apiResult)) {
      return this.persistSummary(artifactPath, apiResult);
    }

    // Fallback to deterministic extraction
    const heuristic = this.extractHeuristic(raw, artifactPath);
    return this.persistSummary(artifactPath, heuristic);
  }

  private async tryAPISummary(content: string): Promise<ArtifactSummary | null> {
    // Placeholder for Haiku/Kimi K2 integration
    // Returns null on timeout or quota exhaustion
    return null;
  }

  private extractHeuristic(content: string, artifactPath: string): ArtifactSummary {
    const lines = content.split('\n').filter(l => l.trim());
    return {
      // First non-empty line, with any leading markdown heading markers removed
      title: lines[0]?.replace(/^#+\s*/, '') || 'Untitled',
      decision: lines.find(l => l.toLowerCase().includes('decision')) || 'N/A',
      stack: lines.filter(l => l.toLowerCase().includes('stack') || l.toLowerCase().includes('tech')),
      risks: lines.filter(l => l.toLowerCase().includes('risk') || l.toLowerCase().includes('caveat')),
      // Reference the artifact's path, not its content
      sourcePath: artifactPath
    };
  }

  private validateSchema(summary: ArtifactSummary): boolean {
    // Rough estimate: about 4 characters per token
    const tokenEstimate = JSON.stringify(summary).length / 4;
    return tokenEstimate <= this.MAX_TOKENS && !!summary.decision;
  }

  private persistSummary(source: string, summary: ArtifactSummary): ArtifactSummary {
    // Always point the summary back at the artifact it was generated from
    const persisted: ArtifactSummary = { ...summary, sourcePath: source };
    // Replace anything that is not a word character or hyphen so the title is a safe file name
    const fileName = `${persisted.title.replace(/[^\w-]+/g, '_')}.summary.json`;
    writeFileSync(join(this.outputDir, fileName), JSON.stringify(persisted, null, 2));
    return persisted;
  }
}
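The tryAPISummary placeholder above leaves the fallback chain itself unspecified. One way to sketch it, assuming each provider is wrapped behind a common interface, is shown here. The SummaryProvider shape, withTimeout, and summarizeWithFallback are hypothetical names introduced for illustration; no real Anthropic or OpenRouter client calls are shown.

```typescript
// Hypothetical provider shape; real API clients would be wrapped to match it.
type SummaryProvider = {
  name: string;
  timeoutMs: number;
  summarize: (content: string) => Promise<string>;
};

/** Rejects if `work` does not settle within `ms` milliseconds. */
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    work.then(
      value => { clearTimeout(timer); resolve(value); },
      error => { clearTimeout(timer); reject(error); }
    );
  });
}

/** Tries each provider in order and returns the first success plus its source. */
async function summarizeWithFallback(
  providers: SummaryProvider[],
  content: string
): Promise<{ provider: string; summary: string }> {
  const failures: string[] = [];
  for (const p of providers) {
    try {
      const summary = await withTimeout(p.summarize(content), p.timeoutMs);
      return { provider: p.name, summary };
    } catch (err) {
      // Record the failure and move on to the next provider in the chain
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${p.name}: ${reason}`);
    }
  }
  throw new Error(`all providers failed: ${failures.join('; ')}`);
}
```

Placing the deterministic extractor last in the providers array means the chain only throws if the heuristic itself fails, which matches the degradation order listed above. Returning the provider name alongside the summary also makes fallback frequency easy to log.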
Stage 2: Task-Scoped Memory Filtering
Append-only logs like decisions.md grow indefinitely. Injecting the entire file wastes context on historical entries unrelated to the current task. The filter scores each log entry against the task description and returns only the top-K matches.
import { readFileSync } from 'fs';

export class MemoryFilter {
  private readonly DEFAULT_K = 5;

  async filterByTask(
    logPath: string,
    taskDescription: string,
    k: number = this.DEFAULT_K
  ): Promise<string[]> {
    const raw = readFileSync(logPath, 'utf-8');
    const entries = this.parseEntries(raw);

    // Score every entry against the task, then keep the k highest-scoring ones
    const scored = entries.map(entry => ({
      entry,
      score: this.computeRelevance(entry, taskDescription)
    }));

    return scored
      .sort((a, b) => b.score - a.score)
      .slice(0, k)
      .map(s => s.entry);
  }

  private parseEntries(content: string): string[] {
    // Entries are separated by a line containing only '---'
    return content.split(/\n---\n/).filter(e => e.trim().length > 0);
  }

  private computeRelevance(entry: string, task: string): number {
    const taskTokens = new Set(task.toLowerCase().split(/\s+/));
    const entryTokens = entry.toLowerCase().split(/\s+/);

    // Count entry words (including repeats) that also appear in the task
    let matches = 0;
    entryTokens.forEach(token => {
      if (taskTokens.has(token)) matches++;
    });

    // Normalize by the larger side so the score stays within [0, 1]
    return matches / Math.max(taskTokens.size, entryTokens.length);
  }
}
Architecture Rationale
- Producer-Consumer Contract: Summaries and filtered logs are generated before agent initialization. This shifts context preparation from runtime to pipeline setup, eliminating per-agent parsing overhead.
- Graceful Degradation: The fallback chain ensures zero-downtime summarization. Heuristic extraction trades semantic depth for deterministic cost and speed, which is acceptable for routine artifacts.
- Depth-on-Demand: Summaries explicitly reference full documents. Agents that require architectural nuance can fetch the source file. This preserves completeness while optimizing the common path.
- Token Budget Enforcement: The 250-token cap on summaries prevents summary bloat. The filter's k parameter caps memory injection. Together, they create predictable context windows regardless of repository size.
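To make the depth-on-demand and budget points concrete, the following sketch assembles one agent's context from summaries and filtered log entries. It is an illustration under assumptions: buildAgentContext, renderSummary, and the maxTokens parameter are hypothetical names, and token counts use the same rough 4-characters-per-token estimate as the summarizer.

```typescript
interface ArtifactSummary {
  title: string;
  decision: string;
  stack: string[];
  risks: string[];
  sourcePath: string;
}

// Rough heuristic: about 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

/** Renders a summary as a short text card that keeps the path to the full document. */
function renderSummary(s: ArtifactSummary): string {
  return [
    `## ${s.title}`,
    `Decision: ${s.decision}`,
    `Stack: ${s.stack.join(', ')}`,
    `Risks: ${s.risks.join(', ')}`,
    `Full document: ${s.sourcePath}`
  ].join('\n');
}

/** Adds sections in order and stops at the first one that would exceed the budget. */
function buildAgentContext(
  summaries: ArtifactSummary[],
  memoryEntries: string[],
  maxTokens: number
): { context: string; tokens: number; dropped: number } {
  const sections = [...summaries.map(renderSummary), ...memoryEntries];
  const kept: string[] = [];
  let tokens = 0;
  for (const section of sections) {
    const cost = estimateTokens(section);
    if (tokens + cost > maxTokens) break;
    kept.push(section);
    tokens += cost;
  }
  return { context: kept.join('\n\n'), tokens, dropped: sections.length - kept.length };
}
```

Each rendered card ends with the source path, so an agent that needs more than the summary knows which file to fetch. The returned dropped count is a natural value to log when tuning the budget.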
Pitfall Guide
| Pitfall | Explanation | Fix |
|---|---|---|
| Summary Schema Drift | Unstructured summaries lose critical fields (risks, stack, decisions), forcing agents to read full docs anyway. | Enforce a strict JSON/YAML schema with required fields. Validate against token budget before persistence. |
| Filter False Negatives | Keyword scoring misses semantically relevant entries that use different terminology. | Implement hybrid scoring: combine lexical overlap with lightweight embedding similarity or synonym mapping. |
| Stale Summaries | Artifacts are updated but summaries are not regenerated, causing agents to act on outdated context. | Bind summary generation to file system hooks or CI/CD pipelines. Invalidate summaries on mtime change. |
| Over-Filtering Memory | Setting k too low drops critical historical context, causing repeated mistakes. | Start with k=5, monitor agent error rates, and implement a fallback flag (CONTEXT_FILTER_RELAX=1) to expand scope when confidence drops. |
| Assuming Model Noise Tolerance | Relying on "ignore irrelevant context" instructions burns tokens and increases output variance. | Remove noise structurally. Context routing is cheaper and more reliable than prompt engineering. |
| Centralized Context Bottleneck | All agents share a single context router, creating contention in parallel pipelines. | Scope context routers to agent instances. Use session-level caching for shared reads like PROJECT.md. |
| Missing Observability | No visibility into token deltas per pipeline run makes optimization guesswork. | Instrument context preparation steps. Log tokens injected, filter hit rates, and fallback trigger frequency. |
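The fix suggested for filter false negatives can be sketched with the synonym-mapping option. This is a minimal illustration, not the article's scoring method: the synonym table is invented for the example, Jaccard overlap is swapped in as the similarity measure, and embedding similarity is left out entirely.

```typescript
// Hypothetical synonym table; a real one would be built from project vocabulary.
const SYNONYMS = new Map<string, string>([
  ['authentication', 'auth'],
  ['login', 'auth'],
  ['database', 'db'],
  ['postgres', 'db']
]);

/** Lowercases, splits on non-alphanumerics, and maps synonyms to a canonical term. */
function canonicalTokens(text: string): Set<string> {
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter(w => w.length > 0);
  return new Set(words.map(w => SYNONYMS.get(w) ?? w));
}

/** Jaccard overlap between the two canonical token sets, in [0, 1]. */
function synonymAwareScore(entry: string, task: string): number {
  const entryTokens = canonicalTokens(entry);
  const taskTokens = canonicalTokens(task);
  if (entryTokens.size === 0 || taskTokens.size === 0) return 0;
  let shared = 0;
  entryTokens.forEach(token => {
    if (taskTokens.has(token)) shared++;
  });
  return shared / (entryTokens.size + taskTokens.size - shared);
}
```

With this table, an entry mentioning "Postgres" scores against a task mentioning "database", which plain word overlap would miss. The trade-off is that the table has to be maintained by hand.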
Production Bundle
Action Checklist
- Define artifact summary schema with required fields (decision, stack, risks, source reference)
- Implement file-watch or CI hook to trigger summarization on artifact modification
- Configure fallback chain: primary API → secondary API → deterministic heuristic
- Deploy memory filter with configurable k parameter and task-description input
- Add context routing middleware to agent initialization pipeline
- Instrument token injection metrics per agent and per pipeline run
- Establish opt-out mechanism for legacy behavior during migration
- Validate summary freshness against artifact mtime before agent dispatch
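The freshness check in the last item can be sketched with Node's fs module. The function names are illustrative, and comparing modification times is only one possible policy; a content hash would be stricter.

```typescript
import { existsSync, statSync } from 'fs';

/** Pure comparison: stale when the summary is missing or older than the artifact. */
function isStale(artifactMtimeMs: number, summaryMtimeMs: number | undefined): boolean {
  return summaryMtimeMs === undefined || artifactMtimeMs > summaryMtimeMs;
}

/** Filesystem wrapper: reads both modification times and applies the comparison. */
function isSummaryStale(artifactPath: string, summaryPath: string): boolean {
  const summaryMtimeMs = existsSync(summaryPath) ? statSync(summaryPath).mtimeMs : undefined;
  return isStale(statSync(artifactPath).mtimeMs, summaryMtimeMs);
}
```

Keeping the comparison separate from the filesystem calls makes the policy easy to unit test without touching disk.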
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| New project (< 20 artifacts) | Full injection + basic summarization | Low baseline noise; summarization overhead may outweigh the savings | Neutral to +5% |
| Mature project (> 50 artifacts) | Context-aware routing with strict schema | Noise scales linearly; routing caps per-agent tokens | -80% to -90% |
| High-latency constraints | Heuristic summarization + lexical filter | API calls add 150-200ms; deterministic methods run in <50ms | $0.0000/call |
| Compliance/Audit requirements | Full doc access + filtered summaries | Regulators require complete decision trails; summaries optimize agent reads | +10% storage, -85% compute |
| Parallel multi-agent runs | Session-scoped context cache | Prevents redundant reads of shared artifacts across agents | -15% to -20% |
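The session-scoped cache in the last row can be as simple as a memoized reader that lives for one pipeline run. This is a sketch under assumptions: SessionReadCache is a hypothetical name, and the reader is injected so the cache stays independent of where artifacts are stored.

```typescript
/** Memoizes reads for the lifetime of one pipeline session. */
class SessionReadCache {
  private readonly cache = new Map<string, string>();

  // `read` is the underlying loader, e.g. a function wrapping readFileSync
  constructor(private readonly read: (path: string) => string) {}

  get(path: string): string {
    const hit = this.cache.get(path);
    if (hit !== undefined) return hit;
    const content = this.read(path);
    this.cache.set(path, content);
    return content;
  }

  /** Number of distinct paths loaded so far. */
  get size(): number {
    return this.cache.size;
  }
}
```

Creating one instance per pipeline run and discarding it afterwards avoids serving stale content across runs while still removing repeated reads of shared files such as PROJECT.md.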
Configuration Template
context_routing:
  summarizer:
    max_tokens: 250
    schema_version: "1.2"
    fallback_chain:
      - provider: "anthropic"
        model: "haiku"
        timeout_ms: 2000
      - provider: "openrouter"
        model: "kimi-k2"
        timeout_ms: 3000
      - provider: "heuristic"
        mode: "keyword_extraction"
  memory_filter:
    default_k: 5
    scoring_method: "hybrid_lexical"
    relax_threshold: 0.6
    opt_out_env: "CONTEXT_FILTER_DISABLED"
  observability:
    log_injection_tokens: true
    track_fallback_triggers: true
    pipeline_metric_prefix: "ctx_router"
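Once the template is parsed into an object, the memory_filter block can be checked before use. The sketch below assumes parsing happens elsewhere and makes two guesses the template does not confirm: that relax_threshold is a value between 0 and 1, and that the opt-out is active when the named environment variable is "1" or "true".

```typescript
// Mirrors the memory_filter block of the template; field names follow the YAML keys.
interface MemoryFilterConfig {
  default_k: number;
  scoring_method: string;
  relax_threshold: number;
  opt_out_env: string;
}

/** Returns a list of problems; an empty list means the config looks usable. */
function validateMemoryFilterConfig(cfg: MemoryFilterConfig): string[] {
  const problems: string[] = [];
  if (!Number.isInteger(cfg.default_k) || cfg.default_k < 1) {
    problems.push('default_k must be a positive integer');
  }
  // Assumption: the threshold is a normalized score
  if (cfg.relax_threshold < 0 || cfg.relax_threshold > 1) {
    problems.push('relax_threshold must be within [0, 1]');
  }
  if (cfg.opt_out_env.trim().length === 0) {
    problems.push('opt_out_env must name an environment variable');
  }
  return problems;
}

/** Assumption: the opt-out is active when the variable is set to "1" or "true". */
function isFilterDisabled(
  cfg: MemoryFilterConfig,
  env: Record<string, string | undefined>
): boolean {
  const value = (env[cfg.opt_out_env] ?? '').toLowerCase();
  return value === '1' || value === 'true';
}
```

In a Node process the second argument would typically be process.env. Passing it in explicitly keeps the check testable.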
Quick Start Guide
- Initialize the router: Deploy the ContextSummarizer and MemoryFilter classes to your pipeline orchestration layer. Configure the YAML template to match your environment variables and fallback preferences.
- Hook artifact writes: Attach a post-write hook to your artifact generation step. Trigger generateFromArtifact() immediately after any markdown or JSON artifact is persisted. Verify summaries appear in the designated output directory.
- Wire agent context: Replace bulk document injection in your agent initialization logic. Call filterByTask(logPath, taskDescription) for memory logs and inject the corresponding .summary.json files. Ensure agents retain a reference path to full documents.
- Validate and monitor: Run a test pipeline with 3-5 agents. Check observability logs for token deltas, filter hit rates, and fallback triggers. Adjust k and max_tokens based on agent output quality and latency targets.
- Deploy to production: Roll out routing to non-critical pipelines first. Monitor cost per feature run and agent error rates. Once stable, expand to full multi-agent workflows and enable session-scoped caching for shared reads.
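The numbers to watch during validation can be aggregated from one record per agent. This is a sketch with hypothetical field names. It also assumes "filter hit rate" means the share of scanned log entries that were kept, which the article does not define.

```typescript
interface ContextRunMetric {
  agent: string;
  baselineTokens: number;  // tokens the agent would have received under bulk injection
  injectedTokens: number;  // tokens actually injected after routing
  entriesScanned: number;  // log entries scored by the filter
  entriesKept: number;     // entries that made the top-K cut
  usedFallback: boolean;   // true when summarization fell through to the heuristic
}

/** Aggregates per-agent records into the three figures tracked per pipeline run. */
function summarizeRun(metrics: ContextRunMetric[]): {
  tokenDelta: number;
  keepRate: number;
  fallbackRate: number;
} {
  const sum = (pick: (m: ContextRunMetric) => number): number =>
    metrics.reduce((total, m) => total + pick(m), 0);
  const scanned = sum(m => m.entriesScanned);
  return {
    // Tokens saved across all agents in the run
    tokenDelta: sum(m => m.baselineTokens - m.injectedTokens),
    // Share of scanned entries that were injected (0 when nothing was scanned)
    keepRate: scanned === 0 ? 0 : sum(m => m.entriesKept) / scanned,
    // Share of agents whose summaries came from the heuristic fallback
    fallbackRate: metrics.length === 0 ? 0 : metrics.filter(m => m.usedFallback).length / metrics.length
  };
}
```

A rising fallbackRate suggests the API providers are timing out or hitting quota, which is worth investigating before adjusting k or max_tokens.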