erted index lookup for function/class names, import paths, and configuration keys.
2. Semantic Proximity: Lightweight code-trained embeddings (e.g., Salesforce/codet5p-137m) to capture intent when exact names are unknown.
3. Structural Extraction: Tree-sitter or language server protocol (LSP) parsing to return complete function bodies, type definitions, or configuration blocks instead of raw line matches.
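The exact-lookup layer in the list above can be illustrated with a small in-memory inverted index. This is a minimal sketch, not the article's implementation: the `InvertedIndex` class, the `tokenize` helper, and the sample file paths are hypothetical names chosen for illustration, and a production index would also store line numbers and handle camelCase splitting.

```typescript
// Split source text into lowercase identifier-like terms.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9_]+/)
    .filter((term) => term.length > 0);
}

// Maps each term to the set of locations (here, file paths) that contain it.
class InvertedIndex {
  private postings = new Map<string, Set<string>>();

  add(location: string, text: string): void {
    for (const term of tokenize(text)) {
      let locations = this.postings.get(term);
      if (!locations) {
        locations = new Set<string>();
        this.postings.set(term, locations);
      }
      locations.add(location);
    }
  }

  // Exact lookup: returns every location whose text contained the term.
  lookup(term: string): string[] {
    const locations = this.postings.get(term.toLowerCase());
    return locations ? Array.from(locations) : [];
  }
}

// Hypothetical sample data.
const index = new InvertedIndex();
index.add("src/auth.ts", "export function validateToken(token: string) {}");
index.add("src/config.ts", "const API_KEY = process.env.API_KEY;");
const hits = index.lookup("validateToken");
```

Lookups are case-insensitive here because both the indexed text and the query are lowercased; whether that is desirable depends on the languages being indexed.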
Step 2: Implement the MCP Server
The following TypeScript implementation demonstrates a context-aware MCP server. It replaces naive file reading with symbol-level extraction and enforces token limits at the protocol boundary.
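```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// The retrieval helpers used below (bm25Index, embeddingModel,
// reciprocalRankFusion, parseWithTreeSitter, estimateTokens) are not defined
// in this listing. The shapes here are inferred from how they are called;
// supply real implementations before running the server.
interface SearchResult { file_path: string; line_number: number }
interface CodeSymbol { file: string; line: number; raw_text: string }
declare const bm25Index: { search(q: string, o: { limit: number }): Promise<SearchResult[]> };
declare const embeddingModel: { rerank(q: string, o: { limit: number }): Promise<SearchResult[]> };
declare function reciprocalRankFusion(a: SearchResult[], b: SearchResult[]): SearchResult[];
declare function parseWithTreeSitter(path: string, lang?: string): { extractSymbolAtLine(line: number): CodeSymbol };
declare function estimateTokens(text: string): number;

const server = new McpServer({
  name: "context-aware-code-server",
  version: "1.0.0"
});

server.tool(
  "fetch_code_context",
  "Retrieve structurally complete code symbols matching a query. Returns only relevant definitions, not full files.",
  {
    query: z.string().describe("Natural language intent or exact identifier"),
    max_tokens: z.number().default(2000).describe("Hard token limit for response"),
    language: z.string().optional().describe("Target language for AST parsing")
  },
  async ({ query, max_tokens, language }) => {
    // Pipeline: retrieve candidates, expand to whole symbols, enforce the budget.
    const candidates = await hybridSearch(query, language);
    const structured = await extractSymbols(candidates, language);
    const budgeted = applyTokenBudget(structured, max_tokens);
    return {
      content: [{ type: "text", text: budgeted.formatted_output }],
      metadata: { tokens_used: budgeted.count, source_files: budgeted.sources }
    };
  }
);

// Runs keyword and semantic retrieval, then fuses the two rankings.
// Note: lang is accepted but not used by this version.
async function hybridSearch(query: string, lang?: string) {
  const exact = await bm25Index.search(query, { limit: 10 });
  const semantic = await embeddingModel.rerank(query, { limit: 10 });
  return reciprocalRankFusion(exact, semantic);
}

// Expands each line-level match to the complete symbol that contains it.
async function extractSymbols(candidates: SearchResult[], lang?: string) {
  return candidates.map(c => {
    const ast = parseWithTreeSitter(c.file_path, lang);
    return ast.extractSymbolAtLine(c.line_number);
  });
}

// Greedily keeps symbols in rank order and stops at the first one that
// would push the running total over the limit.
function applyTokenBudget(symbols: CodeSymbol[], limit: number) {
  let runningCount = 0;
  const selected: CodeSymbol[] = [];
  for (const sym of symbols) {
    const tokenEst = estimateTokens(sym.raw_text);
    if (runningCount + tokenEst > limit) break;
    selected.push(sym);
    runningCount += tokenEst;
  }
  return {
    formatted_output: selected.map(s => `${s.file}:${s.line}\n${s.raw_text}`).join("\n---\n"),
    count: runningCount,
    sources: [...new Set(selected.map(s => s.file))]
  };
}

async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
}

main().catch(console.error);
```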
Step 3: Architecture Rationale
Why MCP? The Model Context Protocol standardizes agent-tool communication over JSON-RPC. A server built once against this specification works with MCP-capable clients such as Claude, Cursor, GitHub Copilot in VS Code, and OpenAI's agentic interfaces, without bespoke integration layers.
Why Hybrid Retrieval? Pure semantic search struggles with exact API names, framework-specific decorators, and configuration keys. Pure keyword search misses intent. Reciprocal Rank Fusion (RRF) merges both ranked lists by summing 1/(k + rank) for each result across the lists, so results that rank highly in both rise to the top, results that rank highly in only one still surface, and low-ranked noise is suppressed.
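To make the fusion step concrete, here is a minimal sketch of a `reciprocalRankFusion` function with the same two-list call shape used in the server code. The `RankedHit` type and the sample identifiers are assumptions for illustration, since the article does not show its `SearchResult` type in full. The default `k = 60` matches the `fusion.k` value in the configuration template.

```typescript
// Hypothetical result shape: any stable identifier for a symbol works.
interface RankedHit {
  id: string; // e.g. "auth.ts:login"
}

// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
// where rank starts at 1 and a result absent from a list contributes nothing.
function reciprocalRankFusion(
  exact: RankedHit[],
  semantic: RankedHit[],
  k = 60
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of [exact, semantic]) {
    list.forEach((hit, position) => {
      const rank = position + 1;
      scores.set(hit.id, (scores.get(hit.id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score
  );
}

// Hypothetical rankings from the keyword and embedding retrievers.
const exactHits = [{ id: "auth.ts:login" }, { id: "auth.ts:logout" }];
const semanticHits = [{ id: "session.ts:create" }, { id: "auth.ts:login" }];
const fused = reciprocalRankFusion(exactHits, semanticHits);
```

In this example `auth.ts:login` appears in both lists, so its two contributions add up and it ranks first even though it was only second in the semantic list.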
Why AST-Aware Extraction? LLMs require syntactically complete blocks to reason accurately. Returning partial lines or raw file dumps forces the model to reconstruct context internally, wasting tokens and increasing hallucination risk. Tree-sitter exposes the exact boundaries of each syntax node, so the server can return complete function signatures, class definitions, or configuration scopes rather than arbitrary line ranges.
Why Token Budgeting? Context windows are finite. Enforcing hard limits at the tool boundary prevents runaway consumption during multi-turn loops. The applyTokenBudget function acts as a circuit breaker, truncating output before it crosses the threshold while preserving the highest-value symbols first.
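The `estimateTokens` helper that `applyTokenBudget` relies on is not shown in the article. One plausible stand-in, sketched below, is the common rule of thumb of about four characters per token. This ratio is an assumption and varies by tokenizer and by language, so a real tokenizer should replace it when exact accounting matters.

```typescript
// Rough heuristic (an assumption, not the article's implementation):
// about 4 characters per token. Rounds up so short strings are never
// counted as zero tokens.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const snippet = "function add(a: number, b: number) { return a + b; }";
const estimate = estimateTokens(snippet);
```

Because the estimate is approximate, it is safer to set `max_tokens` somewhat below the true limit than to rely on the heuristic being exact.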
Pitfall Guide
1. Unbounded Tool Output
Explanation: Tools return every matching result without volume constraints. During complex tasks, this rapidly saturates the context window.
Fix: Implement strict max_tokens or top_k parameters at the tool definition level. Always truncate output before serialization, prioritizing structurally complete symbols over partial matches.
2. Ignoring AST Boundaries
Explanation: Returning raw text matches breaks code structure. The agent receives fragmented lines that lack scope, imports, or type information.
Fix: Parse source files with tree-sitter or an LSP client. Extract complete function bodies, type declarations, or configuration blocks. Never return line ranges without structural context.
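A sketch of the boundary logic, assuming a node shape like the one tree-sitter's syntax nodes expose (`type`, `parent`, `startPosition`, `endPosition`). The `NodeLike` interface is declared locally so the example has no dependencies, and the node type names are taken from the TypeScript grammar; other grammars use different names, so the set would need adjusting per language.

```typescript
// Minimal structural subset of a syntax node, declared locally for the sketch.
interface NodeLike {
  type: string;
  parent: NodeLike | null;
  startPosition: { row: number };
  endPosition: { row: number };
}

// Node types treated as complete symbols (assumed from the TypeScript grammar).
const SYMBOL_TYPES = new Set([
  "function_declaration",
  "class_declaration",
  "method_definition",
  "interface_declaration",
]);

// Climb from the matched node to the nearest enclosing symbol and return its
// full row range, or null if the match sits outside any symbol.
function enclosingSymbolRange(
  node: NodeLike
): { startRow: number; endRow: number } | null {
  let current: NodeLike | null = node;
  while (current !== null) {
    if (SYMBOL_TYPES.has(current.type)) {
      return {
        startRow: current.startPosition.row,
        endRow: current.endPosition.row,
      };
    }
    current = current.parent;
  }
  return null;
}
```

A search hit on a single line inside a function body would thus expand to the row range of the whole function, which is the unit the tool should return.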
3. Context Bleed Across Turns
Explanation: Agents re-read identical files across multiple reasoning steps because no state tracks prior retrievals. This multiplies token consumption by 3-10× during long workflows.
Fix: Implement an LRU symbol cache with TTL expiration. Before executing a retrieval tool, check the cache for existing symbols. Return cached references instead of re-fetching.
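One way to sketch such a cache is to lean on the insertion order of a `Map`: re-inserting an entry on every read keeps the least recently used key at the front. The `SymbolCache` class below is a hypothetical illustration, not the article's code. The injectable `now` clock is an addition that makes expiry testable.

```typescript
class SymbolCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = Date.now) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

With the configuration template's values, this would be constructed as `new SymbolCache(500, 300 * 1000)`, keyed by something like file path plus symbol name.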
4. Over-Provisioning Embedding Models
Explanation: Deploying billion-parameter embedding models for code search introduces unnecessary latency and infrastructure cost without measurable retrieval gains.
Fix: Use code-optimized models under 200M parameters (e.g., codet5p-137m or sentence-transformers/all-MiniLM-L6-v2 fine-tuned on code). Pair with static BM25 for exact matching. This combination outperforms larger models on retrieval accuracy while reducing inference time by 80%.
5. Missing Fallback Strategies
Explanation: Semantic search fails when queries contain framework-specific syntax, versioned APIs, or exact string literals.
Fix: Implement confidence scoring. When semantic similarity drops below a threshold (e.g., 0.65), automatically route to exact-match grep or LSP symbol lookup. Log fallback triggers to refine retrieval thresholds over time.
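The routing decision can be sketched as a small function. The `ScoredHit` type, the `chooseRoute` name, and the in-memory `fallbackLog` are hypothetical. The 0.65 default and the `exact_match_grep` route name mirror the configuration template.

```typescript
interface ScoredHit {
  id: string;
  similarity: number; // e.g. cosine similarity between query and symbol
}

type Route = "semantic" | "exact_match_grep";

// Records each fallback so the threshold can be tuned from real traffic.
const fallbackLog: { query: string; bestSimilarity: number }[] = [];

function chooseRoute(query: string, hits: ScoredHit[], threshold = 0.65): Route {
  // With no hits, bestSimilarity stays at -Infinity and the query falls back.
  const bestSimilarity = hits.reduce(
    (best, hit) => Math.max(best, hit.similarity),
    -Infinity
  );
  if (bestSimilarity >= threshold) return "semantic";
  fallbackLog.push({ query, bestSimilarity });
  return "exact_match_grep";
}
```

Only the best hit is compared against the threshold here; a stricter variant could require several hits above it before trusting the semantic results.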
6. Neglecting Token Accounting
Explanation: Teams monitor model costs but ignore context consumption per agent turn. This obscures which tools or workflows are driving budget depletion.
Fix: Wrap all tool calls in a TokenBudget middleware that logs input/output token counts, tracks cumulative context usage per session, and throttles or pauses execution once cumulative usage reaches 80% of the session limit.
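The accounting core of such a middleware might look like the sketch below. The article names `TokenBudget` but does not show it, so the class shape, the `record` method, and the three-state result are assumptions. The defaults mirror `session_limit` and `throttle_at_percent` from the configuration template.

```typescript
type BudgetState = "ok" | "throttle" | "exhausted";

class TokenBudget {
  readonly log: { tool: string; inputTokens: number; outputTokens: number }[] = [];
  private used = 0;
  private sessionLimit: number;
  private throttleAtPercent: number;

  constructor(sessionLimit = 120000, throttleAtPercent = 80) {
    this.sessionLimit = sessionLimit;
    this.throttleAtPercent = throttleAtPercent;
  }

  // Call once per tool invocation with the token counts for that call.
  record(tool: string, inputTokens: number, outputTokens: number): BudgetState {
    this.used += inputTokens + outputTokens;
    this.log.push({ tool, inputTokens, outputTokens });
    return this.state();
  }

  state(): BudgetState {
    if (this.used >= this.sessionLimit) return "exhausted";
    if (this.used >= (this.sessionLimit * this.throttleAtPercent) / 100) {
      return "throttle";
    }
    return "ok";
  }
}
```

A tool wrapper would call `record` after each invocation and, on `"throttle"`, lower `max_tokens` for subsequent calls or pause the loop; the per-tool log shows which tools are driving consumption.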
7. Over-Reliance on Cloud Inference
Explanation: Routing every retrieval request through external embedding APIs introduces latency, rate limits, and data exfiltration risks.
Fix: Run lightweight embedding models locally using ONNX Runtime or llama.cpp. Reserve cloud APIs for high-complexity reasoning tasks only. Local inference reduces latency to sub-100ms and eliminates per-request costs.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Small repository (<50 files) | Local BM25 + tree-sitter | Low overhead, exact matching suffices, zero API cost | Near-zero |
| Large monorepo (>500 files) | Hybrid BM25 + static embeddings + LRU cache | Balances semantic intent with exact identifiers, prevents context bleed | ~$0.006/query |
| Strict latency requirements (<500ms) | Local ONNX embedding + inverted index | Eliminates network roundtrips, deterministic retrieval | Infrastructure only |
| Budget-constrained deployment | Static TF-IDF + AST extraction | No GPU or cloud API required, indexes in <250ms on CPU | Zero marginal cost |
| High-security environment | Fully local pipeline (no external APIs) | Prevents code exfiltration, maintains air-gapped compliance | Hardware provisioning only |
Configuration Template
```json
{
  "mcp_server": {
    "name": "context-aware-code-server",
    "version": "1.0.0",
    "transport": "stdio"
  },
  "retrieval": {
    "strategy": "hybrid_rrf",
    "bm25": {
      "k1": 1.2,
      "b": 0.75,
      "limit": 15
    },
    "embeddings": {
      "model": "codet5p-137m",
      "backend": "onnx",
      "limit": 15,
      "confidence_threshold": 0.65
    },
    "fusion": {
      "k": 60,
      "fallback": "exact_match_grep"
    }
  },
  "parsing": {
    "engine": "tree-sitter",
    "extract_mode": "complete_symbol",
    "supported_languages": ["typescript", "python", "rust", "go"]
  },
  "budgeting": {
    "max_tokens_per_call": 2000,
    "session_limit": 120000,
    "throttle_at_percent": 80,
    "cache_ttl_seconds": 300,
    "cache_max_entries": 500
  }
}
```
Quick Start Guide
- Initialize the MCP server: Run npm init @modelcontextprotocol/server and select the TypeScript template. Replace the default tool definitions with the fetch_code_context implementation shown above.
- Configure retrieval backends: Install bm25 and onnxruntime-node. Point the embedding configuration to a local codet5p-137m model file. Set the fusion parameters to match the configuration template.
- Index your codebase: Execute the indexing script against your target repository. Verify that tree-sitter parsers are installed for your primary languages. Confirm average indexing time remains under 300ms per repository.
- Connect to your agent: Update your AI coding client (Claude, Cursor, or custom ReAct loop) to point to the MCP server's stdio transport. Run a test query and verify that responses contain complete symbols, respect token limits, and log metadata correctly.
- Monitor and iterate: Track token consumption per session using the built-in accounting middleware. Adjust confidence_threshold and max_tokens_per_call based on trace analysis. Deploy to staging and validate against production-scale repositories before full rollout.