Base Layer
The built-in memory files (MEMORY.md, USER.md) are always active. They use § delimiters, auto-reject duplicates, and scan for injection patterns. Changes persist to disk immediately but inject at the next session boundary to preserve prefix caching.
import * as fs from 'fs';
import * as path from 'path';
export interface BuiltInMemoryState {
agentNotes: string;
userProfile: string;
usagePercent: number;
lastSnapshot: Date;
}
function initializeBaseLayer(configDir: string): BuiltInMemoryState {
const memoryPath = path.join(configDir, 'memories', 'MEMORY.md');
const userPath = path.join(configDir, 'memories', 'USER.md');
// Parse the usage percentage from the file header (e.g. "MEMORY [42%"); default to 0 if absent
const rawMemory = fs.readFileSync(memoryPath, 'utf-8');
const match = rawMemory.match(/MEMORY \[(\d+)%/);
const usage = match ? parseInt(match[1], 10) : 0;
return {
agentNotes: rawMemory,
userProfile: fs.readFileSync(userPath, 'utf-8'),
usagePercent: usage,
lastSnapshot: new Date()
};
}
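The entry handling described above (§ delimiters and duplicate rejection) can be sketched as follows. This is a minimal illustration, not the built-in implementation: it assumes entries are separated by a § character and that a duplicate means an exact match after whitespace and case normalization. The names `parseEntries`, `addEntry`, and `serializeEntries` are hypothetical.

```typescript
// Hypothetical helpers; the real delimiter layout and duplicate rules are assumptions.
const DELIMITER = "§";

// Split a memory file body into trimmed, non-empty entries.
function parseEntries(body: string): string[] {
  return body
    .split(DELIMITER)
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}

// Collapse whitespace and case so trivially different duplicates compare equal.
function normalize(entry: string): string {
  return entry.trim().replace(/\s+/g, " ").toLowerCase();
}

// Return the updated list, or the original list unchanged if the candidate
// is empty or already present.
function addEntry(
  entries: string[],
  candidate: string,
): { entries: string[]; added: boolean } {
  const key = normalize(candidate);
  if (key.length === 0 || entries.some((e) => normalize(e) === key)) {
    return { entries, added: false };
  }
  return { entries: [...entries, candidate.trim()], added: true };
}

// Serialize entries back into delimiter-separated text.
function serializeEntries(entries: string[]): string {
  return entries.join(`\n${DELIMITER}\n`);
}
```

Returning a new array instead of mutating the input keeps the in-session snapshot untouched, which fits the rule that changes only take effect at the next session boundary.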
Step 2: Provision the External Provider
Only one external provider can be active. It layers on top of the base layer. Configuration is managed via ~/.hermes/config.yaml or CLI. The orchestrator validates connectivity and establishes fallback routes.
export type MemoryProvider = 'hindsight' | 'holographic' | 'openviking' | 'mem0' | 'honcho' | 'byterover' | 'retaindb' | 'supermemory';
interface ProviderConfig {
active: MemoryProvider;
endpoint?: string;
apiKey?: string;
tokenBudget: number;
fallbackToBuiltIn: boolean;
}
async function provisionExternalProvider(config: ProviderConfig): Promise<void> {
const yamlPath = path.join(process.env.HOME || '', '.hermes', 'config.yaml');
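// An empty config file parses to null, so default to an empty object
const current = yaml.parse(fs.readFileSync(yamlPath, 'utf-8')) ?? {};
// Merge rather than overwrite so existing fallback and budget settings survive
current.memory = { ...current.memory, provider: config.active };
fs.writeFileSync(yamlPath, yaml.stringify(current));
// Validate connectivity (validateEndpoint is assumed to be defined elsewhere)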
if (config.endpoint) {
await validateEndpoint(config.endpoint, config.apiKey);
}
console.log(`[Memory] Provider ${config.active} activated. Fallback: ${config.fallbackToBuiltIn}`);
}
Step 3: Implement Token Budgeting & Tiered Injection
Uncontrolled context injection causes latency spikes and cost inflation. The middleware below enforces a token budget, routes to tiered loading when available, and triggers consolidation when built-in thresholds are breached.
export class MemoryRouter {
private budget: number;
private provider: MemoryProvider;
constructor(budget: number, provider: MemoryProvider) {
this.budget = budget;
this.provider = provider;
}
async resolveContext(query: string, sessionPhase: 'planning' | 'execution' | 'deep'): Promise<string> {
let context = '';
// Tiered routing for OpenViking-style architectures
if (this.provider === 'openviking') {
context = await this.loadTieredContext(query, sessionPhase);
} else {
context = await this.queryExternalStore(query);
}
// Enforce the token budget: truncate when over budget, or fall back to
// built-in memory when the budget is too small (< 50 tokens) to be useful
const tokenCount = this.estimateTokens(context);
if (tokenCount > this.budget) {
context = this.budget < 50
? await this.fallbackToBuiltIn(query)
: this.truncateToBudget(context, this.budget);
}
return context;
}
private async loadTieredContext(query: string, phase: string): Promise<string> {
const tierMap = { planning: 'L1', execution: 'L0', deep: 'L2' };
const tier = tierMap[phase as keyof typeof tierMap] || 'L0';
return await this.invokeTool('viking_search', { query, tier });
}
private estimateTokens(text: string): number {
return Math.ceil(text.length / 4); // Approximation for English/Code
}
}
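The router above calls `truncateToBudget` without defining it. One possible shape for that helper is sketched below as a standalone function. It assumes the retrieved context lists the most relevant material first, so dropping lines from the end loses the least, and it reuses the same 4-characters-per-token heuristic as `estimateTokens`. Both are assumptions, not behavior confirmed by any provider.

```typescript
// Same heuristic as the router's estimateTokens: roughly 4 characters per token.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Hypothetical stand-in for MemoryRouter.truncateToBudget: keep whole lines
// from the top until the next line would exceed the budget.
function truncateToBudget(text: string, budgetTokens: number): string {
  if (budgetTokens <= 0) return "";
  if (estimateTokens(text) <= budgetTokens) return text;

  const maxChars = budgetTokens * CHARS_PER_TOKEN;
  const kept: string[] = [];
  let used = 0;
  for (const line of text.split("\n")) {
    // +1 accounts for the newline that joins this line to the previous one.
    const cost = line.length + (kept.length > 0 ? 1 : 0);
    if (used + cost > maxChars) break;
    kept.push(line);
    used += cost;
  }
  // If even the first line is too long, cut it mid-line instead of returning nothing.
  return kept.length > 0 ? kept.join("\n") : text.slice(0, maxChars);
}
```

Cutting on line boundaries avoids handing the model a half-finished record; a production version would more likely use the provider's own tokenizer than a character count.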
Architecture Decisions & Rationale
- Layered Injection: Built-in memory provides deterministic, low-latency session state. External providers handle scalable retrieval. Separating them prevents external failures from corrupting core agent instructions.
- Frozen Snapshots: Deferring built-in injection to the next session preserves LLM prefix caching. Re-tokenizing the system prompt on every turn destroys cache efficiency and increases latency.
- Tiered Loading: Context requirements change per phase. Loading full documents during planning wastes tokens. L0/L1/L2 routing matches cognitive load to task stage, reducing overhead by 80-90%.
- Circuit Breakers: Cloud providers introduce network latency and rate limits. Wrapping external calls in timeout/retry logic with built-in fallback ensures agent responsiveness during provider outages.
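The circuit-breaker decision can be illustrated with a small state machine. This is a sketch under stated assumptions, not the orchestrator's actual logic: the `CircuitBreaker` class, its failure threshold, and its cooldown are hypothetical, and the clock is injectable so the behavior can be checked without waiting.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical breaker: opens after N consecutive failures, then permits
// trial calls once the cooldown has elapsed.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number,
    cooldownMs: number,
    now: () => number = () => Date.now(),
  ) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True when the external provider may be called; false means the caller
  // should serve built-in memory instead.
  canCall(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: permit trial calls
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed trial call reopens immediately; otherwise wait for the threshold.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

A caller would check `canCall()` before each external lookup, report timeouts through `recordFailure()`, and route to the built-in layer whenever the breaker is open.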
Pitfall Guide
| Pitfall | Explanation | Fix |
|---|---|---|
| Snapshot Staleness Misinterpretation | Built-in changes persist to disk but only inject at the next session boundary. Developers expect immediate reflection in the current turn. | Design workflows around session boundaries. Use external providers for intra-session state updates. Document the frozen snapshot behavior in runbooks. |
| Char Limit Blind Spots | MEMORY.md (2,200 chars) and USER.md (1,375 chars) silently truncate when exceeded. The agent auto-consolidates above 80%, but unmonitored growth causes data loss. | Implement pre-injection validation. Trigger manual consolidation routines when usage exceeds 75%. Parse the header percentage programmatically to alert on thresholds. |
| Provider Lock-in & Data Silos | Switching providers wipes the external knowledge base. No automated migration exists. Teams lose months of structured knowledge during provider swaps. | Export provider-specific dumps before migration. Treat external memory as an ephemeral cache. Maintain a canonical export routine in your deployment pipeline. |
| Trust Decay Neglect | Holographic's trust scoring requires explicit confirmation/contradiction signals. Without routing user corrections, memories decay incorrectly or persist as false. | Route explicit user feedback to trust update endpoints. Implement a confirmation loop for critical facts. Monitor trust weights in logs. |
| Token Budget Overflow | Loading full context on every turn inflates costs and latency. Unbounded retrieval causes context window exhaustion. | Enforce a strict token budget middleware. Use tiered loading (L0/L1/L2) or implement semantic compression. Set hard limits on retrieval results. |
| Circuit Breaker Bypass | Cloud providers (Mem0, Honcho, RetainDB) can fail or rate-limit. Unhandled exceptions block agent responses entirely. | Wrap all external calls in timeout/retry logic. Implement a fallback to built-in memory when external latency exceeds 2s. Log failures for capacity planning. |
| Delimiter Collision | Built-in memory uses § as an entry separator. User prompts containing this character break parsing and corrupt injection. | Sanitize inputs before injection. Escape or replace § in user-generated content. Add a pre-flight validation step in the prompt pipeline. |
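The pre-injection validation suggested for the character-limit pitfall could look like the sketch below. The limits (2,200 and 1,375 characters) and the 75% alert threshold come from the table; the `checkCapacity` function and its report shape are hypothetical, and counting with `String.length` (UTF-16 code units) is an assumption about how the limits are measured.

```typescript
// Limits from the pitfall table; the alert threshold follows the suggested 75%.
const CHAR_LIMITS = { "MEMORY.md": 2200, "USER.md": 1375 } as const;
const ALERT_THRESHOLD_PERCENT = 75;

type MemoryFile = keyof typeof CHAR_LIMITS;

interface CapacityReport {
  file: MemoryFile;
  chars: number;
  usagePercent: number; // rounded, for display only
  shouldConsolidate: boolean; // at or above the alert threshold
  wouldTruncate: boolean; // strictly over the hard limit
}

// Hypothetical pre-injection check: report how full a memory file is.
function checkCapacity(file: MemoryFile, content: string): CapacityReport {
  const limit = CHAR_LIMITS[file];
  const chars = content.length;
  return {
    file,
    chars,
    usagePercent: Math.round((chars / limit) * 100),
    // Compare with integer arithmetic so rounding never hides a threshold breach.
    shouldConsolidate: chars * 100 >= limit * ALERT_THRESHOLD_PERCENT,
    wouldTruncate: chars > limit,
  };
}
```

Running this before each write gives a chance to consolidate while the file is still below the point where content would be silently cut.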
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Privacy-first / Air-gapped | Holographic or Hindsight (Local) | Zero external dependencies, algebraic/graph storage, no data exfiltration | $0 infrastructure, minimal compute |
| High-scale / Cost-sensitive | OpenViking | Tiered L0/L1/L2 loading reduces token overhead by 80-90% | Self-hosted compute, near-zero API costs |
| Rapid prototyping / MVP | Mem0 | 30-second cloud setup, server-side extraction, circuit breaker included | Freemium tier, scales with usage |
| Web research / Browser workflows | SuperMemory | Native browser integration, persistent web content capture | Cloud subscription, moderate API costs |
| Multi-agent / Deep user modeling | Honcho | Dialectic reasoning, two-layer context injection, multi-agent profile sharing | Paid cloud or AGPL self-hosted |
| Retrieval quality / Production search | RetainDB | Hybrid vector + BM25 + reranking maximizes precision | Paid cloud, higher per-query cost |
Configuration Template
# ~/.hermes/config.yaml
memory:
  provider: hindsight
  fallback:
    enabled: true
    provider: built-in
    timeout_ms: 2000
  budget:
    max_tokens_per_turn: 1500
  tiered_loading:
    enabled: true
    phases:
      planning: L1
      execution: L0
      deep: L2
  security:
    scan_injection_patterns: true
    delimiter_escape: true
    auto_consolidate_threshold: 80
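Before loading a template like this, it can help to sanity-check the parsed values. The sketch below is illustrative only: it assumes `fallback`, `budget`, and `security` sit under the `memory` key, treats the budget and consolidation threshold as required, and operates on an already-parsed object. The `validateMemorySection` function is hypothetical and not part of any Hermes tooling.

```typescript
// Provider names come from the MemoryProvider union earlier in this guide.
const KNOWN_PROVIDERS: readonly string[] = [
  "hindsight", "holographic", "openviking", "mem0",
  "honcho", "byterover", "retaindb", "supermemory",
];

// Assumed shape of the `memory` section after YAML parsing.
interface MemorySection {
  provider?: unknown;
  fallback?: { enabled?: unknown; timeout_ms?: unknown };
  budget?: { max_tokens_per_turn?: unknown };
  security?: { auto_consolidate_threshold?: unknown };
}

function isPositiveInteger(value: unknown): boolean {
  return typeof value === "number" && Number.isInteger(value) && value > 0;
}

// Returns human-readable problems; an empty list means the section looks sane.
function validateMemorySection(memory: MemorySection): string[] {
  const problems: string[] = [];
  if (typeof memory.provider !== "string" || !KNOWN_PROVIDERS.includes(memory.provider)) {
    problems.push(`unknown provider: ${String(memory.provider)}`);
  }
  if (memory.fallback?.enabled === true && !isPositiveInteger(memory.fallback?.timeout_ms)) {
    problems.push("fallback.timeout_ms must be a positive integer when fallback is enabled");
  }
  if (!isPositiveInteger(memory.budget?.max_tokens_per_turn)) {
    problems.push("budget.max_tokens_per_turn must be a positive integer");
  }
  const threshold = memory.security?.auto_consolidate_threshold;
  if (typeof threshold !== "number" || threshold <= 0 || threshold > 100) {
    problems.push("security.auto_consolidate_threshold must be in (0, 100]");
  }
  return problems;
}
```

Collecting every problem in one pass, instead of throwing on the first, lets a deployment script report all misconfigurations together.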
// memory-pipeline.ts
import { MemoryRouter, type MemoryProvider } from './memory-router';
import { BuiltInMemoryState } from './built-in';
export class ProductionMemoryPipeline {
private router: MemoryRouter;
private baseState: BuiltInMemoryState;
constructor(configDir: string, provider: MemoryProvider, tokenBudget: number) {
this.baseState = this.initializeBase(configDir);
this.router = new MemoryRouter(tokenBudget, provider);
}
async resolve(query: string, phase: 'planning' | 'execution' | 'deep'): Promise<string> {
const sanitizedQuery = this.sanitizeDelimiter(query);
const context = await this.router.resolveContext(sanitizedQuery, phase);
return this.injectBaseLayer(context);
}
private sanitizeDelimiter(input: string): string {
return input.replace(/§/g, '[SECTION_BREAK]');
}
private injectBaseLayer(externalContext: string): string {
const basePrefix = `[SYSTEM MEMORY] ${this.baseState.agentNotes}\n[USER PROFILE] ${this.baseState.userProfile}\n`;
return `${basePrefix}\n[EXTERNAL RETRIEVAL]\n${externalContext}`;
}
private initializeBase(dir: string): BuiltInMemoryState {
// Placeholder: a real implementation would load both files as initializeBaseLayer(dir) does in Step 1
return { agentNotes: '', userProfile: '', usagePercent: 0, lastSnapshot: new Date() };
}
}
Quick Start Guide
- Verify baseline state: Run hermes memory status to confirm built-in memory is active and check current usage percentages in ~/.hermes/memories/.
- Select a provider: Execute hermes memory setup and choose your target provider. For local-first setups, select Hindsight or Holographic. For rapid cloud deployment, select Mem0.
- Configure routing: Update ~/.hermes/config.yaml with the provider name, set a max_tokens_per_turn budget, and enable fallback to built-in memory.
- Validate injection: Start a new session and verify that the system prompt includes both built-in snapshots and external retrieval results. Monitor token usage and latency during the first 10 turns.
- Establish maintenance: Schedule quarterly exports of external knowledge bases, implement consolidation alerts at 75% built-in capacity, and document session boundary behavior for your team.