Step 1: Trace Collection and Evaluation
Every execution generates a structured trace. The system captures tool invocations, decision branches, error states, and recovery attempts. This data feeds into an evaluator that identifies reusable patterns.
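// One record per task execution.
interface ExecutionTrace {
  taskId: string;
  toolCalls: Array<{ name: string; args: Record<string, any>; status: 'success' | 'error'; latencyMs: number }>;
  decisionNodes: Array<{ condition: string; outcome: string; fallback: string | null }>;
  outcome: 'completed' | 'partial' | 'failed';
  timestamp: number;
}

// Shape inferred from the fields analyze() returns.
interface EvaluationReport {
  qualifiesForSkill: boolean;
  extractedPatterns: string[];
  noveltyScore: boolean; // a flag despite the name: true when similarity to stored skills is below 0.7
  recommendedTags: string[];
}

class TraceEvaluator {
  async analyze(trace: ExecutionTrace): Promise<EvaluationReport> {
    const complexityScore = this.calculateComplexity(trace);
    const recoveryPatterns = this.extractRecoveryLogic(trace);
    const noveltyCheck = await this.compareAgainstVault(trace);
    return {
      // Qualifies on complexity alone, or on any recorded recovery.
      qualifiesForSkill: complexityScore >= 5 || recoveryPatterns.length > 0,
      extractedPatterns: recoveryPatterns,
      noveltyScore: noveltyCheck.similarity < 0.7,
      recommendedTags: this.inferTags(trace),
    };
  }

  // Each tool call counts 1; each decision node counts 2.
  private calculateComplexity(trace: ExecutionTrace): number {
    return trace.toolCalls.length + (trace.decisionNodes.length * 2);
  }

  // One pattern string per failed tool call.
  private extractRecoveryLogic(trace: ExecutionTrace): string[] {
    return trace.toolCalls
      .filter(call => call.status === 'error')
      // Guard against a missing fallbackStrategy so the pattern never reads "undefined".
      .map(call => `Recovery: ${call.name} failed, fallback applied: ${call.args.fallbackStrategy ?? 'none recorded'}`);
  }

  // compareAgainstVault() and inferTags() are omitted here.
}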
Architecture Rationale: The evaluator uses a complexity threshold (a score ≥5, where each tool call counts once and each decision node counts twice) to filter trivial operations. This prevents skill inflation. A trace with at least one failed tool call also qualifies regardless of its score, and recovery patterns are isolated from successful paths because error-handling logic is where most procedural value resides. The novelty check compares incoming traces against existing skills using semantic similarity, ensuring duplicate procedures aren't stored.
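The novelty check itself is not shown above. The following is a minimal sketch of one way it could work, assuming every trace and stored skill has already been converted to a numeric embedding by some model (that step is out of scope here). The names `StoredSkillVector`, `cosineSimilarity`, and `checkNovelty` are illustrative, not part of Hermes; the 0.7 cutoff mirrors the evaluator.

```typescript
// Hypothetical shape: a stored skill paired with a precomputed embedding.
interface StoredSkillVector {
  name: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors.
// Returns 0 for mismatched lengths, empty input, or zero-norm vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) return 0;
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Finds the closest stored skill and reports whether the trace is novel,
// meaning its best similarity stays below the cutoff.
function checkNovelty(
  traceEmbedding: number[],
  vault: StoredSkillVector[],
  cutoff = 0.7,
): { similarity: number; closest: string | null; isNovel: boolean } {
  let similarity = 0;
  let closest: string | null = null;
  for (const skill of vault) {
    const score = cosineSimilarity(traceEmbedding, skill.embedding);
    if (score > similarity) {
      similarity = score;
      closest = skill.name;
    }
  }
  return { similarity, closest, isNovel: similarity < cutoff };
}
```

A linear scan like this is fine for a few hundred skills; a larger vault would likely want an approximate nearest-neighbor index instead.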
Step 2: Procedural Extraction and Manifest Generation
When a trace qualifies, the system generates a structured skill manifest. This replaces opaque prompt templates with explicit, version-controlled documentation.
interface SkillManifest {
  name: string;
  version: string;
  description: string;
  metadata: {
    tags: string[];
    triggerConditions: string[];
    successRate: number;
    executionCount: number;
    minTools: string[];
  };
  procedure: Array<{
    step: number;
    action: string;
    decisionLogic: string;
    antiPatterns: string[];
  }>;
}

class SkillExtractor {
  async compileFromEvaluation(report: EvaluationReport, trace: ExecutionTrace): Promise<SkillManifest> {
    const baseName = this.generateSlug(report.recommendedTags);
    return {
      name: baseName,
      version: '1.0.0',
      description: `Automated procedure for ${report.recommendedTags.join(' + ')} workflows`,
      metadata: {
        tags: report.recommendedTags,
        triggerConditions: this.deriveTriggers(trace),
        successRate: 0.0,
        executionCount: 1,
        minTools: [...new Set(trace.toolCalls.map(t => t.name))],
      },
      procedure: this.structureProcedure(trace, report.extractedPatterns),
    };
  }

  private structureProcedure(trace: ExecutionTrace, patterns: string[]): SkillManifest['procedure'] {
    return trace.toolCalls.map((call, index) => ({
      step: index + 1,
      action: call.name,
      decisionLogic: `Execute with args: ${JSON.stringify(call.args)}`,
      antiPatterns: patterns.filter(p => p.includes(call.name)),
    }));
  }
}
Architecture Rationale: YAML-compatible JSON structures enable human review and Git version control. The procedure array enforces sequential execution logic while antiPatterns capture failure modes explicitly. This design mirrors human procedural memory: compressed decision trees rather than linear instruction lists.
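For Git review to be useful, the same manifest should always serialize to the same bytes. The sketch below shows one way to get that by sorting object keys before writing. `sortKeys` and `serializeManifest` are hypothetical helpers introduced for illustration, not part of the extractor above.

```typescript
// Recursively rebuilds a value with object keys in sorted order.
// Arrays keep their element order, since procedure steps are sequential.
function sortKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(sortKeys);
  if (value !== null && typeof value === 'object') {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      sorted[key] = sortKeys(source[key]);
    }
    return sorted;
  }
  return value;
}

// Produces indented JSON with stable key order and a trailing newline,
// so two semantically equal manifests yield an empty diff.
function serializeManifest(manifest: object): string {
  return JSON.stringify(sortKeys(manifest), null, 2) + '\n';
}
```

Because JSON is a subset of YAML for practical purposes, the output can be committed as-is and read by YAML tooling.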
Step 3: Progressive Disclosure Retrieval
Loading full skill manifests for every request destroys context window economics. Hermes implements a three-tier retrieval system that scales token usage with relevance, not library size.
// Compact entry used for index scanning; shape inferred from registerSkill().
interface SkillIndexEntry {
  name: string;
  summary: string;
  tags: string[];
}

class SkillVault {
  private index: Map<string, SkillIndexEntry> = new Map();
  private storage: Map<string, SkillManifest> = new Map();

  async registerSkill(manifest: SkillManifest): Promise<void> {
    this.index.set(manifest.name, {
      name: manifest.name,
      summary: manifest.description.slice(0, 120),
      tags: manifest.metadata.tags,
    });
    this.storage.set(manifest.name, manifest);
  }

  // Level 0: lightweight index, scanned on every request.
  async retrieveIndex(): Promise<SkillIndexEntry[]> {
    return Array.from(this.index.values());
  }

  // Level 1: full manifest, loaded only for a matched skill.
  async retrieveFullSkill(name: string): Promise<SkillManifest | null> {
    return this.storage.get(name) ?? null;
  }

  // Level 2: deep reference material, loaded on explicit demand.
  async retrieveDeepReference(name: string, referenceKey: string): Promise<string | null> {
    const skill = this.storage.get(name);
    if (!skill) return null;
    // Simulates loading /references/ subdirectory content
    return `Deep reference payload for ${referenceKey} in ${name}`;
  }
}
Architecture Rationale: The index tier consumes ~500 tokens regardless of vault size. Full manifests load only when semantic routing matches a task to a skill. Deep references (API templates, regex patterns, config snippets) load on explicit demand. This architecture ensures that an agent managing 500 skills pays the same base context cost as one managing 50.
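The tiering described above can be sketched as a small context assembler. This is an illustration under stated assumptions: tag overlap stands in for semantic routing, token counts use a rough four-characters-per-token heuristic, and `IndexEntry`, `estimateTokens`, `renderIndex`, and `assembleContext` are hypothetical names.

```typescript
interface IndexEntry {
  name: string;
  summary: string;
  tags: string[];
}

// Rough heuristic (assumption): about 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Level 0: one compact line per skill.
function renderIndex(entries: IndexEntry[]): string {
  return entries.map(e => `${e.name}: ${e.summary} [${e.tags.join(', ')}]`).join('\n');
}

// Always includes the index; adds the full manifest (Level 1) only for the
// best tag-overlap match, and only if index plus manifest fit the budget.
function assembleContext(
  taskTags: string[],
  entries: IndexEntry[],
  loadFull: (name: string) => string | null,
  budgetTokens = 1500,
): { context: string; loadedSkill: string | null; tokens: number } {
  const indexText = renderIndex(entries);

  let best: IndexEntry | null = null;
  let bestOverlap = 0;
  for (const entry of entries) {
    const overlap = entry.tags.filter(tag => taskTags.includes(tag)).length;
    if (overlap > bestOverlap) {
      best = entry;
      bestOverlap = overlap;
    }
  }

  let context = indexText;
  let loadedSkill: string | null = null;
  if (best !== null) {
    const full = loadFull(best.name);
    const combined = full === null ? null : `${indexText}\n${full}`;
    if (combined !== null && estimateTokens(combined) <= budgetTokens) {
      context = combined;
      loadedSkill = best.name;
    }
  }
  return { context, loadedSkill, tokens: estimateTokens(context) };
}
```

Deep references (Level 2) are deliberately absent from this path; they would be fetched by a separate, explicit call once the agent is already executing the matched skill.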
Step 4: GPU-Free Self-Evolution via DSPy + GEPA
Manual extraction handles immediate learning. Long-term optimization requires automated refinement. Hermes integrates DSPy with the Genetic-Pareto Prompt Evolution (GEPA) pipeline to optimize skills without GPU infrastructure.
GEPA operates in five stages:
- Trace Aggregation: Queries the SQLite execution database for historical sessions.
- Failure Diagnosis: An LLM analyzes failed traces to generate actionable side information (e.g., "selector assumed static class names").
- Parameter Mutation: DSPy treats skill instructions as optimizable parameters, generating variant procedures.
- Pareto Evaluation: Variants are scored against success rate, latency, and token consumption. Non-dominated solutions form the Pareto front.
- Skill Replacement: The highest-ranked variant updates the manifest version, preserving rollback capability.
This pipeline replaces manual prompt iteration with evolutionary search. Because DSPy compiles declarative programs into optimized prompts, GEPA can explore thousands of procedural variations using only API calls, eliminating the need for weight updates or dataset curation.
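The Pareto evaluation stage can be illustrated with a plain non-dominated filter over the three objectives listed above. This is a generic sketch of Pareto-front selection, not the GEPA or DSPy implementation; `VariantScore`, `dominates`, and `paretoFront` are names chosen for this example.

```typescript
interface VariantScore {
  id: string;
  successRate: number; // higher is better
  latencyMs: number;   // lower is better
  tokenCount: number;  // lower is better
}

// a dominates b when a is no worse on every objective
// and strictly better on at least one.
function dominates(a: VariantScore, b: VariantScore): boolean {
  const noWorse =
    a.successRate >= b.successRate &&
    a.latencyMs <= b.latencyMs &&
    a.tokenCount <= b.tokenCount;
  const strictlyBetter =
    a.successRate > b.successRate ||
    a.latencyMs < b.latencyMs ||
    a.tokenCount < b.tokenCount;
  return noWorse && strictlyBetter;
}

// Keeps every variant that no other variant dominates (O(n^2) comparison).
function paretoFront(variants: VariantScore[]): VariantScore[] {
  return variants.filter(
    candidate => !variants.some(other => other !== candidate && dominates(other, candidate)),
  );
}
```

The front usually contains several variants, so a final tie-break (for example, highest success rate among those within a latency ceiling) is still needed before one variant replaces the current manifest.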
Pitfall Guide
1. Skill Inflation
Explanation: Creating skills for trivial operations (e.g., "read file", "print output") bloats the index and degrades retrieval accuracy.
Fix: Enforce a minimum complexity threshold (≥5 tool calls or explicit error recovery). Implement deduplication using semantic similarity checks before registration.
2. Context Window Bleed
Explanation: Loading full manifests for every request causes token exhaustion and latency spikes.
Fix: Strictly enforce progressive disclosure. Never load Level 1 or Level 2 content during index scanning. Implement a hard token budget for the retrieval phase (e.g., ≤1500 tokens for index + matched skill).
3. Shallow Evaluation Metrics
Explanation: Tracking only pass/fail outcomes misses structural inefficiencies like redundant tool calls or suboptimal decision branches.
Fix: Analyze full execution traces. Measure step compression ratio, error recovery time, and decision node depth. Use these metrics to trigger GEPA optimization cycles.
4. Stale Skill Drift
Explanation: External APIs change, causing previously successful skills to fail silently or produce incorrect outputs.
Fix: Implement versioned manifests with automated regression testing. Schedule periodic trace audits that flag skills with declining success rates. Trigger automatic re-evaluation when failure thresholds are breached.
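A periodic trace audit of this kind can be sketched as a rolling success-rate check. The window size and the 0.8 floor below are illustrative defaults, not values from Hermes, and `SkillRun` and `flagDecliningSkills` are hypothetical names.

```typescript
interface SkillRun {
  skill: string;
  succeeded: boolean;
  timestamp: number;
}

// Returns the skills whose success rate over their most recent `windowSize`
// runs is below `minSuccessRate`. Skills with fewer runs than the window
// are skipped, since there is too little data to judge them.
function flagDecliningSkills(
  runs: SkillRun[],
  windowSize = 20,
  minSuccessRate = 0.8,
): string[] {
  const bySkill = new Map<string, SkillRun[]>();
  for (const run of runs) {
    const history = bySkill.get(run.skill) ?? [];
    history.push(run);
    bySkill.set(run.skill, history);
  }

  const flagged: string[] = [];
  bySkill.forEach((history, skill) => {
    const recent = history
      .slice()
      .sort((a, b) => a.timestamp - b.timestamp)
      .slice(-windowSize);
    if (recent.length < windowSize) return;
    const rate = recent.filter(run => run.succeeded).length / recent.length;
    if (rate < minSuccessRate) flagged.push(skill);
  });
  return flagged;
}
```

Flagged skills would then be queued for regression tests or re-evaluation rather than disabled outright.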
5. GEPA Over-Optimization
Explanation: Optimizing exclusively for success rate can produce overly complex procedures that increase latency and token usage.
Fix: Use Pareto-front selection balancing three objectives: success rate, execution latency, and token efficiency. Reject variants that improve one metric while degrading another beyond acceptable thresholds.
6. Ignoring Deterministic Fallbacks
Explanation: Relying solely on learned skills creates brittle systems when novel edge cases appear.
Fix: Maintain a hybrid routing layer. Route tasks to skills when similarity confidence exceeds 0.85. Fall back to base agent reasoning with explicit guardrails for low-confidence matches.
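A hybrid routing layer along these lines can be very small. The sketch below assumes an upstream matcher has already produced similarity scores per skill; `SkillMatch`, `Route`, and `routeTask` are illustrative names, and the 0.85 threshold is the one quoted above.

```typescript
interface SkillMatch {
  skill: string;
  confidence: number;
}

type Route =
  | { kind: 'skill'; skill: string; confidence: number }
  | { kind: 'base-agent'; reason: string };

// Routes to the best-matching skill only when its confidence strictly
// exceeds the threshold; otherwise falls back to base agent reasoning.
function routeTask(matches: SkillMatch[], threshold = 0.85): Route {
  const best = matches.reduce<SkillMatch | null>(
    (top, match) => (top === null || match.confidence > top.confidence ? match : top),
    null,
  );
  if (best === null) {
    return { kind: 'base-agent', reason: 'no candidate skills' };
  }
  if (best.confidence > threshold) {
    return { kind: 'skill', skill: best.skill, confidence: best.confidence };
  }
  return {
    kind: 'base-agent',
    reason: `best match ${best.skill} at ${best.confidence} is not above ${threshold}`,
  };
}
```

Returning a reason with the fallback makes low-confidence cases easy to audit later, which is also where candidate traces for new skills tend to come from.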
7. Unsanitized Trace Storage
Explanation: Storing raw execution traces without PII or sensitive data filtering creates compliance and security risks.
Fix: Implement trace sanitization before evaluation. Strip API keys, user identifiers, and internal paths. Hash sensitive arguments while preserving structural metadata for analysis.
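As a rough illustration of sanitizing before evaluation, the sketch below replaces values under sensitive-looking keys with a short fingerprint while leaving the argument structure intact. Two assumptions to note: the key patterns are illustrative, and the FNV-1a hash is a dependency-free, non-cryptographic stand-in. A real deployment would use a keyed cryptographic hash such as HMAC-SHA-256. `fingerprint` and `sanitizeArgs` are hypothetical names.

```typescript
// Illustrative key patterns; extend to match your compliance rules.
const SENSITIVE_KEY = /(api_key|token|secret|password)/i;

// FNV-1a 32-bit hash, rendered as 8 hex characters.
// NOT cryptographic: used here only to keep the sketch self-contained.
function fingerprint(value: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < value.length; i++) {
    hash ^= value.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ('00000000' + hash.toString(16)).slice(-8);
}

// Returns a copy of `args` with sensitive values replaced by fingerprints.
// Nested plain objects are sanitized recursively; arrays and other values
// are copied through unchanged, so secrets embedded inside strings or
// arrays are not detected by this sketch.
function sanitizeArgs(args: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const key of Object.keys(args)) {
    const value = args[key];
    if (SENSITIVE_KEY.test(key)) {
      clean[key] = `[hashed:${fingerprint(String(value))}]`;
    } else if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      clean[key] = sanitizeArgs(value as Record<string, unknown>);
    } else {
      clean[key] = value;
    }
  }
  return clean;
}
```

Because equal inputs produce equal fingerprints, the evaluator can still tell that two tool calls used the same credential without ever seeing it.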
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Single-use workflows with strict state requirements | LangGraph / LangChain | Deterministic execution, explicit state management | Low initial, high maintenance |
| Multi-agent role specialization | CrewAI / AutoGen | Fast prototyping, clear role boundaries | Medium, scales with agent count |
| Static knowledge retrieval | RAG + Vector DB | Efficient document lookup, low latency | Medium, indexing overhead |
| Recurring automation with evolving edge cases | Hermes Agent | Procedural compounding, GPU-free optimization | Higher initial setup, lower long-term |
| High-volume, low-latency inference | Fine-tuning | Compressed weights, minimal context overhead | High GPU/dataset costs, slow iteration |
Configuration Template
# hermes-runtime.config.yaml
runtime:
  mode: persistent
  trace_storage: sqlite
  trace_retention_days: 90
  sanitization:
    enabled: true
    patterns: ["api_key", "token", "secret", "password"]
skill_vault:
  base_path: ~/.hermes/skills
  index_cache_ttl: 3600
  progressive_disclosure:
    level_0_max_tokens: 600
    level_1_auto_load: true
    level_2_on_demand: true
extraction:
  complexity_threshold: 5
  novelty_similarity_cutoff: 0.7
  auto_versioning: true
gepa_pipeline:
  enabled: true
  schedule: "0 2 * * 0" # Weekly, Sundays at 2 AM
  pareto_objectives:
    - success_rate
    - latency_ms
    - token_count
  max_generations: 50
  population_size: 20
Quick Start Guide
- Initialize the runtime: Run `hermes init --mode persistent` to create the skill vault directory and SQLite trace database.
- Configure extraction thresholds: Edit `hermes-runtime.config.yaml` to set complexity thresholds and sanitization rules matching your compliance requirements.
- Deploy base tools: Register your initial toolset (browser automation, file I/O, API clients) using the standard Hermes tool registry interface.
- Execute baseline tasks: Run 10-20 representative workflows to populate the trace database with execution patterns and failure modes.
- Trigger first evolution cycle: Run `hermes gepa optimize --target all` to generate initial skill variants and establish the Pareto front for ongoing refinement.