
oh-my-agent: 9 new skills, cursor as first-class vendor, 80/100 benchmark

By Codcompass Team · 6 min read

Unified Agent Orchestration: Benchmarking, Security, and Cross-Vendor Consistency

Current Situation Analysis

The proliferation of AI coding agents has introduced a critical operational challenge: vendor fragmentation and behavioral drift. Engineering teams rarely standardize on a single agent tool. Instead, they run parallel workflows across Cursor, Codex, and other CLI-based agents. This heterogeneity creates immediate pain points:

  1. Inconsistent Scaffolding: Agents frequently diverge on project initialization. One vendor might scaffold a Next.js app with an outdated version, ignore existing linting configurations, or generate UI components (like save buttons) without implementing the underlying storage logic.
  2. Security Surface Expansion: Ad-hoc agent usage often bypasses security controls. Path traversal vulnerabilities in output arguments, lack of input validation on reference files, and susceptibility to character normalization attacks (e.g., fullwidth Unicode bypasses) expose repositories to risk.
  3. Flawed Evaluation: Most teams rely on single-shot benchmarks to evaluate agent performance. These metrics are statistically noisy and fail to capture reliability across functional correctness, specification adherence, and engineering efficiency.

This problem is often overlooked because teams treat agents as isolated utilities rather than components of a unified orchestration layer. Without a control plane, drift accumulates silently. Recent data from the oh-my-agent (oma) project highlights the severity: unmanaged agents score significantly lower on comprehensive benchmarks compared to orchestrated workflows, and security gaps in CLI argument parsing remain common across vendor implementations.

WOW Moment: Key Findings

The most significant insight from recent benchmarking efforts is the performance delta between unified orchestration and raw vendor CLIs. The benchmark methodology itself is a differentiator: it uses a 5-axis evaluation model with multi-judge averaging across three rounds, eliminating the variance inherent in single-shot testing.

| Agent Framework | Functional | Spec | Visual | Engineering | Efficiency | Total Score |
| --- | --- | --- | --- | --- | --- | --- |
| oma (Unified) | 35 | 15 | 20 | 20 | 10 | 80.6 |
| omc | 30 | 12 | 16 | 14 | 8 | 74.1 |
| superpowers | 28 | 11 | 17 | 14 | 7 | 72.9 |
| vanilla | 27 | 10 | 15 | 13 | 6 | 70.7 |
| ecc | 26 | 10 | 14 | 13 | 7 | 70.2 |

Why this matters:

  • Reliability over Peak Performance: The multi-judge approach proves that oma delivers consistent results, not just lucky single-run outputs.
  • Engineering Efficiency: The 20-point engineering axis rewards architectural decisions, security hardening, and maintainability. oma's lead here indicates it produces production-ready code, not just functional snippets.
  • Actionable Intelligence: The benchmark scores correlate directly with reduced technical debt. Teams using orchestrated workflows spend less time fixing agent-induced drift and security issues.
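
To make the methodology concrete, the averaging step behind these totals can be sketched as follows. This is an illustrative reconstruction: the axis names and the score-sheet layout are assumptions for the example, not oma's actual data model or API.

```python
from statistics import mean

# Five evaluation axes from the benchmark; names are assumed for illustration.
AXES = ["functional", "spec", "visual", "engineering", "efficiency"]

def aggregate(runs: list[dict[str, float]]) -> dict[str, float]:
    """Average each axis across all (judge, round) score sheets, then total.

    `runs` holds one dict per judge per round; e.g. 3 judges x 3 rounds = 9 dicts.
    """
    per_axis = {axis: mean(run[axis] for run in runs) for axis in AXES}
    per_axis["total"] = sum(per_axis.values())
    return per_axis

# Example: two simplified score sheets for one agent framework.
runs = [
    {"functional": 27, "spec": 10, "visual": 16, "engineering": 13, "efficiency": 7},
    {"functional": 29, "spec": 12, "visual": 18, "engineering": 15, "efficiency": 7},
]
scores = aggregate(runs)
# Averaging across judges and rounds damps single-shot variance:
# an unlucky (or lucky) individual run moves the total far less.
```

The point of the design is that the reported total is a mean over many independent judgments, so one outlier run cannot dominate the score.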

Core Solution

To address fragmentation, drift, and security, oh-my-agent implements a control plane that standardizes agent interaction, enforces security boundaries, and provides observable metrics. The architecture rests on four pillars:

1. Consolidated Configuration and Model Management

Vendor-specific configurations are consolidated into a single oma-config.yaml. This file defines model presets, agent routing, and auto-approval policies. The system includes a model management suite that diffs the local registry against external sources like OpenRouter and Cursor's model list.

Architecture Decision: Centralizing configuration allows for atomic updates and migration safety. Legacy mappings are auto-migrated via versioned migration scripts (e.g., migration 008), ensuring backward compatibility while enforcing modern standards.
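
The registry diff described above reduces to a set comparison between the local registry and the external source. The sketch below assumes both registries flatten to plain name-to-metadata mappings; oma's real model:check output format may differ.

```python
# Hypothetical registry diff; model names and the "ctx" field are illustrative.
def diff_registry(local: dict[str, dict], remote: dict[str, dict]) -> dict[str, list[str]]:
    """Compare a local model registry against an external source such as
    OpenRouter or Cursor's model list."""
    local_names, remote_names = set(local), set(remote)
    return {
        "missing_locally": sorted(remote_names - local_names),
        "unknown_remotely": sorted(local_names - remote_names),
        "changed": sorted(
            name for name in local_names & remote_names if local[name] != remote[name]
        ),
    }

local = {"example-model": {"ctx": 128_000}, "example-legacy-model": {"ctx": 8_000}}
remote = {"example-model": {"ctx": 128_000}, "example-new-model": {"ctx": 200_000}}
delta = diff_registry(local, remote)
# → {'missing_locally': ['example-new-model'],
#    'unknown_remotely': ['example-legacy-model'], 'changed': []}
```

Diff output like this is what makes migrations auditable: a versioned migration script can consume the delta and apply it atomically.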

Configuration Example:

```yaml
# oma-config.yaml
version: "2.0"
model_preset: "production-optimized"

agents:
  cursor:
    vendor: "cursor"
    routing: "composer-2"
    auto_approve: true
    preset: "cursor-only"

  codex:
    vendor: "codex"
    cli_guard: true
    arg_terminator: "--"

security:
  path_traversal_protection: true
  mime_validation: true
  nfkc_normalization: true

observability:
  auto_update_cli: true
  benchmark_rounds: 3
```


2. Dynamic Skill Registration

Skills are no longer hardcoded. The system scans .agents/skills/ directories at build time, reading frontmatter to auto-register capabilities. This enables modular expansion without touching core logic.

Skill Definition Example:

```yaml
# .agents/skills/oma-deepsec/skill.yaml
name: "oma-deepsec"
description: "Vercel deepsec driver integration for security auditing"
trigger_keywords:
  - "/deepsec"
  - "security-audit"
languages:
  - typescript
  - javascript
  - python
execution:
  command: "oma run deepsec"
  requires_auth: false
```

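The scan-and-register step can be sketched in a few lines. This is an illustrative reconstruction, not oma's build code: the directory layout matches the example above, but the tiny hand-rolled parser (used here to keep the sketch dependency-free) handles only flat `key: "value"` lines, where a real build would use a YAML library.

```python
import tempfile
from pathlib import Path

def parse_flat_yaml(text: str) -> dict[str, str]:
    """Parse only top-level `key: "value"` lines; nested keys are skipped."""
    meta = {}
    for line in text.splitlines():
        if ":" in line and not line.startswith((" ", "#", "-")):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip('"')
    return meta

def register_skills(root: Path) -> dict[str, dict[str, str]]:
    """Scan <root>/*/skill.yaml and auto-register each skill by name."""
    registry = {}
    for skill_file in sorted(root.glob("*/skill.yaml")):
        meta = parse_flat_yaml(skill_file.read_text())
        if "name" in meta:  # a skill without a name cannot be routed to
            registry[meta["name"]] = meta
    return registry

# Demo: materialize one skill definition in a scratch directory and scan it.
root = Path(tempfile.mkdtemp())
skill_dir = root / "oma-custom"
skill_dir.mkdir()
(skill_dir / "skill.yaml").write_text('name: "oma-custom"\ndescription: "demo skill"\n')
registry = register_skills(root)
```

Because registration is driven entirely by what exists on disk, adding a capability is a file drop, not a code change.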
3. Security Hardening and Input Validation

The orchestration layer enforces strict input validation:

  • NFKC Normalization: Prevents hook bypasses using fullwidth Unicode characters (e.g., ｐａｒａｌｌｅｌ is normalized to parallel before evaluation).
  • Path Traversal Protection: Output and reference paths are validated against traversal patterns.
  • Magic-Byte Validation: Reference images are verified via MIME type detection, not just file extensions.
  • CLI Invocation Guard: A two-tier guard ensures commands like claude review this code route to workflows, while claude exec --foo is blocked from unintended execution.
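
The first three guards above fit in a short sketch. The function names and the sandbox-root convention are assumptions for illustration; only the techniques themselves (NFKC folding, canonical path resolution, magic-byte checks) come from the list above.

```python
import tempfile
import unicodedata
from pathlib import Path

def normalize_input(text: str) -> str:
    """NFKC-fold lookalike characters (e.g. fullwidth letters) before hook checks."""
    return unicodedata.normalize("NFKC", text)

def is_safe_output_path(candidate: str, sandbox: Path) -> bool:
    """Reject --out/--reference values that resolve outside the sandbox root."""
    resolved = (sandbox / candidate).resolve()
    return resolved.is_relative_to(sandbox.resolve())

# Magic bytes for the two most common reference-image formats.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"

def looks_like_image(data: bytes) -> bool:
    """Check leading magic bytes instead of trusting the file extension."""
    return data.startswith(PNG_MAGIC) or data.startswith(JPEG_MAGIC)

sandbox = Path(tempfile.mkdtemp())
# Fullwidth "ｐａｒａｌｌｅｌ" folds to plain "parallel" under NFKC,
# so a keyword detector that runs after normalization cannot be bypassed.
assert normalize_input("ｐａｒａｌｌｅｌ") == "parallel"
```

Note the ordering: normalization must run before any keyword or pattern check, and path validation must operate on the resolved path, not the raw argument string.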

4. Cross-Vendor Routing and Windows Compatibility

Cursor is promoted to a first-class vendor with dedicated routing (composer-2) and auto-approval presets. Windows support is robust, handling junction and hardlink fallbacks when symlinks fail with EPERM. Path separators are normalized across all IO layers, including gitignore handling and doctor diagnostics.
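
The fallback chain (symlink, then junction, then hardlink) can be sketched as below. This is an illustrative sketch, not oma's implementation; the junction branch uses CPython's Windows-only `_winapi.CreateJunction` helper and is never reached on POSIX systems, where the initial symlink succeeds.

```python
import os
from pathlib import Path

def link_with_fallback(src: Path, dst: Path) -> str:
    """Create dst pointing at src, degrading gracefully on Windows EPERM."""
    try:
        os.symlink(src, dst, target_is_directory=src.is_dir())
        return "symlink"
    except OSError:
        # Typically EPERM on Windows without Developer Mode or admin rights.
        if os.name == "nt" and src.is_dir():
            import _winapi  # CPython's Windows-only junction helper
            _winapi.CreateJunction(str(src), str(dst))
            return "junction"
        os.link(src, dst)  # hardlink fallback for regular files
        return "hardlink"

# Demo in a scratch directory.
import tempfile
root = Path(tempfile.mkdtemp())
(root / "skills").mkdir()
kind = link_with_fallback(root / "skills", root / "skills-link")
```

Returning which mechanism was used is deliberate: doctor-style diagnostics can then report degraded link modes instead of failing silently.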

Pitfall Guide

| Pitfall | Explanation | Fix |
| --- | --- | --- |
| NFKC Bypass | Attackers use fullwidth Unicode to bypass keyword detectors, e.g. ｐａｒａｌｌｅｌ evades a regex matching parallel. | Implement NFKC normalization on all input strings before hook evaluation. |
| Path Traversal | Malicious --out or --reference arguments can write files outside the intended directory. | Validate paths against traversal patterns. Use canonical path resolution. |
| Variadic Arg Swallowing | CLI tools like codex exec -i may treat the prompt as a second reference image if arguments are variadic. | Terminate variable arguments with -- before the instruction string. |
| Hardcoded Skill Maps | Maintaining a static list of skills requires code changes for every new capability. | Use frontmatter-based auto-registration in .agents/skills/. |
| Windows Symlink EPERM | Symlinks often fail on Windows due to permission restrictions. | Implement fallback logic to use junctions or hardlinks when symlinks raise EPERM. |
| Single-Shot Benchmarking | Evaluating agents on a single run produces noisy, unreliable metrics. | Use multi-judge averaging across multiple rounds (e.g., 3 rounds, 3 judges). |
| Translation Drift | i18n files diverge across locales, leading to missing or incorrect translations. | Run oma docs i18n to detect drift. Use oma docs lint to flag em-dashes and placeholders. |
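
The variadic-argument fix is easiest to see in code. The argv builder below is hypothetical (it is not codex's real flag set); it only demonstrates why the terminator matters: without "--", a variadic -i flag would swallow the prompt as one more reference image.

```python
def build_exec_argv(references: list[str], prompt: str) -> list[str]:
    """Build an agent-exec command line with the variadic list safely terminated."""
    argv = ["codex", "exec"]
    if references:
        argv += ["-i", *references]  # variadic: consumes arguments until stopped
    argv += ["--", prompt]  # "--" ends option parsing; the prompt stays positional
    return argv

argv = build_exec_argv(["mock.png"], "implement the save button")
# The prompt sits after "--", so the variadic -i list stops at "mock.png".
```

This convention ("--" ends option parsing) is standard POSIX CLI behavior, which is why it works as a guard across vendors.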

Production Bundle

Action Checklist

  • Install oma: Use curl for macOS/Linux or irm for Windows. Verify installation with oma doctor.
  • Initialize Config: Run oma init to generate oma-config.yaml. Review model presets and agent routing.
  • Enable Skills: Place skill definitions in .agents/skills/. Verify auto-registration via oma skills list.
  • Run Security Audit: Execute oma deepsec to scan for path traversal and input validation gaps.
  • Benchmark Workflow: Run oma benchmark to measure performance against the 5-axis model. Compare scores to baseline.
  • Check Model Registry: Run oma model:check to diff local models against OpenRouter/Cursor. Apply patches if needed.
  • Validate i18n: Run oma docs i18n to detect translation drift. Fix flagged issues with oma docs lint.
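
At its core, the i18n drift check from the last item is a key-set comparison against the base locale. The toy version below assumes locale files flatten to key-to-string mappings; oma docs i18n presumably does something richer than this.

```python
def find_drift(base: dict[str, str], locale: dict[str, str]) -> dict[str, list[str]]:
    """Report keys missing from a locale and keys no longer in the base locale."""
    return {
        "missing": sorted(set(base) - set(locale)),    # untranslated strings
        "orphaned": sorted(set(locale) - set(base)),   # stale translations
    }

# Hypothetical flattened locale files; keys and strings are illustrative.
en = {"cta.save": "Save", "cta.cancel": "Cancel"}
ja = {"cta.save": "保存"}
drift = find_drift(en, ja)
# → {'missing': ['cta.cancel'], 'orphaned': []}
```

Running a check like this in CI is what keeps locale files from diverging silently between releases.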

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| Multi-Vendor Team | Use oma with consolidated config | Standardizes behavior across Cursor, Codex, etc. Reduces drift. | Low setup cost; high ROI on consistency. |
| Security-First Project | Enable oma-deepsec and path validation | Prevents path traversal and input bypasses. | Minimal overhead; critical risk reduction. |
| Benchmarking Agents | Use oma's multi-judge benchmark | Provides reliable, reproducible metrics. | Requires compute for 3 rounds; accurate results. |
| Windows Development | Use install.ps1 and junction fallbacks | Handles symlink limitations gracefully. | No cost; ensures cross-platform compatibility. |
| Rapid Prototyping | Use oma-skill-creator | Generates new skills quickly from templates. | Accelerates development; maintains structure. |

Configuration Template

```yaml
# oma-config.yaml
# Production-ready configuration template

version: "2.0"
model_preset: "high-fidelity"

agents:
  cursor:
    vendor: "cursor"
    routing: "composer-2"
    auto_approve: true
    preset: "cursor-only"

  codex:
    vendor: "codex"
    cli_guard: true
    arg_terminator: "--"
    variadic_fix: true

security:
  path_traversal_protection: true
  mime_validation: true
  nfkc_normalization: true
  magic_byte_check: true

observability:
  auto_update_cli: true
  benchmark_rounds: 3
  benchmark_judges: 3

skills:
  auto_register: true
  directory: ".agents/skills"

docs:
  i18n_locales:
    - en
    - ja
    - zh
  lint_rules:
    - em_dash_cjk
    - placeholder_validation
```
Quick Start Guide

1. Install:

```shell
# macOS / Linux
curl -fsSL https://raw.githubusercontent.com/first-fluke/oh-my-agent/main/cli/install.sh | bash

# Windows
irm https://raw.githubusercontent.com/first-fluke/oh-my-agent/main/cli/install.ps1 | iex
```

2. Initialize:

```shell
oma init
```

3. Verify:

```shell
oma doctor
```

4. Run Benchmark:

```shell
oma benchmark
```

5. Deploy Skill:

```shell
mkdir -p .agents/skills/oma-custom
# Add skill.yaml to the directory, then confirm auto-registration:
oma skills list
```

This orchestration layer transforms fragmented agent usage into a standardized, secure, and measurable workflow. By consolidating configuration, enforcing security boundaries, and providing reliable benchmarks, teams can achieve consistent, production-grade results across all AI coding agents.