
oh-my-agent: 9 new skills, cursor as first-class vendor, 80/100 benchmark

By Codcompass Team · 6 min read

Unified Agent Orchestration: Benchmarking, Security, and Cross-Vendor Consistency

Current Situation Analysis

The proliferation of AI coding agents has introduced a critical operational challenge: vendor fragmentation and behavioral drift. Engineering teams rarely standardize on a single agent tool. Instead, they run parallel workflows across Cursor, Codex, and other CLI-based agents. This heterogeneity creates immediate pain points:

  1. Inconsistent Scaffolding: Agents frequently diverge on project initialization. One vendor might scaffold a Next.js app with an outdated version, ignore existing linting configurations, or generate UI components (like save buttons) without implementing the underlying storage logic.
  2. Security Surface Expansion: Ad-hoc agent usage often bypasses security controls. Path traversal vulnerabilities in output arguments, lack of input validation on reference files, and susceptibility to character normalization attacks (e.g., fullwidth Unicode bypasses) expose repositories to risk.
  3. Flawed Evaluation: Most teams rely on single-shot benchmarks to evaluate agent performance. These metrics are statistically noisy and fail to capture reliability across functional correctness, specification adherence, and engineering efficiency.

This problem is often overlooked because teams treat agents as isolated utilities rather than components of a unified orchestration layer. Without a control plane, drift accumulates silently. Recent data from the oh-my-agent (oma) project highlights the severity: unmanaged agents score significantly lower on comprehensive benchmarks compared to orchestrated workflows, and security gaps in CLI argument parsing remain common across vendor implementations.

WOW Moment: Key Findings

The most significant insight from recent benchmarking efforts is the performance delta between unified orchestration and raw vendor CLIs. The benchmark methodology itself is a differentiator: it uses a 5-axis evaluation model with multi-judge averaging across three rounds, eliminating the variance inherent in single-shot testing.

| Agent Framework | Functional | Spec | Visual | Engineering | Efficiency | Total Score |
| --- | --- | --- | --- | --- | --- | --- |
| oma (Unified) | 35 | 15 | 20 | 20 | 10 | 80.6 |
| omc | 30 | 12 | 16 | 14 | 8 | 74.1 |
| superpowers | 28 | 11 | 17 | 14 | 7 | 72.9 |
| vanilla | 27 | 10 | 15 | 13 | 6 | 70.7 |
| ecc | 26 | 10 | 14 | 13 | 7 | 70.2 |

Why this matters:

  • Reliability over Peak Performance: The multi-judge approach proves that oma delivers consistent results, not just lucky single-run outputs.
  • Engineering Efficiency: The 20-point engineering axis rewards architectural decisions, security hardening, and maintainability. oma's lead here indicates it produces production-ready code, not just functional snippets.
  • Actionable Intelligence: The benchmark scores correlate directly with reduced technical debt. Teams using orchestrated workflows spend less time fixing agent-induced drift and security issues.
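
To make the methodology concrete, the averaging step behind these totals can be sketched as follows. This is an illustrative reconstruction: the axis names and the score-sheet layout are assumptions for the example, not oma's actual data model or API.

```python
from statistics import mean

# Five evaluation axes from the benchmark; names are assumed for illustration.
AXES = ["functional", "spec", "visual", "engineering", "efficiency"]

def aggregate(runs: list[dict[str, float]]) -> dict[str, float]:
    """Average each axis across all (judge, round) score sheets, then total.

    `runs` holds one dict per judge per round; e.g. 3 judges x 3 rounds = 9 dicts.
    """
    per_axis = {axis: mean(run[axis] for run in runs) for axis in AXES}
    per_axis["total"] = sum(per_axis.values())
    return per_axis

# Example: two simplified score sheets for one agent framework.
runs = [
    {"functional": 27, "spec": 10, "visual": 16, "engineering": 13, "efficiency": 7},
    {"functional": 29, "spec": 12, "visual": 18, "engineering": 15, "efficiency": 7},
]
scores = aggregate(runs)
# Averaging across judges and rounds damps single-shot variance:
# an unlucky (or lucky) individual run moves the total far less.
```

The point of the design is that the reported total is a mean over many independent judgments, so one outlier run cannot dominate the score.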

Core Solution

To address fragmentation, drift, and security, oh-my-agent implements a control plane that standardizes agent interaction, enforces security boundaries, and provides observable metrics. The architecture rests on four pillars:

1. Consolidated Configuration and Model Management

Vendor-specific configurations are consolidated into a single oma-config.yaml. This file defines model presets, agent routing, and auto-approval policies. The system includes a model management suite that diffs the local registry against external sources like OpenRouter and Cursor's model list.

Architecture Decision: Centralizing configuration allows for atomic updates and migration safety. Legacy mappings are auto-migrated via versioned migration scripts (e.g., migration 008), ensuring backward compatibility while enforcing modern standards.
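
The registry diff described above reduces to a set comparison between the local registry and the external source. The sketch below assumes both registries flatten to plain name-to-metadata mappings; oma's real model:check output format may differ.

```python
# Hypothetical registry diff; model names and the "ctx" field are illustrative.
def diff_registry(local: dict[str, dict], remote: dict[str, dict]) -> dict[str, list[str]]:
    """Compare a local model registry against an external source such as
    OpenRouter or Cursor's model list."""
    local_names, remote_names = set(local), set(remote)
    return {
        "missing_locally": sorted(remote_names - local_names),
        "unknown_remotely": sorted(local_names - remote_names),
        "changed": sorted(
            name for name in local_names & remote_names if local[name] != remote[name]
        ),
    }

local = {"example-model": {"ctx": 128_000}, "example-legacy-model": {"ctx": 8_000}}
remote = {"example-model": {"ctx": 128_000}, "example-new-model": {"ctx": 200_000}}
delta = diff_registry(local, remote)
# → {'missing_locally': ['example-new-model'],
#    'unknown_remotely': ['example-legacy-model'], 'changed': []}
```

Diff output like this is what makes migrations auditable: a versioned migration script can consume the delta and apply it atomically.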

Configuration Example:

```yaml
# oma-config.yaml
version: "2.0"
model_preset: "production-optimized"

agents:
  cursor:
    vendor: "cursor"
    routing: "composer-2"
    auto_approve: true
    preset: "cursor-only"

  codex:
    vendor: "codex"
    cli_guard: true
    arg_terminator: "--"

security:
  path_traversal_protection: true
  mime_validation: true
  nfkc_normalization: true

observability:
  auto_update_cli: true
  benchmark_rounds: 3
```


2. Dynamic Skill Registration

Skills are no longer hardcoded. The system scans .agents/skills/ directories at build time, reading frontmatter to auto-register capabilities. This enables modular expansion without touching core logic.

Skill Definition Example:

```yaml
# .agents/skills/oma-deepsec/skill.yaml
name: "oma-deepsec"
description: "Vercel deepsec driver integration for security auditing"
trigger_keywords:
  - "/deepsec"
  - "security-audit"
languages:
  - typescript
  - javascript
  - python
execution:
  command: "oma run deepsec"
  requires_auth: false
```

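The scan-and-register step can be sketched in a few lines. This is an illustrative reconstruction, not oma's build code: the directory layout matches the example above, but the tiny hand-rolled parser (used here to keep the sketch dependency-free) handles only flat `key: "value"` lines, where a real build would use a YAML library.

```python
import tempfile
from pathlib import Path

def parse_flat_yaml(text: str) -> dict[str, str]:
    """Parse only top-level `key: "value"` lines; nested keys are skipped."""
    meta = {}
    for line in text.splitlines():
        if ":" in line and not line.startswith((" ", "#", "-")):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip('"')
    return meta

def register_skills(root: Path) -> dict[str, dict[str, str]]:
    """Scan <root>/*/skill.yaml and auto-register each skill by name."""
    registry = {}
    for skill_file in sorted(root.glob("*/skill.yaml")):
        meta = parse_flat_yaml(skill_file.read_text())
        if "name" in meta:  # a skill without a name cannot be routed to
            registry[meta["name"]] = meta
    return registry

# Demo: materialize one skill definition in a scratch directory and scan it.
root = Path(tempfile.mkdtemp())
skill_dir = root / "oma-custom"
skill_dir.mkdir()
(skill_dir / "skill.yaml").write_text('name: "oma-custom"\ndescription: "demo skill"\n')
registry = register_skills(root)
```

Because registration is driven entirely by what exists on disk, adding a capability is a file drop, not a code change.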
3. Security Hardening and Input Validation

The orchestration layer enforces strict input validation:

  • NFKC Normalization: Prevents hook bypasses using fullwidth Unicode characters (e.g., ｐａｒａｌｌｅｌ is normalized to parallel before evaluation).
  • Path Traversal Protection: Output and reference paths are validated against traversal patterns.
  • Magic-Byte Validation: Reference images are verified via MIME type detection, not just file extensions.
  • CLI Invocation Guard: A two-tier guard ensures commands like claude review this code route to workflows, while claude exec --foo is blocked from unintended execution.
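
The first three guards above fit in a short sketch. The function names and the sandbox-root convention are assumptions for illustration; only the techniques themselves (NFKC folding, canonical path resolution, magic-byte checks) come from the list above.

```python
import tempfile
import unicodedata
from pathlib import Path

def normalize_input(text: str) -> str:
    """NFKC-fold lookalike characters (e.g. fullwidth letters) before hook checks."""
    return unicodedata.normalize("NFKC", text)

def is_safe_output_path(candidate: str, sandbox: Path) -> bool:
    """Reject --out/--reference values that resolve outside the sandbox root."""
    resolved = (sandbox / candidate).resolve()
    return resolved.is_relative_to(sandbox.resolve())

# Magic bytes for the two most common reference-image formats.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"

def looks_like_image(data: bytes) -> bool:
    """Check leading magic bytes instead of trusting the file extension."""
    return data.startswith(PNG_MAGIC) or data.startswith(JPEG_MAGIC)

sandbox = Path(tempfile.mkdtemp())
# Fullwidth "ｐａｒａｌｌｅｌ" folds to plain "parallel" under NFKC,
# so a keyword detector that runs after normalization cannot be bypassed.
assert normalize_input("ｐａｒａｌｌｅｌ") == "parallel"
```

Note the ordering: normalization must run before any keyword or pattern check, and path validation must operate on the resolved path, not the raw argument string.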

4. Cross-Vendor Routing and Windows Compatibility

Cursor is promoted to a first-class vendor with dedicated routing (composer-2) and auto-approval presets. Windows support is robust, handling junction and hardlink fallbacks when symlinks fail with EPERM. Path separators are normalized across all IO layers, including gitignore handling and doctor diagnostics.
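
The fallback chain (symlink, then junction, then hardlink) can be sketched as below. This is an illustrative sketch, not oma's implementation; the junction branch uses CPython's Windows-only `_winapi.CreateJunction` helper and is never reached on POSIX systems, where the initial symlink succeeds.

```python
import os
from pathlib import Path

def link_with_fallback(src: Path, dst: Path) -> str:
    """Create dst pointing at src, degrading gracefully on Windows EPERM."""
    try:
        os.symlink(src, dst, target_is_directory=src.is_dir())
        return "symlink"
    except OSError:
        # Typically EPERM on Windows without Developer Mode or admin rights.
        if os.name == "nt" and src.is_dir():
            import _winapi  # CPython's Windows-only junction helper
            _winapi.CreateJunction(str(src), str(dst))
            return "junction"
        os.link(src, dst)  # hardlink fallback for regular files
        return "hardlink"

# Demo in a scratch directory.
import tempfile
root = Path(tempfile.mkdtemp())
(root / "skills").mkdir()
kind = link_with_fallback(root / "skills", root / "skills-link")
```

Returning which mechanism was used is deliberate: doctor-style diagnostics can then report degraded link modes instead of failing silently.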

Pitfall Guide

| Pitfall | Explanation | Fix |
| --- | --- | --- |
| NFKC Bypass | Attackers use fullwidth Unicode to bypass keyword detectors, e.g. ｐａｒａｌｌｅｌ evades a regex matching parallel. | Implement NFKC normalization on all input strings before hook evaluation. |
| Path Traversal | Malicious --out or --reference arguments can write files outside the intended directory. | Validate paths against traversal patterns. Use canonical path resolution. |
| Variadic Arg Swallowing | CLI tools like codex exec -i may treat the prompt as a second reference image if arguments are variadic. | Terminate variable arguments with -- before the instruction string. |
| Hardcoded Skill Maps | Maintaining a static list of skills requires code changes for every new capability. | Use frontmatter-based auto-registration in .agents/skills/. |
| Windows Symlink EPERM | Symlinks often fail on Windows due to permission restrictions. | Implement fallback logic to use junctions or hardlinks when symlinks raise EPERM. |
| Single-Shot Benchmarking | Evaluating agents on a single run produces noisy, unreliable metrics. | Use multi-judge averaging across multiple rounds (e.g., 3 rounds, 3 judges). |
| Translation Drift | i18n files diverge across locales, leading to missing or incorrect translations. | Run oma docs i18n to detect drift. Use oma docs lint to flag em-dashes and placeholders. |
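
The variadic-argument fix is easiest to see in code. The argv builder below is hypothetical (it is not codex's real flag set); it only demonstrates why the terminator matters: without "--", a variadic -i flag would swallow the prompt as one more reference image.

```python
def build_exec_argv(references: list[str], prompt: str) -> list[str]:
    """Build an agent-exec command line with the variadic list safely terminated."""
    argv = ["codex", "exec"]
    if references:
        argv += ["-i", *references]  # variadic: consumes arguments until stopped
    argv += ["--", prompt]  # "--" ends option parsing; the prompt stays positional
    return argv

argv = build_exec_argv(["mock.png"], "implement the save button")
# The prompt sits after "--", so the variadic -i list stops at "mock.png".
```

This convention ("--" ends option parsing) is standard POSIX CLI behavior, which is why it works as a guard across vendors.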

Production Bundle

Action Checklist

  • Install oma: Use curl for macOS/Linux or irm for Windows. Verify installation with oma doctor.
  • Initialize Config: Run oma init to generate oma-config.yaml. Review model presets and agent routing.
  • Enable Skills: Place skill definitions in .agents/skills/. Verify auto-registration via oma skills list.
  • Run Security Audit: Execute oma deepsec to scan for path traversal and input validation gaps.
  • Benchmark Workflow: Run oma benchmark to measure performance against the 5-axis model. Compare scores to baseline.
  • Check Model Registry: Run oma model:check to diff local models against OpenRouter/Cursor. Apply patches if needed.
  • Validate i18n: Run oma docs i18n to detect translation drift. Fix flagged issues with oma docs lint.
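
At its core, the i18n drift check from the last item is a key-set comparison against the base locale. The toy version below assumes locale files flatten to key-to-string mappings; oma docs i18n presumably does something richer than this.

```python
def find_drift(base: dict[str, str], locale: dict[str, str]) -> dict[str, list[str]]:
    """Report keys missing from a locale and keys no longer in the base locale."""
    return {
        "missing": sorted(set(base) - set(locale)),    # untranslated strings
        "orphaned": sorted(set(locale) - set(base)),   # stale translations
    }

# Hypothetical flattened locale files; keys and strings are illustrative.
en = {"cta.save": "Save", "cta.cancel": "Cancel"}
ja = {"cta.save": "保存"}
drift = find_drift(en, ja)
# → {'missing': ['cta.cancel'], 'orphaned': []}
```

Running a check like this in CI is what keeps locale files from diverging silently between releases.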

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| Multi-Vendor Team | Use oma with consolidated config | Standardizes behavior across Cursor, Codex, etc. Reduces drift. | Low setup cost; high ROI on consistency. |
| Security-First Project | Enable oma-deepsec and path validation | Prevents path traversal and input bypasses. | Minimal overhead; critical risk reduction. |
| Benchmarking Agents | Use oma's multi-judge benchmark | Provides reliable, reproducible metrics. | Requires compute for 3 rounds; accurate results. |
| Windows Development | Use install.ps1 and junction fallbacks | Handles symlink limitations gracefully. | No cost; ensures cross-platform compatibility. |
| Rapid Prototyping | Use oma-skill-creator | Generates new skills quickly from templates. | Accelerates development; maintains structure. |

Configuration Template

```yaml
# oma-config.yaml
# Production-ready configuration template

version: "2.0"
model_preset: "high-fidelity"

agents:
  cursor:
    vendor: "cursor"
    routing: "composer-2"
    auto_approve: true
    preset: "cursor-only"

  codex:
    vendor: "codex"
    cli_guard: true
    arg_terminator: "--"
    variadic_fix: true

security:
  path_traversal_protection: true
  mime_validation: true
  nfkc_normalization: true
  magic_byte_check: true

observability:
  auto_update_cli: true
  benchmark_rounds: 3
  benchmark_judges: 3

skills:
  auto_register: true
  directory: ".agents/skills"

docs:
  i18n_locales:
    - en
    - ja
    - zh
  lint_rules:
    - em_dash_cjk
    - placeholder_validation
```
Quick Start Guide

1. Install:

```shell
# macOS / Linux
curl -fsSL https://raw.githubusercontent.com/first-fluke/oh-my-agent/main/cli/install.sh | bash

# Windows
irm https://raw.githubusercontent.com/first-fluke/oh-my-agent/main/cli/install.ps1 | iex
```

2. Initialize:

```shell
oma init
```

3. Verify:

```shell
oma doctor
```

4. Run Benchmark:

```shell
oma benchmark
```

5. Deploy Skill:

```shell
mkdir -p .agents/skills/oma-custom
# Add skill.yaml to the directory, then confirm auto-registration:
oma skills list
```

This orchestration layer transforms fragmented agent usage into a standardized, secure, and measurable workflow. By consolidating configuration, enforcing security boundaries, and providing reliable benchmarks, teams can achieve consistent, production-grade results across all AI coding agents.