# oh-my-agent: 9 new skills, Cursor as a first-class vendor, 80/100 benchmark

*Unified Agent Orchestration: Benchmarking, Security, and Cross-Vendor Consistency*
## Current Situation Analysis
The proliferation of AI coding agents has introduced a critical operational challenge: vendor fragmentation and behavioral drift. Engineering teams rarely standardize on a single agent tool. Instead, they run parallel workflows across Cursor, Codex, and other CLI-based agents. This heterogeneity creates immediate pain points:
- Inconsistent Scaffolding: Agents frequently diverge on project initialization. One vendor might scaffold a Next.js app with an outdated version, ignore existing linting configurations, or generate UI components (like save buttons) without implementing the underlying storage logic.
- Security Surface Expansion: Ad-hoc agent usage often bypasses security controls. Path traversal vulnerabilities in output arguments, lack of input validation on reference files, and susceptibility to character normalization attacks (e.g., fullwidth Unicode bypasses) expose repositories to risk.
- Flawed Evaluation: Most teams rely on single-shot benchmarks to evaluate agent performance. These metrics are statistically noisy and fail to capture reliability across functional correctness, specification adherence, and engineering efficiency.
This problem is often overlooked because teams treat agents as isolated utilities rather than components of a unified orchestration layer. Without a control plane, drift accumulates silently. Recent data from the oh-my-agent (oma) project highlights the severity: unmanaged agents score significantly lower on comprehensive benchmarks compared to orchestrated workflows, and security gaps in CLI argument parsing remain common across vendor implementations.
## WOW Moment: Key Findings
The most significant insight from recent benchmarking efforts is the performance delta between unified orchestration and raw vendor CLIs. The benchmark methodology itself is a differentiator: it uses a 5-axis evaluation model with multi-judge averaging across three rounds, smoothing out the variance inherent in single-shot testing.
| Agent Framework | Functional | Spec | Visual | Engineering | Efficiency | Total Score |
|---|---|---|---|---|---|---|
| oma (Unified) | 35 | 15 | 20 | 20 | 10 | 80.6 |
| omc | 30 | 12 | 16 | 14 | 8 | 74.1 |
| superpowers | 28 | 11 | 17 | 14 | 7 | 72.9 |
| vanilla | 27 | 10 | 15 | 13 | 6 | 70.7 |
| ecc | 26 | 10 | 14 | 13 | 7 | 70.2 |
**Why this matters:**
- Reliability over Peak Performance: The multi-judge approach proves that oma delivers consistent results, not just lucky single-run outputs.
- Engineering Efficiency: The 20-point engineering axis rewards architectural decisions, security hardening, and maintainability. oma's lead here indicates it produces production-ready code, not just functional snippets.
- Actionable Intelligence: The benchmark scores correlate directly with reduced technical debt. Teams using orchestrated workflows spend less time fixing agent-induced drift and security issues.
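The aggregation behind these numbers can be sketched as follows. The axis names mirror the table above; the helper function and the sample scores are illustrative, not oma's actual scoring code.

```python
from statistics import mean

# The five evaluation axes of the benchmark's 5-axis model.
AXES = ["functional", "spec", "visual", "engineering", "efficiency"]

def aggregate_scores(rounds):
    """Average each axis over every (round, judge) sample.

    `rounds` is a list of rounds; each round is a list of per-judge
    dicts mapping axis name -> points awarded.
    """
    samples = [judge for rnd in rounds for judge in rnd]
    result = {axis: mean(j[axis] for j in samples) for axis in AXES}
    # The total is the sum of the five per-axis means.
    result["total"] = sum(result[axis] for axis in AXES)
    return result

# 3 rounds x 3 judges; the numbers here are purely illustrative.
rounds = [
    [{"functional": 35, "spec": 15, "visual": 20, "engineering": 20, "efficiency": 10}] * 3,
    [{"functional": 33, "spec": 14, "visual": 19, "engineering": 19, "efficiency": 9}] * 3,
    [{"functional": 34, "spec": 15, "visual": 20, "engineering": 19, "efficiency": 10}] * 3,
]
print(aggregate_scores(rounds))
```

Averaging over all nine (round, judge) samples is what makes a single lucky run unable to dominate the total.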
## Core Solution
To address fragmentation, drift, and security, oh-my-agent implements a control plane that standardizes agent interaction, enforces security boundaries, and provides observable metrics. The architecture rests on four pillars:
#### 1. Consolidated Configuration and Model Management
Vendor-specific configurations are consolidated into a single oma-config.yaml. This file defines model presets, agent routing, and auto-approval policies. The system includes a model management suite that diffs the local registry against external sources like OpenRouter and Cursor's model list.
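At its core, a registry diff like this reduces to set arithmetic over model IDs. A minimal sketch (the function name and output keys are assumptions, not oma's actual `model:check` format):

```python
def diff_registry(local, remote):
    """Report which model IDs to add, drop, or keep when syncing the
    local registry against an external list (e.g., OpenRouter).
    Illustrative sketch; oma's real diff output may differ."""
    local, remote = set(local), set(remote)
    return {
        "missing_locally": sorted(remote - local),   # candidates to add
        "stale_locally": sorted(local - remote),     # candidates to retire
        "in_sync": sorted(local & remote),           # already consistent
    }

print(diff_registry(["model-a", "model-b"], ["model-a", "model-c"]))
```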
Architecture Decision: Centralizing configuration allows for atomic updates and migration safety. Legacy mappings are auto-migrated via versioned migration scripts (e.g., migration 008), ensuring backward compatibility while enforcing modern standards.
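The versioned-migration pattern can be sketched as ordered steps gated by a schema version. The migration numbers and field renames below are hypothetical; only the pattern (each step applied in order, exactly once) reflects the text.

```python
# Hypothetical config migrations: each upgrades the config one schema
# version, so a legacy config replays only the steps it is missing.
def migrate_007(cfg):
    # Illustrative legacy-key rename.
    if "model" in cfg:
        cfg["model_preset"] = cfg.pop("model")
    return cfg

def migrate_008(cfg):
    # Illustrative: ensure the modern `security` mapping exists.
    cfg.setdefault("security", {}).setdefault("nfkc_normalization", True)
    return cfg

MIGRATIONS = {7: migrate_007, 8: migrate_008}

def migrate(cfg):
    version = cfg.get("schema_version", 0)
    for n in sorted(MIGRATIONS):
        if n > version:            # skip steps already applied
            cfg = MIGRATIONS[n](cfg)
            cfg["schema_version"] = n
    return cfg

print(migrate({"model": "production-optimized", "schema_version": 6}))
```

Because each step records the version it produced, re-running the migrator is idempotent.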
**Configuration Example:**

```yaml
# oma-config.yaml
version: "2.0"
model_preset: "production-optimized"
agents:
  cursor:
    vendor: "cursor"
    routing: "composer-2"
    auto_approve: true
    preset: "cursor-only"
  codex:
    vendor: "codex"
    cli_guard: true
    arg_terminator: "--"
security:
  path_traversal_protection: true
  mime_validation: true
  nfkc_normalization: true
observability:
  auto_update_cli: true
  benchmark_rounds: 3
```
#### 2. Dynamic Skill Registration
Skills are no longer hardcoded. The system scans `.agents/skills/` directories at build time, reading frontmatter to auto-register capabilities. This enables modular expansion without touching core logic.
**Skill Definition Example:**
```yaml
# .agents/skills/oma-deepsec/skill.yaml
name: "oma-deepsec"
description: "Vercel deepsec driver integration for security auditing"
trigger_keywords:
  - "/deepsec"
  - "security-audit"
languages:
  - typescript
  - javascript
  - python
execution:
  command: "oma run deepsec"
  requires_auth: false
```
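Auto-registration of manifests like the one above can be sketched as a directory scan. The naive key-value parser and function names here are illustrative; a real loader would use a proper YAML library.

```python
import os
import re

def parse_flat_yaml(text):
    """Naive parser for top-level `key: "value"` pairs only
    (illustrative; nested keys and lists are ignored)."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r'^(\w+):\s*"?([^"#]*?)"?\s*$', line)
        if m and m.group(2):
            fields[m.group(1)] = m.group(2)
    return fields

def discover_skills(root=".agents/skills"):
    """Scan skill directories and register each one by its `name`."""
    registry = {}
    if not os.path.isdir(root):
        return registry
    for entry in sorted(os.listdir(root)):
        manifest = os.path.join(root, entry, "skill.yaml")
        if os.path.isfile(manifest):
            with open(manifest, encoding="utf-8") as f:
                meta = parse_flat_yaml(f.read())
            if "name" in meta:
                registry[meta["name"]] = meta
    return registry
```

Because the registry is rebuilt from whatever lives under `.agents/skills/`, adding a capability is a file drop, not a code change.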
#### 3. Security Hardening and Input Validation
The orchestration layer enforces strict input validation:
- NFKC Normalization: Prevents hook bypasses using fullwidth Unicode characters (e.g., `ｐａｒａｌｌｅｌ` is normalized to `parallel` before evaluation).
- Path Traversal Protection: Output and reference paths are validated against traversal patterns.
- Magic-Byte Validation: Reference images are verified via MIME type detection, not just file extensions.
- CLI Invocation Guard: A two-tier guard ensures commands like `claude review this code` route to workflows, while `claude exec --foo` is blocked from unintended execution.
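The first three checks can be sketched in a few lines of stdlib Python; the function names are illustrative, not oma's actual API.

```python
import os
import unicodedata

def normalize_command(text):
    """NFKC-fold input so fullwidth lookalikes can't slip past
    keyword hooks before evaluation."""
    return unicodedata.normalize("NFKC", text)

def safe_output_path(base_dir, user_path):
    """Reject --out/--reference paths that escape base_dir by
    comparing canonical (resolved) paths."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, user_path))
    if os.path.commonpath([base, target]) != base:
        raise ValueError(f"path traversal blocked: {user_path}")
    return target

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def is_png(path):
    """Magic-byte check: trust file contents, not the extension."""
    with open(path, "rb") as f:
        return f.read(8) == PNG_MAGIC

# Fullwidth Unicode folds to plain ASCII under NFKC.
assert normalize_command("ｐａｒａｌｌｅｌ") == "parallel"
```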
#### 4. Cross-Vendor Routing and Windows Compatibility
Cursor is promoted to a first-class vendor with dedicated routing (`composer-2`) and auto-approval presets. Windows support is robust, handling junction and hardlink fallbacks when symlinks fail with `EPERM`. Path separators are normalized across all IO layers, including gitignore handling and doctor diagnostics.
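The fallback chain for files can be sketched as follows (the function name is illustrative; real junction creation for directories is Windows-specific and omitted here):

```python
import os

def link_with_fallback(src, dst):
    """Prefer a symlink; if the OS refuses (e.g., EPERM on Windows
    without developer mode), fall back to a hardlink. Returns which
    mechanism succeeded."""
    try:
        os.symlink(src, dst)
        return "symlink"
    except OSError:
        os.link(src, dst)  # hardlink fallback (same-volume files only)
        return "hardlink"
```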
## Pitfall Guide
| Pitfall | Explanation | Fix |
|---|---|---|
| NFKC Bypass | Attackers use fullwidth Unicode to bypass keyword detectors. E.g., `ｐａｒａｌｌｅｌ` evades a regex matching `parallel`. | Implement NFKC normalization on all input strings before hook evaluation. |
| Path Traversal | Malicious --out or --reference arguments can write files outside the intended directory. | Validate paths against traversal patterns. Use canonical path resolution. |
| Variadic Arg Swallowing | CLI tools like codex exec -i may treat the prompt as a second reference image if arguments are variadic. | Terminate variable arguments with -- before the instruction string. |
| Hardcoded Skill Maps | Maintaining a static list of skills requires code changes for every new capability. | Use frontmatter-based auto-registration in .agents/skills/. |
| Windows Symlink EPERM | Symlinks often fail on Windows due to permission restrictions. | Implement fallback logic to use junctions or hardlinks when symlinks raise EPERM. |
| Single-Shot Benchmarking | Evaluating agents on a single run produces noisy, unreliable metrics. | Use multi-judge averaging across multiple rounds (e.g., 3 rounds, 3 judges). |
| Translation Drift | i18n files diverge across locales, leading to missing or incorrect translations. | Run oma docs i18n to detect drift. Use oma docs lint to flag em-dashes and placeholders. |
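The variadic-argument pitfall from the table can be reproduced with stdlib `argparse`; the `-i` flag and program name are modeled on the `codex exec -i` example, not taken from codex itself.

```python
import argparse

# `-i` accepts one or more reference images. Without a `--` terminator,
# a greedy nargs="+" option would swallow the trailing prompt too.
parser = argparse.ArgumentParser(prog="codex-exec-demo")
parser.add_argument("-i", "--image", nargs="+", default=[])
parser.add_argument("prompt")

# `--` stops option parsing, so the prompt survives as a positional.
args = parser.parse_args(["-i", "ref.png", "--", "add a save button"])
print(args.image, args.prompt)
```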
## Production Bundle
### Action Checklist
- Install oma: Use `curl` for macOS/Linux or `irm` for Windows. Verify installation with `oma doctor`.
- Initialize Config: Run `oma init` to generate `oma-config.yaml`. Review model presets and agent routing.
- Enable Skills: Place skill definitions in `.agents/skills/`. Verify auto-registration via `oma skills list`.
- Run Security Audit: Execute `oma deepsec` to scan for path traversal and input validation gaps.
- Benchmark Workflow: Run `oma benchmark` to measure performance against the 5-axis model. Compare scores to baseline.
- Check Model Registry: Run `oma model:check` to diff local models against OpenRouter/Cursor. Apply patches if needed.
- Validate i18n: Run `oma docs i18n` to detect translation drift. Fix flagged issues with `oma docs lint`.
### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Multi-Vendor Team | Use oma with consolidated config | Standardizes behavior across Cursor, Codex, etc. Reduces drift. | Low setup cost; high ROI on consistency. |
| Security-First Project | Enable oma-deepsec and path validation | Prevents path traversal and input bypasses. | Minimal overhead; critical risk reduction. |
| Benchmarking Agents | Use oma's multi-judge benchmark | Provides reliable, reproducible metrics. | Requires compute for 3 rounds; accurate results. |
| Windows Development | Use install.ps1 and junction fallbacks | Handles symlink limitations gracefully. | No cost; ensures cross-platform compatibility. |
| Rapid Prototyping | Use oma-skill-creator | Generates new skills quickly from templates. | Accelerates development; maintains structure. |
### Configuration Template
```yaml
# oma-config.yaml
# Production-ready configuration template
version: "2.0"
model_preset: "high-fidelity"
agents:
  cursor:
    vendor: "cursor"
    routing: "composer-2"
    auto_approve: true
    preset: "cursor-only"
  codex:
    vendor: "codex"
    cli_guard: true
    arg_terminator: "--"
    variadic_fix: true
security:
  path_traversal_protection: true
  mime_validation: true
  nfkc_normalization: true
  magic_byte_check: true
observability:
  auto_update_cli: true
  benchmark_rounds: 3
  benchmark_judges: 3
skills:
  auto_register: true
  directory: ".agents/skills"
docs:
  i18n_locales:
    - en
    - ja
    - zh
  lint_rules:
    - em_dash_cjk
    - placeholder_validation
```
### Quick Start Guide
- Install:

  ```bash
  # macOS / Linux
  curl -fsSL https://raw.githubusercontent.com/first-fluke/oh-my-agent/main/cli/install.sh | bash

  # Windows
  irm https://raw.githubusercontent.com/first-fluke/oh-my-agent/main/cli/install.ps1 | iex
  ```

- Initialize: `oma init`
- Verify: `oma doctor`
- Run Benchmark: `oma benchmark`
- Deploy Skill:

  ```bash
  mkdir -p .agents/skills/oma-custom
  # Add skill.yaml to the directory, then verify registration:
  oma skills list
  ```
This orchestration layer transforms fragmented agent usage into a standardized, secure, and measurable workflow. By consolidating configuration, enforcing security boundaries, and providing reliable benchmarks, teams can achieve consistent, production-grade results across all AI coding agents.
