GPT-5.5 is OpenAI's best model. But paying more for it makes no sense.
Current Situation Analysis
Production agentic workflows face a critical misalignment between raw model capability and operational economics. Traditional benchmarking relies on zero-shot/baseline scores, which fail to capture real-world performance where domain-specific skills (`SKILL.md`) dominate task execution. This creates a false premium on newer model releases, as raw capability gains (e.g., GPT-5.5's 75.6 baseline vs GPT-5.4's 74.1) flatten to near-parity (~89.3–89.4) when structured skills are injected into the context window.
Furthermore, failure modes in model selection often stem from three systemic issues:
- Token Bloat Inflation: Older or transitional architectures (e.g., GPT-5.3) exhibit unoptimized token generation patterns, inflating per-run costs without delivering proportional score improvements.
- Latency-Cost Blind Spots: Engineering teams frequently prioritize raw throughput or score deltas while ignoring inference latency constraints, leading to suboptimal routing decisions for time-sensitive agents.
- Subjective Evaluation Noise: Relying on LLM-as-a-judge or open-ended grading introduces variance that masks true capability gaps. Binary, rubric-driven evaluation is required to isolate skill-augmented lift from baseline noise.
Traditional cost-per-token pricing models are insufficient for agentic systems where task completion time, context window utilization, and deterministic rubric scoring dictate actual TCO (Total Cost of Ownership).
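A first-order TCO proxy is score-per-dollar: divide the skill-adjusted rubric score by the all-in cost of one run. A minimal sketch of that metric (the `ModelRun` type and helper are illustrative, not any vendor's API):

```typescript
// Illustrative sketch: rank models by cost-per-score rather than cost-per-token.
// Field names are assumptions for this post, not part of a real SDK.
interface ModelRun {
  model: string;
  skillAdjustedScore: number; // averaged rubric score, 0-100
  costPerRun: number;         // USD, all input + output tokens for one scenario run
  avgLatencySeconds?: number; // wall-clock time per run, when measured
}

// Score-per-dollar: the primary decision vector once skills flatten raw capability.
const scorePerDollar = (run: ModelRun): number =>
  run.skillAdjustedScore / run.costPerRun;
```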
WOW Moment: Key Findings
Benchmarking 1,742 tests across 45 task scenarios, 11 engineering skills, and 6 averaged runs per scenario reveals a clear efficiency frontier. The data demonstrates that skill-augmented performance compresses raw capability gaps, making cost-per-score and latency the primary decision vectors.
| Model | Skill-Adjusted Score | Cost/Run | Score/$ | Avg Latency | Avg Skill Lift |
|---|---|---|---|---|---|
| GPT-5.5 | 89.4 | $0.49 | 182 | 89.5s | +13.8 |
| GPT-5.4 | 89.3 | $0.30 | 298 | 135.4s | +15.2 |
| GPT-5.3-Codex | 83.9 | $0.44 | 191 | N/A | +18.4 |
| Claude Opus 4-7 | 93.4 | $1.00 | 93 | N/A | +12.6 |
| Cursor Composer-2 | 89.6 | $0.23 | 389 | N/A | +15.4 |
Key Findings:
- Functional Parity: GPT-5.5 and GPT-5.4 score within 0.1 points of each other (89.4 vs 89.3) when domain skills are loaded, rendering raw capability differences operationally irrelevant for skill-augmented tasks.
- Cost Premium Justification: GPT-5.5 commands a 63% price premium ($0.49 vs $0.30) for identical output quality. Its only defensible advantage is latency reduction (~45s faster per run).
- Token Bloat Penalty: GPT-5.3-Codex scores 5.4 points lower than GPT-5.4 while costing 47% more per run, confirming unoptimized token generation inflates costs without improving outcomes.
- ROI Sweet Spot: Cursor Composer-2 delivers the highest Score/$ (389) at $0.23/run, while GPT-5.4 remains the optimal OpenAI choice for cost-constrained deployments.
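Continuing the sketch from earlier (reusing `ModelRun` and `scorePerDollar`), the Score/$ column falls out of the table directly; computed values reproduce the table up to rounding:

```typescript
// Per-run figures from the table above.
const runs: ModelRun[] = [
  { model: "GPT-5.5",           skillAdjustedScore: 89.4, costPerRun: 0.49, avgLatencySeconds: 89.5 },
  { model: "GPT-5.4",           skillAdjustedScore: 89.3, costPerRun: 0.30, avgLatencySeconds: 135.4 },
  { model: "GPT-5.3-Codex",     skillAdjustedScore: 83.9, costPerRun: 0.44 },
  { model: "Claude Opus 4-7",   skillAdjustedScore: 93.4, costPerRun: 1.00 },
  { model: "Cursor Composer-2", skillAdjustedScore: 89.6, costPerRun: 0.23 },
];

// Sort descending by efficiency: Cursor Composer-2 leads (~390), Claude Opus 4-7 trails (~93).
const frontier = runs
  .map((r) => ({ model: r.model, scorePerDollar: scorePerDollar(r) }))
  .sort((a, b) => b.scorePerDollar - a.scorePerDollar);
console.table(frontier);
```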
Core Solution
The evaluation architecture leverages Tessl, an agentic evaluation platform that isolates skill-augmented lift through deterministic rubric checklists and structured context injection.
Technical Implementation Architecture
- Skill Context Injection: A `SKILL.md` file acts as a structured markdown document containing domain rules, architectural patterns, and implementation examples. During the `with-skill` run, this file is loaded into the agent's context alongside the task prompt. Baseline runs exclude this context.
- Deterministic Rubric Scoring: Evaluation relies on binary pass/fail criteria to eliminate subjective variance. Each criterion maps to an objectively verifiable state (e.g., file existence, dependency presence, CLI flag usage).
- Statistical Averaging Protocol: Each scenario executes 6 independent runs. Outputs are scored against the pre-written rubric, and final metrics are averaged to smooth stochastic variance.
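A minimal harness sketch of this protocol; `RunAgent` and `ScoreRubric` are hypothetical stand-ins for the agent executor and rubric scorer, not Tessl APIs:

```typescript
import { readFile } from "node:fs/promises";

// Placeholder signatures: runAgent executes one scenario, scoreRubric applies the
// binary checklist. Both are assumptions for this sketch.
type RunAgent = (prompt: string, skillContext?: string) => Promise<string>;
type ScoreRubric = (output: string) => number; // sum of binary criteria, 0-100

const RUNS_PER_SCENARIO = 6; // average to smooth stochastic variance

async function evaluateScenario(
  runAgent: RunAgent,
  scoreRubric: ScoreRubric,
  prompt: string,
  skillPath?: string, // omit for the baseline run
): Promise<number> {
  const skill = skillPath ? await readFile(skillPath, "utf8") : undefined;
  let total = 0;
  for (let i = 0; i < RUNS_PER_SCENARIO; i++) {
    // with-skill and baseline runs differ only in the injected context
    total += scoreRubric(await runAgent(prompt, skill));
  }
  return total / RUNS_PER_SCENARIO; // one averaged cell of the results table
}

// Skill lift = with-skill average minus baseline average for the same scenario.
async function skillLift(run: RunAgent, score: ScoreRubric, prompt: string, skillPath: string) {
  const baseline = await evaluateScenario(run, score, prompt);
  const withSkill = await evaluateScenario(run, score, prompt, skillPath);
  return withSkill - baseline;
}
```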
Rubric Checklist Example
The following rubric demonstrates the binary evaluation structure used to score agentic output:
| Criterion | Points | Pass Condition |
|---|---|---|
| neostandard installed | 10 | neostandard present in devDependencies |
| standard uninstalled | 10 | standard absent from devDependencies |
| Flat config file | 10 | eslint.config.js or .mjs exists, not .eslintrc* |
| neostandard in config | 10 | Config imports from neostandard and calls neostandard() |
| lint script uses eslint | 10 | package.json lint script runs eslint ., not neostandard . or standard . |
| migrate command used | 10 | Instructions reference npx neostandard --migrate to generate the config |
| lint:fix script present | 8 | lint:fix script runs eslint . --fix |
| CI uses non-fix run | 8 | CI config runs lint without --fix |
| standard config removed | 8 | No top-level standard key in package.json |
| lint-staged uses eslint | 8 | Pre-commit hook runs eslint --fix, not neostandard or standard |
| eslint@9 installed | 8 | eslint at version 9.x in devDependencies |
Scoring Logic: A model that migrates correctly but leaves `standard` in devDependencies scores 90/100. Creating `eslint.config.js` alongside `.eslintrc.json` instead of replacing it triggers a 0 on three criteria simultaneously. This deterministic approach ensures score deltas reflect genuine agentic capability rather than evaluation noise.
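A minimal sketch of how such binary criteria can be checked mechanically, using the first three rows of the rubric above. The repo layout and package.json shape follow npm conventions; the harness itself is illustrative, not Tessl's implementation:

```typescript
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

interface Criterion {
  name: string;
  points: number;
  pass: (repoDir: string) => boolean; // binary and objectively verifiable, no judge model
}

const readPkg = (repoDir: string) =>
  JSON.parse(readFileSync(join(repoDir, "package.json"), "utf8"));

const criteria: Criterion[] = [
  {
    name: "neostandard installed",
    points: 10,
    pass: (repo) => "neostandard" in (readPkg(repo).devDependencies ?? {}),
  },
  {
    name: "standard uninstalled",
    points: 10,
    pass: (repo) => !("standard" in (readPkg(repo).devDependencies ?? {})),
  },
  {
    name: "Flat config file",
    points: 10,
    // Simplified: a full check would glob every .eslintrc* variant, not just .json.
    pass: (repo) =>
      (existsSync(join(repo, "eslint.config.js")) ||
        existsSync(join(repo, "eslint.config.mjs"))) &&
      !existsSync(join(repo, ".eslintrc.json")),
  },
];

// Sum the points of every passing criterion; failures contribute zero.
const scoreRepo = (repoDir: string): number =>
  criteria.reduce((sum, c) => sum + (c.pass(repoDir) ? c.points : 0), 0);
```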
Pitfall Guide
- Chasing Raw Baseline Capability: GPT-5.5's 75.6 baseline score suggests superior raw capability, but skill-augmented runs compress this to a 0.1-point delta vs GPT-5.4. In production, domain skills dominate task execution; baseline scores are poor proxies for agentic ROI.
- Ignoring Token Bloat Economics: GPT-5.3-Codex costs $0.44/run but scores 83.9 due to inefficient token generation. Higher per-token pricing combined with bloat creates a negative cost-performance curve. Always validate token efficiency alongside raw scores.
- Overlooking Latency-Cost Trade-offs: GPT-5.5 reduces inference time by ~45s (89.5s vs 135.4s). If your agent pipeline is latency-constrained (e.g., real-time IDE suggestions), the 63% cost premium is defensible. For batch or async workflows, the premium is unjustified (see the routing sketch after this list).
- Relying on Subjective Evaluation Metrics: Open-ended grading or LLM-as-a-judge introduces variance that masks true capability gaps. Binary rubric checklists (e.g., "Does it use PKCE method S256?") eliminate noise and ensure reproducible score deltas.
- Assuming Linear Price-to-Performance Scaling: Claude Opus 4-7 achieves the highest score (93.4) but costs $1.00/run (Score/$ = 93). GPT-5.4 and Cursor Composer-2 deliver near-identical performance at 298 and 389 Score/$ respectively. Highest score ≠ best operational fit.
- Neglecting Skill-Augmented Lift Analysis: Evaluating models without domain skills misses the core value proposition of agentic systems. Skills add +12.6 to +18.4 points across the five models tested, flattening raw capability gaps. Always benchmark `baseline` vs `with-skill` to isolate true lift.
- Single-Run Evaluation Variance: Stochastic model behavior causes score fluctuations across individual runs. Averaging 6+ independent runs per scenario is mandatory to establish statistically significant deltas and avoid routing decisions based on outliers.
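The latency-cost trade-off above reduces to a simple routing rule. A sketch using the per-run figures from this benchmark; the latency threshold is a deployment-specific assumption, not part of the eval:

```typescript
// Workload descriptor: how long one agent run is allowed to take end-to-end.
type Workload = { latencyBudgetSeconds: number }; // e.g. real-time IDE vs batch job

function routeOpenAIModel(w: Workload): "gpt-5.5" | "gpt-5.4" {
  // GPT-5.4 averages ~135s per run. If the budget cannot absorb that, pay the
  // 63% premium for GPT-5.5 (~90s); otherwise take the cheaper, equal-quality model.
  return w.latencyBudgetSeconds < 135 ? "gpt-5.5" : "gpt-5.4";
}

routeOpenAIModel({ latencyBudgetSeconds: 10 });  // "gpt-5.5": latency-constrained
routeOpenAIModel({ latencyBudgetSeconds: 600 }); // "gpt-5.4": batch/async, ~39% cheaper per run
```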
Deliverables
- Blueprint: Agentic Evaluation Framework – Complete methodology for structuring `SKILL.md` context injection, designing binary rubric checklists, and executing averaged run protocols to isolate skill-augmented lift from baseline noise.
- Checklist: Model Selection & Cost Optimization – Decision matrix covering Score/$ thresholds, latency tolerance bands, token efficiency validation, and skill-dependency mapping for production agent routing.
- Configuration Templates (a minimal `SKILL.md` skeleton appears below):
  - `SKILL.md` structure template (rules, patterns, examples, versioning)
  - Rubric checklist format (criterion, points, pass condition, binary validation logic)
  - Evaluation run parameters (6-run averaging, context window management, rubric scoring pipeline)
- All templates and full eval results are available via the Tessl registry: `simon/skills`
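For orientation, a minimal `SKILL.md` skeleton matching the structure template above; section names and contents are illustrative, not the registry's canonical format:

```markdown
# SKILL: Migrate standard to neostandard

## Rules
- Remove `standard` from devDependencies; install `neostandard` and `eslint@9`.
- Use flat config (`eslint.config.js` or `.mjs`), never `.eslintrc*`.

## Patterns
- Generate the config with `npx neostandard --migrate`.
- `lint` runs `eslint .`; `lint:fix` runs `eslint . --fix`; CI runs the non-fix script.

## Examples
- Reference flat config: import from `neostandard` and export `neostandard()`.

## Versioning
- v1.0.0 – targets eslint@9 flat config.
```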
