GPT-5.5 is OpenAI's best model. But paying more for it makes no sense.
Current Situation Analysis
Production agentic workflows face a critical misalignment between raw model capability and operational economics. Traditional benchmarking relies on zero-shot/baseline scores, which fail to capture real-world performance where domain-specific skills (`SKILL.md`) dominate task execution. This creates a false premium on newer model releases, as raw capability gains (e.g., GPT-5.5's 75.6 baseline vs GPT-5.4's 74.1) flatten to near-parity (~89.3–89.4) when structured skills are injected into the context window.
Furthermore, failure modes in model selection often stem from three systemic issues:
- Token Bloat Inflation: Older or transitional architectures (e.g., GPT-5.3) exhibit unoptimized token generation patterns, inflating per-run costs without delivering proportional score improvements.
- Latency-Cost Blind Spots: Engineering teams frequently prioritize raw throughput or score deltas while ignoring inference latency constraints, leading to suboptimal routing decisions for time-sensitive agents.
- Subjective Evaluation Noise: Relying on LLM-as-a-judge or open-ended grading introduces variance that masks true capability gaps. Binary, rubric-driven evaluation is required to isolate skill-augmented lift from baseline noise.
Traditional cost-per-token pricing models are insufficient for agentic systems where task completion time, context window utilization, and deterministic rubric scoring dictate actual TCO (Total Cost of Ownership).
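A first-order TCO proxy is score-per-dollar: divide the skill-adjusted rubric score by the all-in cost of one run. A minimal sketch of that metric (the `ModelRun` type and helper are illustrative, not any vendor's API):

```typescript
// Illustrative sketch: rank models by cost-per-score rather than cost-per-token.
// Field names are assumptions for this post, not part of a real SDK.
interface ModelRun {
  model: string;
  skillAdjustedScore: number; // averaged rubric score, 0-100
  costPerRun: number;         // USD, all input + output tokens for one scenario run
  avgLatencySeconds?: number; // wall-clock time per run, when measured
}

// Score-per-dollar: the primary decision vector once skills flatten raw capability.
const scorePerDollar = (run: ModelRun): number =>
  run.skillAdjustedScore / run.costPerRun;
```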
WOW Moment: Key Findings
Benchmarking 1,742 tests across 45 task scenarios, 11 engineering skills, and 6 averaged runs per scenario reveals a clear efficiency frontier. The data demonstrates that skill-augmented performance compresses raw capability gaps, making cost-per-score and latency the primary decision vectors.
| Model | Skill-Adjusted Score | Cost/Run | Score/$ | Avg Latency | Avg Skill Lift |
|---|---|---|---|---|---|
| GPT-5.5 | 89.4 | $0.49 | 182 | 89.5s | +13.8 |
| GPT-5.4 | 89.3 | $0.30 | 298 | 135.4s | +15.2 |
| GPT-5.3-Codex | 83.9 | $0.44 | 191 | N/A | +18.4 |
| Claude Opus 4-7 | 93.4 | $1.00 | 93 | N/A | +12.6 |
| Cursor Composer-2 | 89.6 | $0.23 | 389 | N/A | +15.4 |
Key Findings:
- Functional Parity: GPT-5.5 and GPT-5.4 score within 0.1 points of each other (89.4 vs 89.3) when domain skills are loaded, rendering raw capability differences operationally irrelevant for skill-augmented tasks.
- Cost Premium Justification: GPT-5.5 commands a 63% price premium ($0.49 vs $0.30) for identical output quality. Its only defensible advantage is latency reduction (~45s faster per run).
- Token Bloat Penalty: GPT-5.3-Codex scores 5.4 points lower than GPT-5.4 while costing 47% more per run, confirming unoptimized token generation inflates costs without improving outcomes.
- ROI Sweet Spot: Cursor Composer-2 delivers the highest Score/$ (389) at $0.23/run, while GPT-5.4 remains the optimal OpenAI choice for cost-constrained deployments.
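Continuing the sketch from earlier (reusing `ModelRun` and `scorePerDollar`), the Score/$ column falls out of the table directly; computed values reproduce the table up to rounding:

```typescript
// Per-run figures from the table above.
const runs: ModelRun[] = [
  { model: "GPT-5.5",           skillAdjustedScore: 89.4, costPerRun: 0.49, avgLatencySeconds: 89.5 },
  { model: "GPT-5.4",           skillAdjustedScore: 89.3, costPerRun: 0.30, avgLatencySeconds: 135.4 },
  { model: "GPT-5.3-Codex",     skillAdjustedScore: 83.9, costPerRun: 0.44 },
  { model: "Claude Opus 4-7",   skillAdjustedScore: 93.4, costPerRun: 1.00 },
  { model: "Cursor Composer-2", skillAdjustedScore: 89.6, costPerRun: 0.23 },
];

// Sort descending by efficiency: Cursor Composer-2 leads (~390), Claude Opus 4-7 trails (~93).
const frontier = runs
  .map((r) => ({ model: r.model, scorePerDollar: scorePerDollar(r) }))
  .sort((a, b) => b.scorePerDollar - a.scorePerDollar);
console.table(frontier);
```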
Core Solution
The evaluation architecture leverages Tessl, an agentic evaluation platform that isolates skill-augmented lift through deterministic rubric checklists and structured context injection.
Technical Implementation Architecture
- Skill Context Injection: A `SKILL.md` file acts as a structured markdown document containing domain rules, architectural patterns, and implementation examples. During the `with-skill` run, this file is loaded into the agent's context alongside the task prompt. Baseline runs exclude this context.
- Deterministic Rubric Scoring: Evaluation relies on binary pass/fail criteria to eliminate subjective variance. Each criterion maps to an objectively verifiable state (e.g., file existence, dependency presence, CLI flag usage).
- Statistical Averaging Protocol: Each scenario executes 6 independent runs. Outputs are scored against the pre-written rubric, and final metrics are averaged to smooth stochastic variance.
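A minimal harness sketch of this protocol; `RunAgent` and `ScoreRubric` are hypothetical stand-ins for the agent executor and rubric scorer, not Tessl APIs:

```typescript
import { readFile } from "node:fs/promises";

// Placeholder signatures: runAgent executes one scenario, scoreRubric applies the
// binary checklist. Both are assumptions for this sketch.
type RunAgent = (prompt: string, skillContext?: string) => Promise<string>;
type ScoreRubric = (output: string) => number; // sum of binary criteria, 0-100

const RUNS_PER_SCENARIO = 6; // average to smooth stochastic variance

async function evaluateScenario(
  runAgent: RunAgent,
  scoreRubric: ScoreRubric,
  prompt: string,
  skillPath?: string, // omit for the baseline run
): Promise<number> {
  const skill = skillPath ? await readFile(skillPath, "utf8") : undefined;
  let total = 0;
  for (let i = 0; i < RUNS_PER_SCENARIO; i++) {
    // with-skill and baseline runs differ only in the injected context
    total += scoreRubric(await runAgent(prompt, skill));
  }
  return total / RUNS_PER_SCENARIO; // one averaged cell of the results table
}

// Skill lift = with-skill average minus baseline average for the same scenario.
async function skillLift(run: RunAgent, score: ScoreRubric, prompt: string, skillPath: string) {
  const baseline = await evaluateScenario(run, score, prompt);
  const withSkill = await evaluateScenario(run, score, prompt, skillPath);
  return withSkill - baseline;
}
```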
Rubric Checklist Example
The following rubric demonstrates the binary evaluation structure used to score agentic output:
| Criterion | Points | Pass Condition |
|---|---|---|
| neostandard installed | 10 | neostandard present in devDependencies |
| standard uninstalled | 10 | standard absent from devDependencies |
| Flat config file | 10 | eslint.config.js or .mjs exists, not .eslintrc* |
| neostandard in config | 10 | Config imports from neostandard and calls neostandard() |
| lint script uses eslint | 10 | package.json lint script runs eslint ., not neostandard . or standard . |
| migrate command used | 10 | Instructions reference npx neostandard --migrate to generate the config |
| lint:fix script present | 8 | lint:fix script runs eslint . --fix |
| CI uses non-fix run | 8 | CI config runs lint without --fix |
| standard config removed | 8 | No top-level standard key in package.json |
| lint-staged uses eslint | 8 | Pre-commit hook runs eslint --fix, not neostandard or standard |
| eslint@9 installed | 8 | eslint at version 9.x in devDependencies |
Scoring Logic: A model that migrates correctly but leaves `standard` in devDependencies scores 90/100. Creating `eslint.config.js` alongside `.eslintrc.json` instead of replacing it triggers a 0 on three criteria simultaneously. This deterministic approach ensures score deltas reflect genuine agentic capability rather than evaluation noise.
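A minimal sketch of how such binary criteria can be checked mechanically, using the first three rows of the rubric above. The repo layout and package.json shape follow npm conventions; the harness itself is illustrative, not Tessl's implementation:

```typescript
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

interface Criterion {
  name: string;
  points: number;
  pass: (repoDir: string) => boolean; // binary and objectively verifiable, no judge model
}

const readPkg = (repoDir: string) =>
  JSON.parse(readFileSync(join(repoDir, "package.json"), "utf8"));

const criteria: Criterion[] = [
  {
    name: "neostandard installed",
    points: 10,
    pass: (repo) => "neostandard" in (readPkg(repo).devDependencies ?? {}),
  },
  {
    name: "standard uninstalled",
    points: 10,
    pass: (repo) => !("standard" in (readPkg(repo).devDependencies ?? {})),
  },
  {
    name: "Flat config file",
    points: 10,
    // Simplified: a full check would glob every .eslintrc* variant, not just .json.
    pass: (repo) =>
      (existsSync(join(repo, "eslint.config.js")) ||
        existsSync(join(repo, "eslint.config.mjs"))) &&
      !existsSync(join(repo, ".eslintrc.json")),
  },
];

// Sum the points of every passing criterion; failures contribute zero.
const scoreRepo = (repoDir: string): number =>
  criteria.reduce((sum, c) => sum + (c.pass(repoDir) ? c.points : 0), 0);
```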
Pitfall Guide
- Chasing Raw Baseline Capability: GPT-5.5's 75.6 baseline score suggests superior raw capability, but skill-augmented runs compress this to a 0.1-point delta vs GPT-5.4. In production, domain skills dominate task execution; baseline scores are poor proxies for agentic ROI.
- Ignoring Token Bloat Economics: GPT-5.3-Codex costs $0.44/run but scores 83.9 due to inefficient token generation. Higher per-token pricing combined with bloat creates a negative cost-performance curve. Always validate token efficiency alongside raw scores.
- Overlooking Latency-Cost Trade-offs: GPT-5.5 reduces inference time by ~45s (89.5s vs 135.4s). If your agent pipeline is latency-constrained (e.g., real-time IDE suggestions), the 63% cost premium is defensible. For batch or async workflows, the premium is unjustified (see the routing sketch after this list).
- Relying on Subjective Evaluation Metrics: Open-ended grading or LLM-as-a-judge introduces variance that masks true capability gaps. Binary rubric checklists (e.g., "Does it use PKCE method S256?") eliminate noise and ensure reproducible score deltas.
- Assuming Linear Price-to-Performance Scaling: Claude Opus 4-7 achieves the highest score (93.4) but costs $1.00/run (Score/$ = 93). GPT-5.4 and Cursor Composer-2 deliver near-identical performance at 298 and 389 Score/$ respectively. Highest score ≠ best operational fit.
- Neglecting Skill-Augmented Lift Analysis: Evaluating models without domain skills misses the core value proposition of agentic systems. Skills add +12.6 to +18.4 points across the five models tested, flattening raw capability gaps. Always benchmark `baseline` vs `with-skill` to isolate true lift.
- Single-Run Evaluation Variance: Stochastic model behavior causes score fluctuations across individual runs. Averaging 6+ independent runs per scenario is mandatory to establish statistically significant deltas and avoid routing decisions based on outliers.
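The latency-cost trade-off above reduces to a simple routing rule. A sketch using the per-run figures from this benchmark; the latency threshold is a deployment-specific assumption, not part of the eval:

```typescript
// Workload descriptor: how long one agent run is allowed to take end-to-end.
type Workload = { latencyBudgetSeconds: number }; // e.g. real-time IDE vs batch job

function routeOpenAIModel(w: Workload): "gpt-5.5" | "gpt-5.4" {
  // GPT-5.4 averages ~135s per run. If the budget cannot absorb that, pay the
  // 63% premium for GPT-5.5 (~90s); otherwise take the cheaper, equal-quality model.
  return w.latencyBudgetSeconds < 135 ? "gpt-5.5" : "gpt-5.4";
}

routeOpenAIModel({ latencyBudgetSeconds: 10 });  // "gpt-5.5": latency-constrained
routeOpenAIModel({ latencyBudgetSeconds: 600 }); // "gpt-5.4": batch/async, ~39% cheaper per run
```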
Deliverables
- Blueprint: Agentic Evaluation Framework – Complete methodology for structuring `SKILL.md` context injection, designing binary rubric checklists, and executing averaged run protocols to isolate skill-augmented lift from baseline noise.
- Checklist: Model Selection & Cost Optimization – Decision matrix covering Score/$ thresholds, latency tolerance bands, token efficiency validation, and skill-dependency mapping for production agent routing.
- Configuration Templates (a minimal `SKILL.md` skeleton appears below):
  - `SKILL.md` structure template (rules, patterns, examples, versioning)
  - Rubric checklist format (criterion, points, pass condition, binary validation logic)
  - Evaluation run parameters (6-run averaging, context window management, rubric scoring pipeline)
- All templates and full eval results are available via the Tessl registry: `simon/skills`
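For orientation, a minimal `SKILL.md` skeleton matching the structure template above; section names and contents are illustrative, not the registry's canonical format:

```markdown
# SKILL: Migrate standard to neostandard

## Rules
- Remove `standard` from devDependencies; install `neostandard` and `eslint@9`.
- Use flat config (`eslint.config.js` or `.mjs`), never `.eslintrc*`.

## Patterns
- Generate the config with `npx neostandard --migrate`.
- `lint` runs `eslint .`; `lint:fix` runs `eslint . --fix`; CI runs the non-fix script.

## Examples
- Reference flat config: import from `neostandard` and export `neostandard()`.

## Versioning
- v1.0.0 – targets eslint@9 flat config.
```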
