# Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Unified Evaluation of Sequential Decision Agents: Cross-Paradigm Benchmarks and Performance Insights
### Current Situation Analysis
The landscape of autonomous agent research is currently fractured by paradigm silos. Reinforcement Learning (RL) agents are evaluated on state-vector interactions and reward convergence, while Large Language Model (LLM) agents are tested on text-based reasoning and tool use. Vision-Language Models (VLMs) occupy a third space, and hybrid architectures bridge these gaps. This fragmentation creates a critical evaluation gap: there is no standardized mechanism to compare a gradient-based RL agent against a prompt-based LLM agent, nor can researchers fairly assess whether a foundation model outperforms a human in sequential decision-making.
This problem is often overlooked because benchmark design has historically been tailored to specific model architectures. RL benchmarks assume differentiable or discrete action spaces with dense rewards, while LLM benchmarks focus on static question-answering or single-step tool calls. Consequently, progress claims are frequently non-comparable. A model might score 90% on an LLM-specific benchmark while failing basic planning tasks that an RL agent solves trivially, yet the literature lacks a unified metric to expose these weaknesses.
Recent empirical analysis involving 27 distinct agent configurations and over 90,000 evaluation episodes confirms that isolated metrics are insufficient. The data reveals that no single paradigm dominates across all capability dimensions. Performance varies drastically based on task structure, observation modality, and reasoning requirements. Without a unified framework that normalizes performance against oracle baselines and supports multiple observation types, the field cannot accurately measure convergence toward general sequential decision-making.
### WOW Moment: Key Findings
A comprehensive cross-paradigm evaluation yields counterintuitive insights that challenge common assumptions about agent capabilities. The most significant finding is the divergence between generalist foundation models and specialized RL algorithms, alongside the critical impact of observation formatting and reasoning wrappers.
| Approach | Overall Performance | Planning & Multi-Agent | Reasoning Impact | Observation Modality Preference |
|---|---|---|---|---|
| GPT-5 Mini | 0.309 (Oracle-Normalized) | Moderate | Baseline | ASCII > Natural Language |
| PPO | Lower Overall | Dominant | N/A | State Vector |
| LLM + Reasoning Harness | 3.0x to 10.0x Improvement | High | Critical Multiplier | ASCII > Natural Language |
| Human Baseline | Reference | Reference | N/A | Multi-Modal |
Key Insights:
- No Universal Winner: GPT-5 mini achieves the highest overall oracle-normalized score of 0.309, but PPO completely dominates planning-intensive and multi-agent scenarios. This indicates that foundation models excel at broad generalization while RL algorithms retain superiority in structured optimization.
- Reasoning Multiplier: LLM performance is not static; it is highly dependent on the inference harness. Implementing a reasoning wrapper can multiply LLM effectiveness by 3x to 10x, suggesting that raw model capability is less important than the inference-time architecture.
- Modality Efficiency: Contrary to the intuition that LLMs prefer natural language descriptions, ASCII-based observations consistently outperform natural language text. Structured, token-efficient representations reduce hallucination and improve parsing accuracy in sequential contexts.
- Oracle Normalization is Mandatory: Raw reward scores are incomparable across tasks with different scales. Normalizing against oracle reference policies is the only valid method to aggregate performance across diverse capability categories.
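To make the normalization point concrete, here is a minimal sketch; the task names and reward values are invented purely for illustration. Dividing each raw return by the oracle's return for the same task puts tasks with wildly different reward scales onto one comparable 0-to-1 scale.

```python
# Illustrative only: raw rewards live on different scales per task,
# so averaging them directly is meaningless.
raw_rewards = {"grid_nav": 7.2, "resource_mgmt": 310.0}
oracle_returns = {"grid_nav": 9.0, "resource_mgmt": 480.0}  # best achievable return per task

normalized = {task: raw_rewards[task] / oracle_returns[task] for task in raw_rewards}
print(normalized)  # {'grid_nav': 0.8, 'resource_mgmt': 0.6458...} -- now directly comparable
```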
### Core Solution
To bridge the evaluation gap, a unified benchmark must abstract the agent paradigm while preserving the complexity of sequential decision-making. The solution involves a Gymnasium-compatible interface that standardizes interaction, procedural generation to prevent memorization, and oracle-based normalization for fair comparison.
#### Architecture Decisions
- Gymnasium Standardization: By adhering to the Gymnasium API, the benchmark allows RL agents, LLM agents, VLMs, and hybrid systems to interact via a common `step()` and `reset()` interface. This eliminates the need for custom wrappers for each agent type (a minimal interaction sketch follows this list).
- Procedural Generation: The benchmark includes 37 tasks across six capability categories and four difficulty levels. Tasks are procedurally generated to ensure agents cannot memorize solutions, forcing genuine generalization and sequential reasoning.
- Multi-Modal Observations: The system supports five observation modalities, including ASCII grids, natural language descriptions, images, and state vectors. This enables researchers to test how observation format affects agent performance.
- Oracle Reference Policies: Every task includes an oracle policy that provides the optimal action sequence. Scores are normalized against the oracle to produce a comparable metric across tasks with different reward structures.
- Composable Agent Harness: A modular harness allows researchers to plug in different reasoning strategies, memory modules, and tool-use capabilities without modifying the core environment.
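As a minimal sketch of what the Gymnasium-standardized interaction looks like in practice: the `UnifiedSeqDec-v0` id matches the Quick Start below, and the dict-style observation keyed by modality is an assumption about the benchmark's layout rather than confirmed API.

```python
import gymnasium as gym

# Assumes the benchmark tasks have already been registered under this id.
env = gym.make("UnifiedSeqDec-v0")

obs, info = env.reset(seed=42)  # the procedural seed controls the generated task instance
terminated = truncated = False
episode_return = 0.0

while not (terminated or truncated):
    # A real agent would map obs (e.g. obs["ascii"]) to an action; we sample as a stand-in.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward

env.close()
print(f"Episode return: {episode_return}")
```

The same loop drives a PPO policy, an LLM wrapper, or a human teleoperation client, which is exactly what makes cross-paradigm comparison possible.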
#### Implementation Example
The following example demonstrates a unified evaluation loop. The harness logic is written in TypeScript for type safety, while the underlying environment interaction follows the Gymnasium Python standard required for RL compatibility.
```typescript
// UnifiedAgentHarness.ts
// Type definitions for the benchmark interface
interface Observation {
  ascii: string;
  naturalLanguage: string;
  image?: Uint8Array;
  stateVector?: number[];
}

interface StepResult {
  observation: Observation;
  reward: number;
  terminated: boolean;
  truncated: boolean;
  info: {
    oracleScore: number;
    capabilityCategory: string;
    difficultyLevel: number;
  };
}

interface Agent {
  reset(seed?: number): void;
  act(observation: Observation): number | string;
}

interface EvaluationReport {
  totalEpisodes: number;
  normalizedScore: number;
  categoryBreakdown: Record<string, { sum: number; count: number }>;
  modalityPerformance: Record<string, { sum: number; count: number }>;
}

class BenchmarkEvaluator {
  private env: any; // Gymnasium-compatible environment instance
  private oraclePolicy: any;
  private oracleBaseline: Map<string, number>;

  constructor(env: any, oraclePolicy: any) {
    this.env = env;
    this.oraclePolicy = oraclePolicy;
    this.oracleBaseline = new Map();
  }

  async initialize(): Promise<void> {
    // Compute oracle baselines for normalization. In practice, this runs the
    // oracle policy over multiple episodes to find the max achievable score per task.
    this.oracleBaseline = await this.computeOracleBaselines();
  }

  async evaluateAgent(agent: Agent, episodes: number = 1000): Promise<EvaluationReport> {
    const results: EvaluationReport = {
      totalEpisodes: episodes,
      normalizedScore: 0,
      categoryBreakdown: {},
      modalityPerformance: {}
    };
    let totalNormalizedReward = 0;

    for (let i = 0; i < episodes; i++) {
      agent.reset(i);
      let obs: Observation = this.env.reset();
      let episodeReward = 0;
      let done = false;
      let lastInfo: StepResult["info"] | null = null;

      while (!done) {
        const action = agent.act(obs);
        const stepResult: StepResult = this.env.step(action);
        episodeReward += stepResult.reward;
        done = stepResult.terminated || stepResult.truncated;
        obs = stepResult.observation;
        lastInfo = stepResult.info;
      }

      // Normalize reward against the oracle baseline for this task
      const taskKey = this.env.unwrapped.spec.id;
      const oracleScore = this.oracleBaseline.get(taskKey) || 1.0;
      const normalizedReward = episodeReward / oracleScore;
      totalNormalizedReward += normalizedReward;

      // Aggregate by capability category (modality aggregation is analogous)
      if (lastInfo) {
        this.aggregateMetrics(results, lastInfo, normalizedReward);
      }
    }

    results.normalizedScore = totalNormalizedReward / episodes;
    return results;
  }

  private async computeOracleBaselines(): Promise<Map<string, number>> {
    // Rolls out this.oraclePolicy per task to record the best achievable return; omitted for brevity.
    return new Map();
  }

  private aggregateMetrics(report: EvaluationReport, info: StepResult["info"], score: number): void {
    const category = info.capabilityCategory;
    report.categoryBreakdown[category] ??= { sum: 0, count: 0 };
    report.categoryBreakdown[category].sum += score;
    report.categoryBreakdown[category].count += 1;
  }
}
```
**Rationale:**
* **Type Safety:** The TypeScript interface enforces strict contracts for observations and step results, reducing integration errors when connecting diverse agents.
* **Oracle Normalization:** The `evaluateAgent` method normalizes rewards using pre-computed oracle baselines. This ensures that a score of 0.5 means the same thing regardless of the task's reward scale.
* **Modularity:** The `Agent` interface allows any implementation, whether a PyTorch PPO model or a GPT-5 mini wrapper, to be evaluated without code changes.
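To make the modularity claim concrete on the Python side of the stack, the following duck-typed adapter mirrors the `Agent` contract above. Everything here is a hypothetical sketch (class and method names included), not the benchmark's published API; a PPO policy or a GPT-5 mini wrapper would plug in the same way by implementing `reset()` and `act()`.

```python
import random
from typing import Any, Protocol

class AgentProtocol(Protocol):
    """Duck-typed contract mirroring the TypeScript Agent interface."""
    def reset(self, seed: int | None = None) -> None: ...
    def act(self, observation: Any) -> int | str: ...

class RandomBaselineAgent:
    """Paradigm-agnostic stand-in that samples uniformly from a discrete action set."""
    def __init__(self, n_actions: int, seed: int | None = None):
        self._rng = random.Random(seed)
        self._n_actions = n_actions

    def reset(self, seed: int | None = None) -> None:
        if seed is not None:
            self._rng.seed(seed)

    def act(self, observation: Any) -> int:
        return self._rng.randrange(self._n_actions)

# A PyTorch PPO policy or an LLM prompt wrapper slots into the evaluator identically:
# expose reset()/act(), and the evaluation loop never needs to know the paradigm.
```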
### Pitfall Guide
Evaluating agents across paradigms introduces unique challenges. The following pitfalls are common in production benchmarking and must be avoided.
| Pitfall | Explanation | Fix |
| :--- | :--- | :--- |
| **Modality Mismatch** | Feeding natural language observations to an agent trained on state vectors, or vice versa, leads to catastrophic failure. | Always align the observation modality with the agent's training data. Test multiple modalities to find the optimal format. |
| **Oracle Ignorance** | Comparing raw reward scores across tasks with different scales creates misleading conclusions. | Implement oracle reference policies for all tasks. Normalize all scores against the oracle baseline before aggregation. |
| **Static Evaluation** | Using fixed seeds or pre-generated tasks allows agents to memorize solutions, inflating performance metrics. | Use procedural generation with high variance. Evaluate over thousands of episodes to ensure statistical significance. |
| **Reasoning Neglect** | Evaluating LLMs without a reasoning harness underestimates their capability. Raw LLMs often fail to plan effectively. | Implement a reasoning wrapper that enables chain-of-thought or tree-of-thought inference. Expect 3x-10x performance gains. |
| **ASCII vs NL Bias** | Assuming natural language is superior for LLMs due to their training data. | Test ASCII observations. Structured text often outperforms natural language by reducing token overhead and ambiguity. |
| **Single Metric Trap** | Relying solely on overall score masks weaknesses in specific capability categories. | Decompose results by capability category, difficulty level, and observation modality. |
| **SFT Distribution Shift** | Using pre-built SFT datasets that do not match the evaluation distribution leads to overfitting. | Generate SFT datasets from the same procedural distribution as the evaluation tasks. Validate on held-out seeds. |
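The reasoning-harness and modality pitfalls above are easiest to see in code. The sketch below wraps a raw text-completion call in a two-stage chain-of-thought prompt over an ASCII observation; `query_llm` is a placeholder for whatever model client you use, and the prompt format is illustrative rather than the benchmark's actual harness.

```python
from typing import Callable

def make_reasoning_agent(query_llm: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a raw LLM call in a minimal chain-of-thought harness.

    query_llm: placeholder for any text-in/text-out model client.
    """
    def act(ascii_observation: str) -> str:
        # Stage 1: ask the model to reason about the state before committing to a move.
        thought = query_llm(
            "You control an agent in the grid below.\n"
            f"{ascii_observation}\n"
            "Think step by step about the best next move. Do not answer yet."
        )
        # Stage 2: condition the action choice on the written-out reasoning.
        answer = query_llm(
            f"Observation:\n{ascii_observation}\n"
            f"Your reasoning:\n{thought}\n"
            "Reply with exactly one action token (e.g. UP, DOWN, LEFT, RIGHT)."
        )
        return answer.strip()

    return act
```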
### Production Bundle
#### Action Checklist
- [ ] **Define Oracle Policies:** Implement optimal policies for all 37 tasks to establish normalization baselines.
- [ ] **Standardize Interface:** Wrap all environments in a Gymnasium-compatible API to support RL and LLM agents.
- [ ] **Enable Procedural Generation:** Configure task generators to produce infinite variations across six capability categories and four difficulty levels.
- [ ] **Implement Multi-Modal Support:** Ensure the environment can output ASCII, natural language, images, and state vectors based on agent requirements.
- [ ] **Build Reasoning Harness:** Develop a composable reasoning wrapper for LLM agents to enable chain-of-thought inference.
- [ ] **Generate SFT Datasets:** Create supervised fine-tuning datasets using oracle trajectories for post-training foundation models.
- [ ] **Run Large-Scale Evaluation:** Execute at least 90,000 episodes across 27 agent configurations to ensure statistical robustness.
- [ ] **Normalize and Decompose:** Calculate oracle-normalized scores and break down results by category, difficulty, and modality.
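One plausible way to realize the SFT-dataset item above is to serialize oracle rollouts into prompt/completion pairs. The file layout, field names, and data shape below are a hypothetical sketch, not the benchmark's shipped tooling; the key point is that training data comes from the same procedural distribution while held-out seeds stay reserved for evaluation.

```python
import json
from typing import Iterable

def write_sft_dataset(rollouts: Iterable[list[dict]], path: str) -> None:
    """Serialize oracle rollouts into JSONL prompt/completion pairs.

    Each rollout is assumed to be a list of steps shaped like
    {"ascii": "<grid>", "oracle_action": "UP"}; the field names are illustrative.
    """
    with open(path, "w", encoding="utf-8") as f:
        for rollout in rollouts:
            for step in rollout:
                record = {
                    "prompt": f"Observation:\n{step['ascii']}\nNext action:",
                    "completion": step["oracle_action"],
                }
                f.write(json.dumps(record) + "\n")
```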
#### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
| :--- | :--- | :--- | :--- |
| **Planning-Intensive Tasks** | PPO or RL Algorithm | RL dominates structured planning and multi-agent coordination. | Low compute cost; high training time. |
| **General Sequential Decision** | GPT-5 Mini with Reasoning | Foundation models offer broad generalization and adaptability. | High API cost; low training time. |
| **Resource-Constrained Env** | SFT-LLM with ASCII | Pre-built SFT datasets reduce inference costs; ASCII improves efficiency. | Medium compute; low latency. |
| **Multi-Modal Perception** | VLM with Image Observations | Visual tasks require image inputs; VLMs handle pixel data effectively. | High compute; moderate latency. |
| **Rapid Prototyping** | Unified Harness + GPT-5 Mini | Fast iteration with reasoning harness; no training required. | High API cost; immediate results. |
#### Configuration Template
Use this YAML configuration to set up a standardized evaluation run. This template ensures consistent normalization, procedural generation, and metric aggregation.
```yaml
benchmark:
id: "unified_seq_dec_bench_v1"
interface: "gymnasium"
tasks:
count: 37
categories:
- "planning"
- "memory"
- "reasoning"
- "coordination"
- "adaptation"
- "generalization"
difficulties: [1, 2, 3, 4]
procedural:
enabled: true
seed_range: [0, 10000]
observations:
modalities:
- "ascii"
- "natural_language"
- "image"
- "state_vector"
- "hybrid"
default: "ascii"
evaluation:
episodes: 10000
parallel_workers: 32
normalization:
method: "oracle"
oracle_policy: "optimal_reference"
agents:
- name: "gpt5_mini_reasoning"
type: "llm"
harness: "reasoning_wrapper"
modality: "ascii"
- name: "ppo_planner"
type: "rl"
algorithm: "PPO"
modality: "state_vector"
metrics:
primary: "oracle_normalized_score"
breakdown:
- "capability_category"
- "difficulty_level"
- "observation_modality"
```
#### Quick Start Guide
- Install Dependencies: Set up the Gymnasium environment and benchmark utilities. Ensure Python 3.10+ is available for RL compatibility.

```bash
pip install gymnasium benchmark-utils
```

- Register Environment: Register the procedural tasks with the Gymnasium registry using the configuration template.

```python
import gymnasium as gym
from benchmark_utils import register_tasks

register_tasks(config_path="benchmark_config.yaml")
```

- Load Oracle Policies: Initialize the oracle reference policies for score normalization.

```python
from benchmark_utils import OracleLoader

oracle = OracleLoader.load_all()
```

- Instantiate Agent: Create your agent using the unified interface. For LLMs, wrap with a reasoning harness.

```python
from benchmark_utils import LLMReasoningAgent

agent = LLMReasoningAgent(
    model="gpt-5-mini",
    harness="chain_of_thought",
    modality="ascii"
)
```

- Run Evaluation: Execute the evaluation loop and generate the normalized report.

```python
from benchmark_utils import BenchmarkEvaluator

evaluator = BenchmarkEvaluator(env_id="UnifiedSeqDec-v0", oracle_policy=oracle)
report = evaluator.evaluate_agent(agent, episodes=10000)
print(f"Normalized Score: {report.normalized_score:.3f}")
```
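To avoid the single-metric trap after a run, inspect the per-category breakdown as well as the headline score. This continues from the `report` produced in the last step; the attribute names below (`category_breakdown` and its `sum`/`count` fields) are assumptions about the report object rather than a documented schema.

```python
# Hypothetical report introspection; field names are assumptions, not a published schema.
print(f"Overall oracle-normalized score: {report.normalized_score:.3f}")
for category, stats in report.category_breakdown.items():
    mean_score = stats["sum"] / stats["count"]
    print(f"  {category:>15}: {mean_score:.3f} over {stats['count']} episodes")
```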
This unified approach enables rigorous, comparable evaluation across all agent paradigms. By standardizing the interface, normalizing against oracles, and decomposing results by capability, researchers can accurately measure progress toward general sequential decision-making and identify the most effective architectures for specific use cases.
