I read a multi-agent reasoning paper, built the Claude-native version, and measured everything

By Codcompass Team·2026-06-01·53 min read

RecursiveMAS (arXiv 2604.25917) showed that agents sharing internal reasoning state outperform agents that share only final outputs. The average accuracy gain across benchmarks was 8.3 points. The mechanism: each agent passes not just its answer but the latent embeddings from its own reasoning process, and the next agent conditions on both. The paper is a good result.

The catch is access. RecursiveMAS requires open-weight models with hidden states exposed at inference time. That rules out Claude, GPT-4o, and Gemini. I built a Claude-native version using the Anthropic extended thinking API. The core idea transfers: instead of passing latent vectors, pass the full thinking text. The paper calls it internal state sharing; the Claude version calls it thinking-block relay.

The architecture problem

Claude's extended thinking blocks carry an encrypted signature tied to the originating conversation. You cannot pass a signed thinking block into a different agent's messages array. The API rejects it. The workaround: extract the text from the thinking block and inject it as a regular user message.

# Extract thinking text from Agent 1
thinking_text = next(
    (b.thinking for b in response.content if b.type == "thinking"), ""
)

# Inject into Agent 2 as regular context, not as a thinking block
context = f"Prior agent reasoning:\n{thinking_text}"

Enter fullscreen mode Exit fullscreen mode

The signature does not transfer. The reasoning does.

relay-structured: what I built first

The first architecture was a Planner > Critic > Solver loop where each agent emits a compact mental model JSON instead of raw thinking text. Raw thinking at a 1024-token budget is often compressed and fragmented. The hypothesis was that 150 tokens of structured signal carries more information per token than 1024 tokens of compressed prose.

The schema each agent emits:

{
  "interpretation": "how the agent read the problem",
  "key_steps": ["step 1", "step 2"],
  "rejected_approaches": ["approach tried and discarded"],
  "confidence": 0.85,
  "potenti

al_errors": "where this reasoning might go wrong" }


Enter fullscreen mode Exit fullscreen mode

`confidence` and `potential_errors` are the load-bearing fields. They tell downstream agents where to apply more scrutiny, without requiring those agents to parse a full reasoning trace. A critic that can see "confidence: 0.4, potential\_errors: I may have misread the constraint on x" has a different starting point than one that reads 800 tokens of prose and has to infer the same thing.

### [](#results-n50-preliminary)Results (n=50, preliminary)

Condition

Accuracy

Avg tokens

single-agent

70.0%

1,212

relay-structured

72.0%

18,821

+2 points. 15x token cost. relay-structured wins by one problem out of 50. The direction is right. The cost ratio is not deployable as-is. Running the full Planner > Critic > Solver chain on every request is not justified by 2 points at n=50.

## [](#why-i-did-not-build-readbefore)Why I did not build read-before

The obvious next step: let Agent 2 read Agent 1's JSON before producing its own answer. I skipped it. The problem is anchoring. Agent 2 sees Agent 1's answer before forming its own view, and it will tend to confirm rather than challenge. This is mathematically equivalent to relay-structured with role specialization removed and anchoring added. The expected result is worse, not better. It was not worth building.

## [](#readafter-disagreement-escalation)read-after + disagreement escalation

The design: both agents reason independently. No shared context during reasoning. After both finish, compare their answers in code, no API call. If they agree, return the higher-confidence answer. If they disagree, run a resolver that sees both answers and both mental model JSONs and picks the stronger reasoning chain.

Independent reasoning first means no anchoring. The comparison step is pure code, so there is no token cost when agents agree. The resolver only fires on genuine disagreement, which on a 5000-token budget against hard MATH problems turns out to be about 40% of the time. On easy questions where both agents agree, the cost is 2x single-agent. On harder questions requiring the resolver, it is around 3.5x. Weighted average across both cases: roughly 2.9x single-agent, versus relay-structured's 15x.

[![Self-relay architecture: two independent agents, code disagreement check, resolver on mismatch](https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.jsdelivr.net%2Fgh%2Fbhj37193%2Frelay%404df48ac86da4239c8fb4ea284075b1f3c765c07a%2Fassets%2Fcover.png)](https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.jsdelivr.net%2Fgh%2Fbhj37193%2Frelay%404df48ac86da4239c8fb4ea284075b1f3c765c07a%2Fassets%2Fcover.png)

The core of the implementation:

def run_self_relay(question, n_rounds=2, model=DEFAULT_MODEL, budget_tokens=DEFAULT_BUDGET): # Both agents reason independently, no shared context mm1, answer1, tokens1 = agent_call_structured("solver", question, [], [], model, budget_tokens) mm2, answer2, tokens2 = agent_call_structured("solver", question, [], [], model, budget_tokens)

boxed1 = _extract_boxed(answer1)
boxed2 = _extract_boxed(answer2)
agree = bool(boxed1 and boxed2 and boxed1 == boxed2)

conf1 = float(mm1.get("confidence", 0)) if mm1 else 0.0
conf2 = float(mm2.get("confidence", 0)) if mm2 else 0.0

if agree:
    final_answer = answer1 if conf1 >= conf2 else answer2
    resolver_tokens = 0
else:
    # Resolver sees the problem, both answers, and both reasoning chains
    resolver_prompt = (
        f"Two agents solved a math problem independently and disagreed.\n\n"
        f"Problem: {question}\n\n"
        f"Agent 1 answered: {boxed1}\n"
        f"Agent 1 reasoning: {json.dumps(mm1, indent=2)}\n\n"
        f"Agent 2 answered: {boxed2}\n"
        f"Agent 2 reasoning: {json.dumps(mm2, indent=2)}\n\n"
        f"Evaluate which reasoning chain is stronger. "
        f"Return the correct answer inside \\boxed{{}}."
    )
    response = client.messages.create(
        model=model, max_tokens=8000,
        thinking={"type": "enabled", "budget_tokens": budget_tokens},
        messages=[{"role": "user", "content": resolver_prompt}],
    )
    final_answer = next((b.text for b in response.content if b.type == "text"), "")
    resolver_tokens = response.usage.input_tokens + response.usage.output_tokens

return {
    "final_answer": final_answer,
    "total_tokens": tokens1 + tokens2 + resolver_tokens,
    "agreement": agree,
    "agent1_answer": answer1, "agent1_confidence": conf1,
    "agent2_answer": answer2, "agent2_confidence": conf2,
    "resolver_answer": final_answer if not agree else "",
    "resolver_tokens": resolver_tokens,
}


Enter fullscreen mode Exit fullscreen mode

## [](#results)Results

200 examples, MATH level 4-5, `claude-sonnet-4-6`, budget=5000 tokens, preliminary:

Condition

Accuracy

Avg tokens

single-agent

63.0%

1,234

self-relay

65.5%

3,290

self-relay gains 2.5 points over single-agent at 2.7x the token cost. That is a different profile from relay-structured's 15x cost for 2 points: the read-after architecture gets a similar accuracy gain at roughly one-sixth the token overhead. The disagreement rate on this benchmark was approximately 40%, consistent with the expected range.

The token cost is 2.7x single-agent. relay-structured's was 15x. At the same 5000-token budget, self-relay gets a similar accuracy benefit at a fraction of the cost, because the resolver only fires when needed.

## [](#where-this-fits-in-production)Where this fits in production

Self-relay fits anywhere the cost of a wrong answer exceeds the cost of a second call. Legal document review, code security audits, medical triage, financial analysis. On those tasks, running two independent reasoning chains and only paying for arbitration when they disagree is a straightforward reliability pattern.

The practical deployment is confidence-gating, not always-on relay. Run single-agent first and check the `confidence` field from the mental model JSON. If confidence is above a threshold, return the answer. If not, escalate to self-relay:

def answer_with_confidence_gate(question, threshold=0.75, model="claude-sonnet-4-6", budget=5000): # First pass: single agent with mental model mm, answer, tokens = agent_call_structured("solver", question, [], [], model, budget) confidence = float(mm.get("confidence", 0))

if confidence >= threshold:
    return {"answer": answer, "method": "single", "total_tokens": tokens}

# Low confidence: escalate to self-relay
result = run_self_relay(question, model=model, budget_tokens=budget)
result["method"] = "self-relay"
return result


Enter fullscreen mode Exit fullscreen mode

On a real workload where roughly 25-30% of questions fall below threshold, this brings the average token overhead from 2.7x down to roughly 1.3-1.5x single-agent. The relay only runs on requests that actually need it.

## [](#what-i-would-try-next)What I would try next

The n=500 threshold was not triggered (65.5% is below 72.0%), so these results stay at n=200. The next useful test is calibrating the confidence threshold: at what level does escalating to self-relay recover correct answers, and what is the false-escalation rate? A domain-specific benchmark, legal review or code analysis, would also stress-test whether the disagreement pattern and resolver accuracy hold outside of competition math.

The repo is at [github.com/bhj37193/relay](https://github.com/bhj37193/relay). The eval harness is in `relay/eval_structured.py`. All results are in `eval_structured_results.json`.

🎉 Mid-Year Sale — Unlock Full Article

Base plan from just $4.99/mo or $49/yr

7-day free trial · Cancel anytime · 30-day money-back

The architecture problem

relay-structured: what I built first

🎉 Mid-Year Sale — Unlock Full Article

Production Bundle