"Return JSON only" doesn't force JSON. Here's what actually forces it.

By Codcompass Team·2026-05-06·5 min read


Current Situation Analysis

Production LLM pipelines frequently fail when relying on prompt instructions like "Return JSON only. No preamble, no explanation." to enforce output structure. While these instructions work reliably in testing and staging, they break in production when the model injects conversational acknowledgments (e.g., "Sure! Here's my evaluation:") before the JSON object. This triggers json.loads() exceptions that are often caught silently, returning None to downstream logic. The pipeline continues running with corrupted evaluation scores, causing silent data degradation across hundreds of requests before detection.

The root failure mode lies in how LLMs process format instructions. Prompt-based directives operate as soft mechanisms that only shift the probability distribution over the next token. During training, the model associates such phrasing with JSON-shaped tokens, increasing their probability mass to 95–99% on well-tuned models. However, probabilistic shifting does not eliminate invalid outcomes. At any temperature above 0, sampling can select low-probability preamble tokens. Even at temperature 0 (deterministic argmax selection), if contextual factors (long system prompts, conversational user input, or helpfulness-aligned fine-tuning) push the preamble token to the highest probability, the model will deterministically output it. Instruction-following provides no hard guarantees; it merely biases the distribution.
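The sampling argument above can be made concrete with a small sketch. The token list and logit values below are invented for illustration and are not measurements from any model; the point is only that a roughly 97% JSON-first distribution still yields preambles under sampling, and that greedy decoding follows whichever logit the context happens to push highest.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities at the given temperature (> 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(tokens, logits):
    """Temperature-0 decoding: always return the highest-logit token."""
    return tokens[max(range(len(logits)), key=logits.__getitem__)]

# Hypothetical first-token candidates and logits (illustrative values only).
tokens = ["{", "Sure", "Here"]
logits = [6.0, 2.0, 1.0]

probs = softmax(logits)
p_json = probs[0]

# Sampling: count how often a non-JSON first token is drawn over many requests.
rng = random.Random(0)
draws = rng.choices(tokens, weights=probs, k=10_000)
preamble_hits = sum(1 for t in draws if t != "{")

# Assumed effect of a chatty context: the "Sure" logit overtakes "{".
shifted_logits = [6.0, 6.5, 1.0]

print(f"P('{{') = {p_json:.3f}, preamble draws in 10k = {preamble_hits}")
print(greedy(tokens, logits), greedy(tokens, shifted_logits))
```

With the shifted logits, greedy decoding returns the preamble token on every single call, which is why temperature 0 alone is not a fix.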

WOW Moment: Key Findings

Approach	Parse Success Rate	Latency Overhead	Failure Mode
Soft Prompting	95–99%	~0 ms	Silent `None` propagation & downstream corruption
Constrained Decoding	~100% (excluding safety refusals)	~5–15 ms	Hard type-safe guarantee; explicit boundary handling required

Constrained decoding shifts the paradigm from probabilistic bias to deterministic exclusion. By compiling a JSON schema into a finite-state machine and masking invalid logits to negative infinity at each decoding step, the inference engine physically prevents token sequences that violate the schema. The approach operates at O(1) time complexity per token, making it production-viable without meaningful latency penalties.
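A toy version of the masking step may help. This is a minimal sketch, not how Outlines or any other engine is actually implemented: the seven-token vocabulary, the hand-written state machine for `{"score":7}`, and the `fake_model_logits` stand-in are all assumptions made for the example. The per-state dictionary lookup is what makes the allowed-token check constant time in this sketch.

```python
import json
import math

# Toy vocabulary; real tokenizers have tens of thousands of entries.
VOCAB = ["Sure", "!", "{", '"score"', ":", "7", "}"]

# Hypothetical FSM for the single-field object {"score":7}.
# Each state maps an allowed token to the next state.
FSM = {
    "start": {"{": "key"},
    "key": {'"score"': "colon"},
    "colon": {":": "value"},
    "value": {"7": "close"},
    "close": {"}": "done"},
}

def mask_logits(logits, state):
    """Set the logit of every token the FSM disallows in `state` to -inf."""
    allowed = FSM[state]
    return [x if tok in allowed else -math.inf for tok, x in zip(VOCAB, logits)]

def fake_model_logits(_prefix):
    """Stand-in for a model that always prefers the preamble token 'Sure'."""
    return [9.0, 3.0, 2.0, 1.0, 1.0, 1.0, 1.0]

def constrained_greedy_decode():
    """Greedy decoding where only FSM-valid tokens can ever be selected."""
    state, out = "start", []
    while state != "done":
        masked = mask_logits(fake_model_logits(out), state)
        best = max(range(len(VOCAB)), key=masked.__getitem__)
        token = VOCAB[best]
        out.append(token)
        state = FSM[state][token]
    return "".join(out)

result = constrained_greedy_decode()
print(json.loads(result))
```

Even though the stand-in model strongly prefers "Sure" at every step, that token is never emitted because its logit is masked before selection.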

Core Solution

Constrained decoding (also known as structured generation or grammar-guided sampling) enforces output validity at the inference layer, prior to sampling. The mechanism compares the current partial output against a formal grammar or schema at every decoding step. Any token that would transition the parser into an invalid state has its logit set to negative infinity, effectively removing it from the probability space. The model cannot produce that token. This is implemented in production via:

Outlines — Reference implementation compiling schemas to FSMs for O(1) vocabulary masking
llama.cpp — --grammar-file flag using GBNF grammar format
OpenAI Structured Outputs — response_format: { type: "json_schema", json_schema: {...} } with token-level schema enforcement
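For the OpenAI option, the request payload can be written out by hand when a Pydantic model is not available. The sketch below is one possible payload for the score/reason object used in this article; the name "evaluation" is a label chosen for the example, and the requirement that strict mode lists every property under `required` and sets `additionalProperties` to false reflects our reading of the OpenAI documentation, so verify it against the current API reference.

```python
import json

# JSON Schema for the evaluation object used in this article.
evaluation_schema = {
    "type": "object",
    "properties": {
        "score": {"type": "integer"},
        "reason": {"type": "string"},
    },
    "required": ["score", "reason"],
    "additionalProperties": False,
}

# Value to pass as the response_format argument of a chat completion request.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "evaluation",  # label chosen for this example
        "strict": True,
        "schema": evaluation_schema,
    },
}

print(json.dumps(response_format, indent=2))
```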

The soft approach — what most pipelines do:

import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": 'Evaluate this response. Return JSON only: {"score": int, "reason": str}'
    }]
)

try:
    result = json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
    result = None  # silent failure: downstream receives None and keeps running

The try/except here is necessary but not sufficient. Catching the error and returning None just defers the damage: whatever uses result now has to handle None everywhere, and if it doesn't, the failure propagates silently and corrupts your scores.

The hard approach — schema enforced at the token level:

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # reads OPENAI_API_KEY from the environment

class Evaluation(BaseModel):
    score: int
    reason: str

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Evaluate this response."}],
    response_format=Evaluation,
)

# A validated Evaluation; None only if the model refused (check message.refusal)
result = response.choices[0].message.parsed

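No try/except around json.loads, and no None caused by malformed JSON. Whenever the model answers, result is a typed Evaluation object because the schema was enforced at the token level before the response was ever assembled. The one case that still needs explicit handling is a safety refusal, where the message carries a refusal string and parsed is None.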

Architecture & Boundary Hardening: In production judge pipelines, the immediate fix involves boundary parsing hardening while the primary migration targets token-level enforcement:

import json

def _safe_parse_json(raw: str) -> dict:
    # Keep only the span from the first "{" to the last "}" to drop preamble text
    start = raw.find("{")
    end = raw.rfind("}") + 1
    if start == -1 or end <= start:
        raise ValueError(f"No JSON object found in output: {raw[:120]!r}")
    stripped = raw[start:end]
    try:
        return json.loads(stripped)
    except json.JSONDecodeError as e:
        # Raise instead of returning None so the failure cannot pass silently
        raise ValueError(f"Judge returned unparseable output: {raw[:120]!r}") from e

The stripping logic serves exclusively as a fallback for open-weight model calls lacking native schema enforcement. For the primary judge endpoint, response_format with a Pydantic schema guarantees parse reliability. Model documentation must also be updated to reflect that output determinism derives from inference-layer constraints, not prompt engineering, preventing regression during model swaps.
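One fragility of first-brace-to-last-brace stripping is trailing text that itself contains braces. As an alternative for the fallback path only, the sketch below swaps in the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given index and ignores whatever follows. The function name `extract_first_json_object` and the sample string are hypothetical and do not come from the pipeline described here.

```python
import json

def extract_first_json_object(raw: str) -> dict:
    """Return the first decodable JSON object in `raw`, ignoring surrounding text."""
    decoder = json.JSONDecoder()
    idx = raw.find("{")
    while idx != -1:
        try:
            # raw_decode parses one JSON value starting exactly at idx
            obj, _end = decoder.raw_decode(raw, idx)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        idx = raw.find("{", idx + 1)  # try the next candidate brace
    raise ValueError(f"No JSON object found in output: {raw[:120]!r}")

raw = 'Sure! Here\'s my evaluation: {"score": 4, "reason": "ok"} Let me know if {this} helps.'
parsed = extract_first_json_object(raw)
print(parsed)
```

This remains a fallback and gives no schema guarantee: the returned dict still has to be validated against the expected fields before use.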

Pitfall Guide

Relying on Prompt Instructions for Hard Guarantees: Format directives only shift next-token probability distributions. They cannot exclude invalid outcomes or override contextual nudges that increase preamble likelihood.
Silent None Propagation: Catching JSONDecodeError and returning None defers failure to downstream consumers. If downstream logic lacks explicit None handling, evaluation scores and business metrics become silently corrupted.
Assuming Deterministic Decoding Eliminates Risk: Temperature 0 selects the argmax token deterministically, but if contextual factors make a preamble token the highest probability, the model will still output it. Determinism does not equal correctness.
Ignoring Safety Refusals & Content Filters: Even with constrained decoding, API-level safety refusals or content filters can bypass schema enforcement and return non-schema responses. Boundary code must explicitly handle refusal states.
Treating Regex/Stripping Fallbacks as Primary Logic: String manipulation fallbacks are inherently fragile and should only be used for open-weight models without native schema support. Relying on them for primary inference paths reintroduces parsing latency and edge-case failures.
Outdated Model Documentation: Failing to update model cards and architecture docs to specify that reliability comes from constrained decoding (not prompt engineering) causes silent regressions when teams swap base models or adjust inference configurations.
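The silent-None and refusal pitfalls above can be addressed with one small boundary function. This is a sketch under stated assumptions: the exception classes and `unwrap_parsed` are hypothetical names, and the message objects are `SimpleNamespace` stand-ins that mimic the `parsed` and `refusal` attributes of an OpenAI SDK message, so no network call is made.

```python
from types import SimpleNamespace

class JudgeRefusalError(RuntimeError):
    """Raised when the model declines to answer instead of returning the schema."""

class JudgeEmptyResultError(RuntimeError):
    """Raised when neither a parsed object nor a refusal is present."""

def unwrap_parsed(message):
    """Return message.parsed, or raise a typed error rather than returning None."""
    refusal = getattr(message, "refusal", None)
    if refusal:
        raise JudgeRefusalError(refusal)
    parsed = getattr(message, "parsed", None)
    if parsed is None:
        raise JudgeEmptyResultError("response contained no parsed object")
    return parsed

# Stand-ins for response.choices[0].message (hypothetical values).
ok_message = SimpleNamespace(parsed={"score": 4, "reason": "clear"}, refusal=None)
refused_message = SimpleNamespace(parsed=None, refusal="I can't help with that.")

print(unwrap_parsed(ok_message))
try:
    unwrap_parsed(refused_message)
except JudgeRefusalError as exc:
    print(f"refusal surfaced explicitly: {exc}")
```

Raising distinct exception types lets the caller decide whether to retry, skip the sample, or alert, instead of letting None flow into score aggregation.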

Deliverables

📘 Constrained Decoding Migration Blueprint A step-by-step architectural guide for transitioning from soft-prompt JSON parsing to token-level schema enforcement. Covers FSM compilation workflows, Pydantic schema design patterns, fallback strategy implementation for open-weight models, and latency profiling for O(1) logit masking.

✅ Production JSON Pipeline Hardening Checklist

Define strict Pydantic/JSON Schema for all LLM outputs
Implement token-level schema enforcement (response_format or grammar files)
Add explicit boundary handling for safety refusals & content filters
Replace silent None returns with explicit error boundaries or typed exceptions
Configure fallback stripping logic only for non-enforced model endpoints
Update model cards & runbooks to document inference-layer constraints vs prompt engineering
Add monitoring alerts for schema validation failures & parse latency spikes

⚙️ Configuration Templates

OpenAI Structured Outputs setup (client.beta.chat.completions.parse with Pydantic models)
llama.cpp GBNF grammar file structure for JSON schema enforcement
Outlines library integration snippet for custom FSM compilation and logit masking
