Diagnosing Deterministic Empty Outputs in Multi-Stage LLM Pipelines: The Hidden Reasoning Floor

Current Situation Analysis

Multi-stage LLM workflows—particularly Graph-RAG systems, cognitive middleware, and agentic planners—frequently encounter a silent failure mode: specific pipeline stages return zero-length responses while others execute normally. Engineering teams typically interpret this as a model capability limitation, prompt engineering flaw, or architectural incompatibility (e.g., Dense vs. MoE variants). In reality, the failure is almost always a deterministic budget constraint masquerading as a reasoning deficit.

The problem is systematically overlooked because modern LLM providers abstract token consumption behind simple max_tokens parameters. Developers assume that setting a cap guarantees termination, not realizing that many contemporary models allocate a fixed portion of that budget to internal reasoning before emitting a single visible token. When the cap falls below this internal threshold, the model terminates cleanly without output, producing an empty string. This behavior is highly deterministic, stage-agnostic, and completely reversible by adjusting the token budget.

Empirical evidence from production deployments confirms this pattern. In a documented Graph-RAG pipeline using gemma4:e4b (4B parameters), four cognitive stages (query rewriting, planning, critique, and fact-checking) behaved inconsistently: two returned valid outputs, while two consistently produced empty responses. Switching to a larger model (gemma3:12b) resolved the symptom but masked the underlying mechanism. Cross-validation across three independent deployment contexts (local Ollama, the managed Gemini API, and sovereign Ollama instances) isolated the root cause: a hidden reasoning floor of approximately 500 tokens. When max_tokens was set below this threshold, the budget was exhausted before any visible token could be emitted. Raising the cap to 4096 restored a 100% success rate in all three environments without modifying prompts or architecture.

WOW Moment: Key Findings

The critical insight is that a reasoning-enabled LLM does not begin generating visible text until it has finished its internal reasoning, and that reasoning is paid for out of the same max_tokens budget. This creates a hard floor on the minimum token allocation per stage. The table below compares how different token caps affected pipeline behavior:

Approach	Success Rate	Avg Latency	Visible Output	Root Cause
Cap 200 tokens	0%	2.1s	Empty	Budget starvation; reasoning floor unmet
Cap 400 tokens	0%	4.3s	Empty	Linear latency scaling; still below 500-token floor
Cap 4096 tokens	100%	5.3s–7.1s	Full response	Reasoning floor satisfied; visible generation proceeds

This finding matters because it shifts debugging from prompt engineering and model selection to token budget accounting. Empty responses become predictable: if max_tokens < hidden_reasoning_floor, the visible output is deterministically empty. The ~500-token floor observed in gemma4:e4b corresponds to a hidden-to-visible token ratio of roughly 5:1 to 6:1 for single-stage operations. It is a model-level characteristic, not a prompt artifact. Recognizing this enables precise budget allocation, removes guesswork from stage configuration, and prevents unnecessary model upgrades or prompt restructuring.
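
The rule above can be written as a small budget calculation. This is a minimal sketch, assuming the floor behaves like a fixed number of tokens subtracted from the cap; the names estimateVisibleBudget and BudgetEstimate are hypothetical, and the 500-token floor is the approximate value reported for gemma4:e4b.

```typescript
// Hypothetical helper: assumes the hidden reasoning floor is a fixed
// token count that is spent before any visible token is emitted.
interface BudgetEstimate {
  cap: number;
  reasoningFloor: number;
  visibleBudget: number; // tokens left for visible text after the floor
  starved: boolean;      // true when nothing is left for visible output
}

function estimateVisibleBudget(cap: number, reasoningFloor: number): BudgetEstimate {
  const visibleBudget = Math.max(0, cap - reasoningFloor);
  return { cap, reasoningFloor, visibleBudget, starved: visibleBudget === 0 };
}

// The three caps from the comparison table, against a ~500-token floor.
for (const cap of [200, 400, 4096]) {
  const estimate = estimateVisibleBudget(cap, 500);
  console.log(`cap=${cap} visibleBudget=${estimate.visibleBudget} starved=${estimate.starved}`);
}
```

Under this simplified model, caps of 200 and 400 leave no visible budget, while 4096 leaves 3596 tokens for the response.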

Core Solution

Resolving deterministic empty outputs requires a systematic approach to token budget measurement, per-stage configuration, and pipeline validation. The following implementation demonstrates how to instrument token consumption, enforce minimum floors, and maintain production stability.

Step 1: Instrument Token Consumption

Before adjusting caps, measure how many tokens the model consumes internally before emitting visible output. This requires recording, for each tested cap, the requested max_tokens, the token usage reported by the provider, the latency, and whether the visible text was empty.

interface StageMetrics {
  requestedCap: number;
  consumedTokens: number;
  visibleTokens: number;
  latencyMs: number;
  isEmpty: boolean;
}

async function measureReasoningFloor(
  model: string,
  prompt: string,
  testCaps: number[]
): Promise<StageMetrics[]> {
  const results: StageMetrics[] = [];

  for (const cap of testCaps) {
    const start = Date.now();
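    // llmClient is assumed to be your own provider wrapper exposing generate().
    const response = await llmClient.generate(model, prompt, {
      maxTokens: cap,
      temperature: 0.1,
    });
    const latency = Date.now() - start;

    results.push({
      requestedCap: cap,
      // totalTokens usually includes prompt tokens; subtract them if your
      // provider reports prompt usage separately.
      consumedTokens: response.usage?.totalTokens ?? 0,
      // Some providers count hidden reasoning tokens in completionTokens,
      // so treat this as an upper bound on visible tokens.
      visibleTokens: response.usage?.completionTokens ?? 0,
      latencyMs: latency,
      // The empty-text check is the reliable signal for budget starvation.
      isEmpty: response.text.trim().length === 0,
    });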
  }

  return results;
}
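
Once the measurements are collected, the floor can be bracketed from them. The sketch below is one possible way to do that, assuming only the requestedCap and isEmpty fields are needed; CapObservation, FloorBracket, and bracketReasoningFloor are hypothetical names introduced for illustration.

```typescript
// Hypothetical types: a trimmed-down view of the per-cap measurements.
interface CapObservation {
  requestedCap: number;
  isEmpty: boolean;
}

interface FloorBracket {
  highestEmptyCap: number | null;   // largest cap that still returned nothing
  lowestNonEmptyCap: number | null; // smallest cap that produced visible text
}

// The floor lies somewhere between the two returned caps.
function bracketReasoningFloor(observations: CapObservation[]): FloorBracket {
  let highestEmptyCap: number | null = null;
  let lowestNonEmptyCap: number | null = null;

  for (const o of observations) {
    if (o.isEmpty) {
      if (highestEmptyCap === null || o.requestedCap > highestEmptyCap) {
        highestEmptyCap = o.requestedCap;
      }
    } else if (lowestNonEmptyCap === null || o.requestedCap < lowestNonEmptyCap) {
      lowestNonEmptyCap = o.requestedCap;
    }
  }
  return { highestEmptyCap, lowestNonEmptyCap };
}

// Observations matching the comparison table: 200 and 400 empty, 4096 not.
const bracket = bracketReasoningFloor([
  { requestedCap: 200, isEmpty: true },
  { requestedCap: 400, isEmpty: true },
  { requestedCap: 4096, isEmpty: false },
]);
console.log(bracket);
```

A wide bracket such as 400 to 4096 means more intermediate caps should be tested before a floor is recorded.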

Step 2: Implement Per-Stage Budget Configuration

Hardcoding global token limits causes cross-stage failures. Instead, define explicit budgets per pipeline stage, accounting for the model's hidden reasoning floor.

type PipelineStage = 'QUERY_REWRITE' | 'PLANNING' | 'CRITIQUE' | 'FACT_CHECK';

interface StageConfig {
  stage: PipelineStage;
  minTokens: number;   // Hidden reasoning floor + safety margin
  maxTokens: number;   // Upper bound for visible output
  timeoutMs: number;
  fallbackModel?: string;
}

const STAGE_BUDGETS: Record<PipelineStage, StageConfig> = {
  QUERY_REWRITE: {
    stage: 'QUERY_REWRITE',
    minTokens: 600,    // ~500 floor + 20% margin
    maxTokens: 4096,
    timeoutMs: 8000,
  },
  PLANNING: {
    stage: 'PLANNING',
    minTokens: 700,    // Planning requires deeper reasoning
    maxTokens: 4096,
    timeoutMs: 10000,
  },
  CRITIQUE: {
    stage: 'CRITIQUE',
    minTokens: 650,
    maxTokens: 4096,
    timeoutMs: 9000,
  },
  FACT_CHECK: {
    stage: 'FACT_CHECK',
    minTokens: 600,
    maxTokens: 4096,
    timeoutMs: 8000,
  },
};
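
The minTokens values above can be derived from the measured floor rather than typed by hand. The following sketch assumes a per-stage percentage margin; the 20% margin is stated in the configuration comment, while the 40% and 30% margins for planning and critique are inferred here from the 700 and 650 values and are not confirmed by any measurement. The names minTokensFor and STAGE_MARGIN_PERCENT are hypothetical.

```typescript
// Hypothetical helper: floor plus a percentage margin, rounded up.
// Integer arithmetic avoids floating-point drift (500 * 120 / 100 = 600).
function minTokensFor(floor: number, marginPercent: number): number {
  return Math.ceil((floor * (100 + marginPercent)) / 100);
}

// Assumed margins, back-calculated from the configured minTokens values.
const STAGE_MARGIN_PERCENT: Record<string, number> = {
  QUERY_REWRITE: 20,
  PLANNING: 40,
  CRITIQUE: 30,
  FACT_CHECK: 20,
};

const measuredFloor = 500; // approximate floor reported for gemma4:e4b

for (const [stage, margin] of Object.entries(STAGE_MARGIN_PERCENT)) {
  console.log(`${stage}: minTokens=${minTokensFor(measuredFloor, margin)}`);
}
```

Deriving the values this way keeps every stage in step when the floor is re-measured for a different model.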

Step 3: Enforce Budget Validation at Runtime

Prevent pipeline execution if the configured cap falls below the measured floor. This acts as a circuit breaker against deterministic empty outputs.

class TokenBudgetValidator {
  constructor(private model: string, private measuredFloor: number) {}

  validate(config: StageConfig): void {
    if (config.maxTokens < this.measuredFloor) {
      throw new Error(
        `Budget starvation detected for ${config.stage}. ` +
        `Configured maxTokens (${config.maxTokens}) is below ` +
        `model reasoning floor (${this.measuredFloor}). ` +
        `Visible output will be deterministically empty.`
      );
    }
  }

  getRecommendedCap(): number {
    // 4096 provides sufficient headroom for reasoning + visible output
    return Math.max(this.measuredFloor * 8, 4096);
  }
}
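
When several stages are misconfigured, it can help to report all of them at once instead of throwing on the first. The sketch below is an illustrative variant, not the validator above: StageCap and findStarvedStages are hypothetical names, and it deliberately uses a slightly stricter comparison (<=), on the assumption that a cap exactly equal to the floor leaves no tokens for visible output.

```typescript
// Hypothetical minimal shape: only the fields the check needs.
interface StageCap {
  stage: string;
  maxTokens: number;
}

// Returns the names of all stages whose cap leaves no room above the floor.
// Assumption: cap === floor is treated as starved (zero visible budget).
function findStarvedStages(stages: StageCap[], measuredFloor: number): string[] {
  return stages
    .filter((s) => s.maxTokens <= measuredFloor)
    .map((s) => s.stage);
}

const starved = findStarvedStages(
  [
    { stage: 'QUERY_REWRITE', maxTokens: 4096 },
    { stage: 'PLANNING', maxTokens: 400 },
    { stage: 'CRITIQUE', maxTokens: 200 },
    { stage: 'FACT_CHECK', maxTokens: 4096 },
  ],
  500,
);

if (starved.length > 0) {
  console.log(`Stages below the reasoning floor: ${starved.join(', ')}`);
}
```

Collecting every violation in one pass shortens the fix cycle when a configuration file has drifted in more than one place.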

Architecture Decisions & Rationale

Why 4096 tokens? It exceeds the ~500-token hidden floor by roughly a factor of 8, leaving ample room for visible generation while remaining cost-effective. It also fits comfortably within typical model context windows.
Why per-stage budgets? Different cognitive operations require varying reasoning depths. Planning and critique typically consume more internal tokens than query rewriting. Uniform caps cause silent failures in deeper stages.
Why explicit validation? Runtime checks catch configuration drift before deployment. The error message explicitly references the reasoning floor, guiding maintainers to the correct fix without requiring deep model internals knowledge.
Why TypeScript? Strong typing prevents accidental cap misconfiguration across pipeline stages. Interface contracts enforce consistency in multi-developer environments.

Pitfall Guide

1. Misattributing Empty Outputs to Model Incompetence

Explanation: Teams often assume a 4B model lacks the capacity for complex reasoning when stages return empty strings. This leads to unnecessary model upgrades or prompt simplification. Fix: Always verify token budgets before evaluating model capability. If max_tokens is below the model's measured floor (~500 tokens for gemma4:e4b), the empty response is a budget artifact, not a capability limit.

2. Ignoring the Hidden Reasoning Floor

Explanation: Developers treat max_tokens as a visible output limit, unaware that models allocate a fixed portion to internal processing. This causes deterministic failures when caps are set too low. Fix: Measure the hidden-to-visible token ratio empirically. Run controlled tests with incrementing caps until visible output appears. Document the floor per model.

3. Hardcoding Global Token Limits

Explanation: Applying a single max_tokens value across all pipeline stages ignores varying reasoning requirements. Shallow stages may succeed while deep stages fail silently. Fix: Implement per-stage budget configuration. Audit each stage's token consumption independently and set minimums based on measured floors.

4. Confusing Latency Scaling with Model Speed

Explanation: Linear latency increases with higher token caps are often misinterpreted as model slowness. In reality, latency scales proportionally to the reasoning budget consumed. Fix: Correlate latency with token consumption. If latency doubles when cap doubles but output remains empty, the model is burning budget on hidden reasoning, not generating text.

5. Overlooking Cross-Stage Budget Mismatches

Explanation: Pipeline orchestrators may pass inconsistent token limits between stages, causing intermittent failures that appear random. Fix: Centralize budget configuration in a single source of truth. Validate all stages against the model's minimum floor before pipeline initialization.

6. Skipping Regression Validation After Cap Adjustments

Explanation: Raising token caps can inadvertently increase costs or latency. Teams often deploy changes without verifying downstream impact. Fix: Run deterministic regression suites post-change. Monitor cost-per-query and latency percentiles. Implement budget alerts if consumption exceeds thresholds.

7. Assuming Architecture Differences Cause Behavior Variance

Explanation: Dense vs. MoE models are frequently blamed for inconsistent outputs. However, a token cap below the reasoning floor starves any model that reasons before answering, regardless of architecture. Fix: Isolate variables by testing the same model across different caps before comparing architectures. Token budget is the primary determinant of empty outputs, not model topology.
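
The latency check described in pitfall 4 can be automated. This is a rough heuristic sketch, assuming that a starved model spends its whole cap on hidden reasoning, so milliseconds per cap token stay roughly constant across empty runs. The names LatencySample and suggestsBudgetStarvation and the 15% tolerance are assumptions chosen for illustration.

```typescript
// Hypothetical sample shape: one timed call per tested cap.
interface LatencySample {
  cap: number;
  latencyMs: number;
  isEmpty: boolean;
}

// Heuristic: if empty runs show a near-constant latency per cap token,
// the model is probably consuming the entire cap on hidden reasoning.
function suggestsBudgetStarvation(samples: LatencySample[], tolerance = 0.15): boolean {
  const empty = samples.filter((s) => s.isEmpty && s.cap > 0);
  if (empty.length < 2) return false; // need at least two points to compare

  const rates = empty.map((s) => s.latencyMs / s.cap);
  const min = Math.min(...rates);
  const max = Math.max(...rates);
  if (min <= 0) return false;

  // Relative spread between the slowest and fastest per-token rate.
  return (max - min) / min <= tolerance;
}

// Latencies from the comparison table: 2.1s at cap 200, 4.3s at cap 400.
const starvationLikely = suggestsBudgetStarvation([
  { cap: 200, latencyMs: 2100, isEmpty: true },
  { cap: 400, latencyMs: 4300, isEmpty: true },
]);
console.log(`Budget starvation likely: ${starvationLikely}`);
```

With the table's figures, the per-token rates are 10.5 ms and 10.75 ms, a spread of about 2%, which is consistent with the cap being consumed in full.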

Production Bundle

Action Checklist

Measure hidden reasoning floor: Run controlled tests with incrementing max_tokens (200, 400, 600, 1000) to identify the threshold where visible output appears.
Audit all pipeline stages: Verify that every cognitive stage's max_tokens configuration exceeds the measured floor by at least 20%.
Implement runtime validation: Add budget checks that prevent pipeline execution if caps fall below the reasoning floor.
Centralize configuration: Move token budgets to a single configuration file or environment variable set to prevent drift.
Run regression suite: Execute deterministic test queries across all stages post-change to confirm 100% success rates.
Monitor cost and latency: Track token consumption and response times to ensure budget increases remain within acceptable operational bounds.
Document model-specific floors: Maintain a registry of hidden reasoning thresholds per model variant to guide future deployments.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Low-latency batch processing	Cap at 2048 tokens	Balances reasoning floor compliance with throughput requirements	Moderate increase (~15-20%)
High-reasoning agent workflow	Cap at 4096 tokens	Ensures deep cognitive stages complete without starvation	Higher increase (~30-40%)
Mixed-model pipeline	Per-stage dynamic budgets	Prevents cross-stage failures while optimizing per-model floors	Variable; optimized per stage
Cost-constrained deployment	Cap at 1024 tokens + fallback model	Meets minimum floor while capping spend; triggers fallback if budget exhausted	Minimal increase; predictable
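
The cost-constrained row combines a tight cap with a fallback model. A minimal sketch of that behavior follows, assuming the caller injects its own generate function; the Generate type and generateWithFallback are hypothetical names, and the stub stands in for a real client purely for demonstration.

```typescript
// Hypothetical signature for whatever client call your pipeline uses.
type Generate = (model: string, prompt: string, maxTokens: number) => Promise<string>;

interface FallbackResult {
  text: string;
  modelUsed: string;
}

// Try the primary model under the cap; if the visible text is empty,
// retry once with the fallback model under the same cap.
async function generateWithFallback(
  generate: Generate,
  prompt: string,
  primaryModel: string,
  fallbackModel: string,
  cap: number,
): Promise<FallbackResult> {
  const primary = await generate(primaryModel, prompt, cap);
  if (primary.trim().length > 0) {
    return { text: primary, modelUsed: primaryModel };
  }
  const fallback = await generate(fallbackModel, prompt, cap);
  return { text: fallback, modelUsed: fallbackModel };
}

// Stub client for demonstration only: the primary model returns nothing.
const stubGenerate: Generate = async (model) =>
  model === 'gemma4:e4b' ? '' : 'rewritten query';

generateWithFallback(stubGenerate, 'example prompt', 'gemma4:e4b', 'gemma3:12b', 1024)
  .then((result) => console.log(`${result.modelUsed}: ${result.text}`));
```

The fallback is triggered by the empty-text signal rather than by a token count, since usage fields differ between providers.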

Configuration Template

// pipeline.config.ts
export const MODEL_BUDGET_REGISTRY: Record<string, number> = {
  'gemma4:e4b': 500,
  'gemma3:12b': 350,
  'llama3.1:8b': 400,
};

export const PIPELINE_STAGE_CONFIGS: StageConfig[] = [
  {
    stage: 'QUERY_REWRITE',
    minTokens: 600,
    maxTokens: 4096,
    timeoutMs: 8000,
  },
  {
    stage: 'PLANNING',
    minTokens: 700,
    maxTokens: 4096,
    timeoutMs: 10000,
  },
  {
    stage: 'CRITIQUE',
    minTokens: 650,
    maxTokens: 4096,
    timeoutMs: 9000,
  },
  {
    stage: 'FACT_CHECK',
    minTokens: 600,
    maxTokens: 4096,
    timeoutMs: 8000,
  },
];

export function validatePipelineConfig(model: string): void {
  const floor = MODEL_BUDGET_REGISTRY[model];
  if (floor === undefined) {
    throw new Error(`Unknown model: ${model}. Register reasoning floor first.`);
  }

  PIPELINE_STAGE_CONFIGS.forEach((stage) => {
    if (stage.maxTokens < floor) {
      throw new Error(
        `Stage ${stage.stage} budget (${stage.maxTokens}) below floor (${floor}).`
      );
    }
  });
}

Quick Start Guide

Identify your model's reasoning floor: Run a test script with max_tokens values of 200, 400, 600, and 1000. Record the lowest cap that produces visible output. This is your hidden reasoning floor.
Update stage configurations: Set minTokens to floor * 1.2 and maxTokens to 4096 for all pipeline stages. Apply the configuration template above.
Deploy validation layer: Integrate the TokenBudgetValidator into your pipeline orchestrator. Ensure it throws explicit errors if caps fall below the floor.
Run regression tests: Execute your standard query suite across all stages. Verify 100% success rates and monitor latency/cost deltas.
Monitor in production: Track token consumption per stage. Set alerts if average consumption exceeds maxTokens * 0.8 to prevent future budget starvation.
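
The alert rule in the last step can be expressed directly. This is a small sketch, assuming per-call consumption is already being collected per stage; shouldAlert is a hypothetical name, and the 0.8 default mirrors the maxTokens * 0.8 threshold mentioned above.

```typescript
// Hypothetical helper: alert when average consumption per call exceeds
// a fraction of the stage cap (default 80%, as in the step above).
function shouldAlert(consumedPerCall: number[], maxTokens: number, threshold = 0.8): boolean {
  if (consumedPerCall.length === 0) return false; // no data, no alert
  const total = consumedPerCall.reduce((sum, n) => sum + n, 0);
  const average = total / consumedPerCall.length;
  return average > maxTokens * threshold;
}

// With a 4096 cap the alert line sits at 3276.8 tokens.
console.log(shouldAlert([3000, 3400, 3600], 4096)); // average ~3333 is above the line
console.log(shouldAlert([500, 800], 4096));         // average 650 is well below it
```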

Gemma 4 Suddenly Stopped Answering: How an External Collaboration Found the Root Cause in 24 Hours