Gemma 4 Suddenly Couldn't Answer: How an External Collaboration Found the Root Cause in 24 Hours
Diagnosing Deterministic Empty Outputs in Multi-Stage LLM Pipelines: The Hidden Reasoning Floor
Current Situation Analysis
Multi-stage LLM workflows—particularly Graph-RAG systems, cognitive middleware, and agentic planners—frequently encounter a silent failure mode: specific pipeline stages return zero-length responses while others execute normally. Engineering teams typically interpret this as a model capability limitation, prompt engineering flaw, or architectural incompatibility (e.g., Dense vs. MoE variants). In reality, the failure is almost always a deterministic budget constraint masquerading as a reasoning deficit.
The problem is systematically overlooked because modern LLM providers abstract token consumption behind simple max_tokens parameters. Developers assume that setting a cap guarantees termination, not realizing that many contemporary models allocate a fixed portion of that budget to internal reasoning before emitting a single visible token. When the cap falls below this internal threshold, the model terminates cleanly without output, producing an empty string. This behavior is highly deterministic, stage-agnostic, and completely reversible by adjusting the token budget.
Empirical evidence from production deployments confirms this pattern. In a documented Graph-RAG pipeline using gemma4:e4b (4B parameters), four cognitive stages (query rewriting, planning, critique, and fact-checking) exhibited inconsistent behavior: two stages returned valid outputs, while two consistently produced empty responses. Switching to a larger model (gemma3:12b) resolved the issue but masked the underlying mechanism. Cross-validation across three independent deployment contexts (local Ollama, managed Gemini API, and sovereign Ollama instances) isolated the root cause: a hidden reasoning floor of approximately 500 tokens. When max_tokens was set below this threshold, the model spent its entire budget on hidden reasoning and emitted no visible tokens. Raising the cap to 4096 restored 100% success rates across all environments without modifying prompts or architecture.
WOW Moment: Key Findings
The critical insight is that a reasoning-enabled model such as gemma4:e4b does not begin emitting visible text until it has spent its internal reasoning budget. This creates a hard floor that dictates the minimum token allocation per stage. Below is a comparative analysis of how different token configurations impact pipeline behavior:
| Approach | Success Rate | Avg Latency | Visible Output | Root Cause |
|---|---|---|---|---|
| Cap 200 tokens | 0% | 2.1s | Empty | Budget starvation; reasoning floor unmet |
| Cap 400 tokens | 0% | 4.3s | Empty | Linear latency scaling; still below 500-token floor |
| Cap 4096 tokens | 100% | 5.3s–7.1s | Full response | Reasoning floor satisfied; visible generation proceeds |
This finding matters because it shifts the debugging paradigm from prompt engineering and model selection to token budget accounting. Engineers can now predict empty responses directly: if max_tokens < hidden_reasoning_floor, the output will deterministically be empty. The ~500-token floor observed in gemma4:e4b represents a hidden-to-visible token ratio of roughly 5:1 to 6:1 for single-stage operations. This is a model-level characteristic, not a prompt artifact. Recognizing this enables precise budget allocation, eliminates guesswork in stage configuration, and prevents unnecessary model upgrades or prompt restructuring.
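The prediction rule can be written down as a small check. The sketch below is illustrative only: `willBeEmpty` and `visibleHeadroom` are hypothetical helper names, and the 500-token floor is the approximate value reported for gemma4:e4b, not an exact constant.

```typescript
// Hypothetical helpers: assume the model spends `reasoningFloor` tokens
// on hidden reasoning before it emits any visible text.

// True when the cap is below the floor, i.e. the rule max_tokens < floor.
function willBeEmpty(maxTokens: number, reasoningFloor: number): boolean {
  return maxTokens < reasoningFloor;
}

// Tokens left for visible text once the floor has been paid (never negative).
// A cap exactly at the floor leaves zero headroom, so treat it as borderline.
function visibleHeadroom(maxTokens: number, reasoningFloor: number): number {
  return Math.max(0, maxTokens - reasoningFloor);
}

// Caps from the table above, checked against the approximate floor.
const APPROX_FLOOR = 500; // assumption: approximate floor for gemma4:e4b
for (const cap of [200, 400, 4096]) {
  console.log(
    `cap=${cap} empty=${willBeEmpty(cap, APPROX_FLOOR)} ` +
      `headroom=${visibleHeadroom(cap, APPROX_FLOOR)}`
  );
}
```

With these inputs, the two low caps are flagged as empty and only the 4096 cap leaves room for visible output, which matches the table.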
Core Solution
Resolving deterministic empty outputs requires a systematic approach to token budget measurement, per-stage configuration, and pipeline validation. The following implementation demonstrates how to instrument token consumption, enforce minimum floors, and maintain production stability.
Step 1: Instrument Token Consumption
Before adjusting caps, measure how many tokens the model consumes internally before emitting visible output. This requires capturing both the requested max_tokens and the actual completion tokens used.
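```typescript
// Assumption: `llmClient` stands in for your provider SDK wrapper. Its shape
// is declared here only so the snippet type-checks; adapt it to your client.
declare const llmClient: {
  generate(
    model: string,
    prompt: string,
    options: { maxTokens: number; temperature: number }
  ): Promise<{
    text: string;
    usage?: { totalTokens?: number; completionTokens?: number };
  }>;
};

interface StageMetrics {
  requestedCap: number;
  consumedTokens: number;
  visibleTokens: number;
  latencyMs: number;
  isEmpty: boolean;
}

async function measureReasoningFloor(
  model: string,
  prompt: string,
  testCaps: number[]
): Promise<StageMetrics[]> {
  const results: StageMetrics[] = [];
  for (const cap of testCaps) {
    const start = Date.now();
    const response = await llmClient.generate(model, prompt, {
      maxTokens: cap,
      temperature: 0.1,
    });
    const latency = Date.now() - start;
    results.push({
      requestedCap: cap,
      // On many providers totalTokens also counts prompt tokens.
      consumedTokens: response.usage?.totalTokens ?? 0,
      // completionTokens may include hidden reasoning tokens on some
      // providers, so treat this as an upper bound on visible tokens.
      visibleTokens: response.usage?.completionTokens ?? 0,
      latencyMs: latency,
      // The empty-text check is the signal that matters for this diagnosis.
      isEmpty: response.text.trim().length === 0,
    });
  }
  return results;
}
```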
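Once a sweep has run, the floor can be bracketed between the highest cap that still returned empty text and the lowest cap that produced visible output. The following is a minimal sketch under that assumption; `SweepResult`, `FloorBracket`, and `bracketReasoningFloor` are hypothetical names, and the sample data simply restates the three caps from the findings table.

```typescript
// Minimal subset of the per-cap measurements needed to bracket the floor.
interface SweepResult {
  requestedCap: number;
  isEmpty: boolean;
}

interface FloorBracket {
  highestEmptyCap: number | null; // largest cap that still produced no text
  lowestVisibleCap: number | null; // smallest cap that produced text
}

// Returns the interval in which the hidden reasoning floor must lie.
function bracketReasoningFloor(results: SweepResult[]): FloorBracket {
  let highestEmptyCap: number | null = null;
  let lowestVisibleCap: number | null = null;
  for (const r of results) {
    if (r.isEmpty) {
      if (highestEmptyCap === null || r.requestedCap > highestEmptyCap) {
        highestEmptyCap = r.requestedCap;
      }
    } else if (lowestVisibleCap === null || r.requestedCap < lowestVisibleCap) {
      lowestVisibleCap = r.requestedCap;
    }
  }
  return { highestEmptyCap, lowestVisibleCap };
}

// Caps and outcomes taken from the findings table (200 and 400 empty, 4096 full).
const tableSweep: SweepResult[] = [
  { requestedCap: 200, isEmpty: true },
  { requestedCap: 400, isEmpty: true },
  { requestedCap: 4096, isEmpty: false },
];
console.log(bracketReasoningFloor(tableSweep));
```

A wide bracket like this one is a cue to add intermediate caps to the sweep until the interval is narrow enough to set a margin against.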
Step 2: Implement Per-Stage Budget Configuration
Hardcoding global token limits causes cross-stage failures. Instead, define explicit budgets per pipeline stage, accounting for the model's hidden reasoning floor.
```typescript
type PipelineStage = 'QUERY_REWRITE' | 'PLANNING' | 'CRITIQUE' | 'FACT_CHECK';

interface StageConfig {
  stage: PipelineStage;
  minTokens: number; // Hidden reasoning floor + safety margin
  maxTokens: number; // Upper bound for visible output
  timeoutMs: number;
  fallbackModel?: string;
}

const STAGE_BUDGETS: Record<PipelineStage, StageConfig> = {
  QUERY_REWRITE: {
    stage: 'QUERY_REWRITE',
    minTokens: 600, // ~500 floor + 20% margin
    maxTokens: 4096,
    timeoutMs: 8000,
  },
  PLANNING: {
    stage: 'PLANNING',
    minTokens: 700, // Planning requires deeper reasoning
    maxTokens: 4096,
    timeoutMs: 10000,
  },
  CRITIQUE: {
    stage: 'CRITIQUE',
    minTokens: 650,
    maxTokens: 4096,
    timeoutMs: 9000,
  },
  FACT_CHECK: {
    stage: 'FACT_CHECK',
    minTokens: 600,
    maxTokens: 4096,
    timeoutMs: 8000,
  },
};
```
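The minTokens values above follow a "floor plus margin" pattern, which can be computed rather than typed by hand. This is a small sketch, assuming a percentage margin on top of the measured floor; `minTokensFor` is a hypothetical helper, and the 30% and 40% margins are simply what the 650 and 700 values work out to against a 500-token floor.

```typescript
// Hypothetical helper: floor plus a percentage safety margin, rounded up.
// Integer percentages keep the arithmetic free of floating-point drift.
function minTokensFor(floor: number, marginPct: number): number {
  return Math.ceil((floor * (100 + marginPct)) / 100);
}

const MEASURED_FLOOR = 500; // assumption: approximate floor for gemma4:e4b

console.log(minTokensFor(MEASURED_FLOOR, 20)); // query rewrite / fact check
console.log(minTokensFor(MEASURED_FLOOR, 30)); // critique
console.log(minTokensFor(MEASURED_FLOOR, 40)); // planning
```

Deriving the minimums this way means a re-measured floor only has to be updated in one place.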
Step 3: Enforce Budget Validation at Runtime
Prevent pipeline execution if the configured cap falls below the measured floor. This acts as a circuit breaker against deterministic empty outputs.
```typescript
// Uses the StageConfig interface defined in Step 2.
class TokenBudgetValidator {
  constructor(private model: string, private measuredFloor: number) {}

  // Throws if the stage cap cannot cover the model's hidden reasoning floor.
  validate(config: StageConfig): void {
    if (config.maxTokens < this.measuredFloor) {
      throw new Error(
        `Budget starvation detected for ${config.stage} on ${this.model}. ` +
        `Configured maxTokens (${config.maxTokens}) is below ` +
        `model reasoning floor (${this.measuredFloor}). ` +
        `Visible output will be deterministically empty.`
      );
    }
  }

  getRecommendedCap(): number {
    // 4096 provides sufficient headroom for reasoning + visible output
    return Math.max(this.measuredFloor * 8, 4096);
  }
}
```
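As a complement to the throwing validator, it can be useful to list every starved stage in one pass instead of stopping at the first failure. The sketch below is one possible shape for that; `StageBudget` and `findStarvedStages` are hypothetical names, and the misconfigured caps in the example are illustrative values, not figures from the original incident.

```typescript
// Minimal view of a stage budget for this check.
interface StageBudget {
  stage: string;
  maxTokens: number;
}

// Returns the names of all stages whose cap is below the measured floor.
function findStarvedStages(budgets: StageBudget[], measuredFloor: number): string[] {
  return budgets
    .filter((b) => b.maxTokens < measuredFloor)
    .map((b) => b.stage);
}

// Illustrative configuration: two stages with caps below a 500-token floor.
const exampleBudgets: StageBudget[] = [
  { stage: 'QUERY_REWRITE', maxTokens: 4096 },
  { stage: 'PLANNING', maxTokens: 4096 },
  { stage: 'CRITIQUE', maxTokens: 200 },
  { stage: 'FACT_CHECK', maxTokens: 400 },
];

const starved = findStarvedStages(exampleBudgets, 500);
if (starved.length > 0) {
  console.log(`Starved stages: ${starved.join(', ')}`);
}
```

Reporting all violations together tends to shorten the fix cycle, since a maintainer sees the full set of caps to raise in a single run.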
Architecture Decisions & Rationale
- Why 4096 tokens? It exceeds the ~500-token hidden floor by a factor of 8, ensuring ample space for visible generation while remaining cost-effective. It aligns with standard model context windows and avoids unnecessary overhead.
- Why per-stage budgets? Different cognitive operations require varying reasoning depths. Planning and critique typically consume more internal tokens than query rewriting. Uniform caps cause silent failures in deeper stages.
- Why explicit validation? Runtime checks catch configuration drift before deployment. The error message explicitly references the reasoning floor, guiding maintainers to the correct fix without requiring deep model internals knowledge.
- Why TypeScript? Strong typing prevents accidental cap misconfiguration across pipeline stages. Interface contracts enforce consistency in multi-developer environments.
Pitfall Guide
1. Misattributing Empty Outputs to Model Incompetence
Explanation: Teams often assume a 4B model lacks the capacity for complex reasoning when stages return empty strings. This leads to unnecessary model upgrades or prompt simplification.
Fix: Always verify token budgets before evaluating model capability. If max_tokens < ~500, the empty response is a budget artifact, not a capability limit.
2. Ignoring the Hidden Reasoning Floor
Explanation: Developers treat max_tokens as a visible output limit, unaware that models allocate a fixed portion to internal processing. This causes deterministic failures when caps are set too low.
Fix: Measure the hidden-to-visible token ratio empirically. Run controlled tests with incrementing caps until visible output appears. Document the floor per model.
3. Hardcoding Global Token Limits
Explanation: Applying a single max_tokens value across all pipeline stages ignores varying reasoning requirements. Shallow stages may succeed while deep stages fail silently.
Fix: Implement per-stage budget configuration. Audit each stage's token consumption independently and set minimums based on measured floors.
4. Confusing Latency Scaling with Model Speed
Explanation: Linear latency increases with higher token caps are often misinterpreted as model slowness. In reality, latency scales proportionally to the reasoning budget consumed.
Fix: Correlate latency with token consumption. If latency doubles when cap doubles but output remains empty, the model is burning budget on hidden reasoning, not generating text.
5. Overlooking Cross-Stage Budget Mismatches
Explanation: Pipeline orchestrators may pass inconsistent token limits between stages, causing intermittent failures that appear random.
Fix: Centralize budget configuration in a single source of truth. Validate all stages against the model's minimum floor before pipeline initialization.
6. Skipping Regression Validation After Cap Adjustments
Explanation: Raising token caps can inadvertently increase costs or latency. Teams often deploy changes without verifying downstream impact.
Fix: Run deterministic regression suites post-change. Monitor cost-per-query and latency percentiles. Implement budget alerts if consumption exceeds thresholds.
7. Assuming Architecture Differences Cause Behavior Variance
Explanation: Dense vs. MoE models are frequently blamed for inconsistent outputs. However, a cap below the reasoning floor starves any model that reasons before answering, regardless of topology.
Fix: Isolate variables by testing the same model across different caps before comparing architectures. Token budget is the primary determinant of empty outputs, not model topology.
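The latency check from pitfall 4 can be automated. The sketch below flags a sweep as likely budget starvation when every response is empty and the milliseconds-per-token rate stays roughly constant across caps. `LatencySample` and `looksLikeBudgetStarvation` are hypothetical names, the 25% tolerance is an arbitrary assumption, and the sample values restate the 200-token and 400-token rows of the findings table.

```typescript
interface LatencySample {
  cap: number; // requested max_tokens
  latencyMs: number;
  isEmpty: boolean;
}

// Heuristic: all outputs empty AND latency per requested token roughly constant,
// which suggests the model is spending the whole cap on hidden reasoning.
function looksLikeBudgetStarvation(samples: LatencySample[], tolerance = 0.25): boolean {
  if (samples.length < 2) return false;
  if (samples.some((s) => !s.isEmpty || s.cap <= 0 || s.latencyMs <= 0)) return false;
  const msPerToken = samples.map((s) => s.latencyMs / s.cap);
  const min = Math.min(...msPerToken);
  const max = Math.max(...msPerToken);
  return (max - min) / min <= tolerance;
}

// Rows from the findings table: 2.1 s at cap 200, 4.3 s at cap 400, both empty.
const observed: LatencySample[] = [
  { cap: 200, latencyMs: 2100, isEmpty: true },
  { cap: 400, latencyMs: 4300, isEmpty: true },
];
console.log(looksLikeBudgetStarvation(observed));
```

This is a screening heuristic rather than proof; a positive result is a reason to raise the cap and re-run, not a replacement for measuring the floor.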
Production Bundle
Action Checklist
- Measure hidden reasoning floor: Run controlled tests with incrementing max_tokens (200, 400, 600, 1000) to identify the threshold where visible output appears.
- Audit all pipeline stages: Verify that every cognitive stage's max_tokens configuration exceeds the measured floor by at least 20%.
- Implement runtime validation: Add budget checks that prevent pipeline execution if caps fall below the reasoning floor.
- Centralize configuration: Move token budgets to a single configuration file or environment variable set to prevent drift.
- Run regression suite: Execute deterministic test queries across all stages post-change to confirm 100% success rates.
- Monitor cost and latency: Track token consumption and response times to ensure budget increases remain within acceptable operational bounds.
- Document model-specific floors: Maintain a registry of hidden reasoning thresholds per model variant to guide future deployments.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Low-latency batch processing | Cap at 2048 tokens | Balances reasoning floor compliance with throughput requirements | Moderate increase (~15-20%) |
| High-reasoning agent workflow | Cap at 4096 tokens | Ensures deep cognitive stages complete without starvation | Higher increase (~30-40%) |
| Mixed-model pipeline | Per-stage dynamic budgets | Prevents cross-stage failures while optimizing per-model floors | Variable; optimized per stage |
| Cost-constrained deployment | Cap at 1024 tokens + fallback model | Meets minimum floor while capping spend; triggers fallback if budget exhausted | Minimal increase; predictable |
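The "cap plus fallback model" row can be sketched as a thin wrapper around whatever generation call the pipeline already uses. This is one possible shape, not a prescribed implementation: `Generate`, `GenResult`, and `generateWithFallback` are hypothetical names, and an empty response is treated here as the signal that the budget was exhausted.

```typescript
interface GenResult {
  text: string;
}

// Assumption: the caller injects its own provider call with this signature.
type Generate = (model: string, prompt: string, maxTokens: number) => Promise<GenResult>;

// Tries the primary model first; if the visible text is empty, retries once
// on the fallback model with the same cap and reports which model answered.
async function generateWithFallback(
  generate: Generate,
  prompt: string,
  primaryModel: string,
  fallbackModel: string,
  maxTokens: number
): Promise<{ model: string; text: string }> {
  const primary = await generate(primaryModel, prompt, maxTokens);
  if (primary.text.trim().length > 0) {
    return { model: primaryModel, text: primary.text };
  }
  const fallback = await generate(fallbackModel, prompt, maxTokens);
  return { model: fallbackModel, text: fallback.text };
}
```

Injecting the generation function keeps the wrapper testable with a stub and independent of any particular SDK. Note that the fallback result can still be empty if the fallback model's own floor exceeds the cap, so the caller should check the returned text.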
Configuration Template
```typescript
// pipeline.config.ts
// Uses the StageConfig interface defined in Step 2.

// Measured hidden reasoning floor (in tokens) per model.
export const MODEL_BUDGET_REGISTRY: Record<string, number> = {
  'gemma4:e4b': 500,
  'gemma3:12b': 350,
  'llama3.1:8b': 400,
};

export const PIPELINE_STAGE_CONFIGS: StageConfig[] = [
  {
    stage: 'QUERY_REWRITE',
    minTokens: 600,
    maxTokens: 4096,
    timeoutMs: 8000,
  },
  {
    stage: 'PLANNING',
    minTokens: 700,
    maxTokens: 4096,
    timeoutMs: 10000,
  },
  {
    stage: 'CRITIQUE',
    minTokens: 650,
    maxTokens: 4096,
    timeoutMs: 9000,
  },
  {
    stage: 'FACT_CHECK',
    minTokens: 600,
    maxTokens: 4096,
    timeoutMs: 8000,
  },
];

export function validatePipelineConfig(model: string): void {
  const floor = MODEL_BUDGET_REGISTRY[model];
  // Compare against undefined so a registered floor of 0 is not rejected.
  if (floor === undefined) {
    throw new Error(`Unknown model: ${model}. Register reasoning floor first.`);
  }
  PIPELINE_STAGE_CONFIGS.forEach((stage) => {
    if (stage.maxTokens < floor) {
      throw new Error(
        `Stage ${stage.stage} budget (${stage.maxTokens}) below floor (${floor}).`
      );
    }
  });
}
```
Quick Start Guide
- Identify your model's reasoning floor: Run a test script with max_tokens values of 200, 400, 600, and 1000. Record the lowest cap that produces visible output. This is your hidden reasoning floor.
- Update stage configurations: Set minTokens to floor * 1.2 and maxTokens to 4096 for all pipeline stages. Apply the configuration template above.
- Deploy validation layer: Integrate the TokenBudgetValidator into your pipeline orchestrator. Ensure it throws explicit errors if caps fall below the floor.
- Run regression tests: Execute your standard query suite across all stages. Verify 100% success rates and monitor latency/cost deltas.
- Monitor in production: Track token consumption per stage. Set alerts if average consumption exceeds maxTokens * 0.8 to prevent future budget starvation.
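The production alert in the last step reduces to a simple average-versus-threshold comparison. A minimal sketch, assuming per-call token counts are already being collected; `shouldAlert` is a hypothetical name and the sample counts are illustrative.

```typescript
// Returns true when average consumption exceeds `ratio` of the stage cap.
// The default ratio of 0.8 follows the threshold suggested in the guide.
function shouldAlert(consumedPerCall: number[], maxTokens: number, ratio = 0.8): boolean {
  if (consumedPerCall.length === 0) return false; // nothing observed yet
  const total = consumedPerCall.reduce((sum, n) => sum + n, 0);
  const average = total / consumedPerCall.length;
  return average > maxTokens * ratio;
}

// Illustrative samples for a stage capped at 4096 tokens.
console.log(shouldAlert([3500, 3400, 3300], 4096)); // average is near the cap
console.log(shouldAlert([500, 700], 4096)); // average is well below the cap
```

An average is the simplest signal; a percentile-based check may be a better fit if consumption varies widely between queries.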
