Laravel Horizon in Production: Configuring AI Queue Workloads That Actually Hold
Architecting Resilient LLM Pipelines in Laravel: Queue Supervisor Tuning for Long-Running Inference
Current Situation Analysis
Traditional queue architectures were engineered for deterministic, short-lived tasks. Email dispatches, image resizing, and database synchronization typically complete within milliseconds to a few seconds. Laravel Horizon inherits these assumptions by default: a 60-second execution window, three retry attempts with zero delay, and scaling logic driven by queue depth. When you introduce generative AI workloads, these defaults become operational liabilities.
LLM inference operates on fundamentally different timing characteristics. A claude-sonnet-4-6 request with a dense system prompt and extended context window frequently approaches 45 seconds before streaming begins. Batch summarization tasks routed through gemini-2.5-pro can easily exceed two minutes under concurrent load. OpenAI's gpt-4o exhibits similar variance depending on token volume and network routing. The mismatch between queue expectations and inference reality creates three critical failure patterns:
- Silent Process Termination: When Horizon's 60-second supervisor timeout triggers, the worker receives a `SIGKILL`. The operating system terminates the process immediately. No Laravel exception is caught, no `failed_jobs` record is created, and the job vanishes from observability. Teams report "disappearing jobs" because the failure occurs below the application layer.
- Rate Limit Budget Exhaustion: Provider `429 Too Many Requests` responses are transient scheduling signals, not application errors. Laravel's default retry behavior attempts immediate re-queuing. Without explicit backoff configuration, a single rate-limited request can burn through the entire retry budget in under 15 seconds, permanently failing a job that would have succeeded with a 30-second pause.
- Partial State Discard: Inference pipelines often perform expensive preprocessing, chunking, or context assembly before the API call. When a job fails mid-execution, standard failure handlers wipe the database record. For long-document processing, this means discarding 80% of the work and incurring full retry costs.
These issues are routinely overlooked because developers configure AI jobs using the same patterns as notification dispatchers. The queue system is treated as a black box rather than a resource scheduler that requires workload-specific tuning.
WOW Moment: Key Findings
The operational divergence between standard task queues and AI inference pipelines becomes quantifiable when measuring execution windows, retry behavior, scaling triggers, and failure recovery.
| Configuration Dimension | Standard Queue Defaults | AI-Optimized Horizon Setup | Operational Impact |
|---|---|---|---|
| Execution Window | 60 seconds | 240–300 seconds | Prevents silent SIGKILL termination during token streaming |
| Retry Strategy | 3 attempts, 0s delay | 5 attempts, exponential backoff (30–240s) | Preserves retry budget against transient 429 rate limits |
| Scaling Signal | Queue length (job count) | Queue wait time (seconds) | Aligns worker provisioning with actual latency, not arbitrary depth |
| Failure Recovery | Full state reset | Partial state preservation + error tagging | Reduces redundant compute costs and enables resume-capable pipelines |
| Process Manager Grace | 10 seconds (stopwaitsecs) | 360 seconds | Prevents deployment-time truncation of in-flight inference calls |
This comparison reveals that AI workloads require a scheduler, not just a queue. Time-based scaling catches latency spikes before they cascade into user-facing timeouts. Exponential backoff transforms rate limits from fatal errors into manageable scheduling delays. Preserving partial state converts expensive failures into recoverable checkpoints. The architectural shift moves from "fire-and-forget" to "state-aware execution."
Core Solution
Building a production-ready AI queue pipeline requires coordinated configuration across three layers: the Horizon supervisor pool, the underlying process manager, and the job class itself. Each layer enforces boundaries that protect inference workloads from queue system defaults.
Step 1: Isolate AI Workloads in a Dedicated Supervisor Pool
Mixing AI inference with email dispatches or webhook processing creates resource contention. A single long-running LLM call can block workers needed for time-sensitive notifications. The solution is a dedicated supervisor with time-based auto-scaling and extended execution windows.
```php
// config/horizon.php
return [
    'environments' => [
        'production' => [
            'supervisor-llm-pipeline' => [
                'connection' => 'redis',
                'queue' => ['inference-batch', 'inference-realtime', 'inference-async'],
                'balance' => 'auto',
                'autoScalingStrategy' => 'time',
                'minProcesses' => 4,
                'maxProcesses' => 16,
                'balanceMaxShift' => 3,
                'balanceCooldown' => 8,
                'timeout' => 300,
                'sleep' => 5,
                'tries' => 5,
                'nice' => 0,
            ],
            'supervisor-standard' => [
                'connection' => 'redis',
                'queue' => ['default', 'emails', 'webhooks'],
                'balance' => 'simple',
                'minProcesses' => 2,
                'maxProcesses' => 8,
                'timeout' => 60,
                'sleep' => 3,
                'tries' => 3,
            ],
        ],
    ],
];
```
Architecture Rationale:
- `autoScalingStrategy: 'time'` measures how long jobs sit in the queue before pickup. Queue length is misleading for AI workloads: three jobs waiting at 90 seconds each creates a 4.5-minute tail latency. Time-based scaling provisions workers based on actual user wait time.
- `balanceCooldown: 8` prevents thrashing. Inference workloads often arrive in bursts (e.g., batch document uploads). A 3-second cooldown causes the auto-balancer to over-provision, then rapidly scale down, wasting Redis connections and CPU cycles.
- `timeout: 300` establishes a hard ceiling. This is not a target execution time; it is a safety net. If jobs routinely approach 120 seconds, prompt optimization or context window reduction is required.
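With the pool defined, routing is a matter of dispatching to the right queue name. A minimal sketch (the `ExecuteModelInference` job appears later in this article; the interactive-versus-batch split and the `$isInteractive` flag are illustrative assumptions):

```php
use App\Jobs\Inference\ExecuteModelInference;

// Latency-sensitive requests go to the realtime queue; bulk document
// work goes to the batch queue served by the same supervisor pool.
$queue = $isInteractive ? 'inference-realtime' : 'inference-batch';

ExecuteModelInference::dispatch($taskId, 'gpt-4o', $payload)->onQueue($queue);
```

Keeping the queue decision at the dispatch site means the supervisor configuration never needs to change as new job types are added.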
Step 2: Align the Process Manager Grace Period
Horizon runs as a daemon. During deployments, the process manager (Supervisord, systemd, or PM2) sends a termination signal. If the grace period is shorter than Horizon's timeout, in-flight inference calls are killed mid-stream.
```ini
; /etc/supervisor/conf.d/laravel-horizon.conf
[program:horizon-worker]
process_name=%(program_name)s
command=php /var/www/app/artisan horizon
autostart=true
autorestart=true
user=www-data
redirect_stderr=true
stdout_logfile=/var/www/app/storage/logs/horizon-worker.log
stopwaitsecs=360
```
Architecture Rationale:
`stopwaitsecs` must exceed the Horizon `timeout` by at least 60 seconds. This guarantees that a worker processing a 240-second inference call can complete the request, persist results, and gracefully exit before the OS forces termination. Rolling deployments will no longer truncate active API calls.
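Where systemd manages Horizon instead of Supervisord, the analogous knob is `TimeoutStopSec`. A sketch of a unit file, with paths, user, and dependency names as assumptions to adapt:

```ini
# /etc/systemd/system/horizon.service
[Unit]
Description=Laravel Horizon
After=network.target redis.service

[Service]
User=www-data
Restart=always
ExecStart=/usr/bin/php /var/www/app/artisan horizon
# Ask Horizon to finish in-flight jobs, then wait up to 360s before SIGKILL.
ExecStop=/usr/bin/php /var/www/app/artisan horizon:terminate
TimeoutStopSec=360

[Install]
WantedBy=multi-user.target
```

The `horizon:terminate` command instructs Horizon to stop accepting new jobs and exit once current jobs complete, which pairs naturally with the extended stop timeout.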
Step 3: Design the Job Class for State Awareness and Rate Limit Resilience
The supervisor defines the outer boundary. The job class defines internal
behavior. AI inference jobs require explicit timeout declaration, exponential backoff, rate limit differentiation, and partial state preservation.
```php
<?php

namespace App\Jobs\Inference;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Queue\Middleware\RateLimited;
use Illuminate\Support\Facades\Log;
use App\Services\InferenceClient;
use App\Models\AnalysisTask;

class ExecuteModelInference implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public int $timeout = 240;

    public int $tries = 5;

    public array $backoff = [30, 60, 120, 180, 240];

    public function __construct(
        public readonly string $taskId,
        public readonly string $targetModel,
        public readonly array $payload,
    ) {}

    public function middleware(): array
    {
        return [new RateLimited('llm-inference-gateway')];
    }

    public function handle(InferenceClient $client): void
    {
        $task = AnalysisTask::findOrFail($this->taskId);

        try {
            $result = $client->generate(
                model: $this->targetModel,
                payload: $this->payload,
                timeout: $this->timeout
            );

            $task->update([
                'status' => 'completed',
                'output_text' => $result->text,
                'input_tokens' => $result->usage->promptTokens,
                'output_tokens' => $result->usage->completionTokens,
                'completed_at' => now(),
            ]);
        } catch (\Throwable $exception) {
            if ($this->isRateLimitSignal($exception)) {
                $delay = $this->backoff[$this->attempts() - 1] ?? 240;
                $this->release($delay);

                return;
            }

            Log::error('Inference execution failed', [
                'task_id' => $this->taskId,
                'attempt' => $this->attempts(),
                'model' => $this->targetModel,
                'error' => $exception->getMessage(),
            ]);

            throw $exception;
        }
    }

    public function failed(\Throwable $exception): void
    {
        AnalysisTask::where('id', $this->taskId)->update([
            'status' => 'failed',
            'failure_reason' => $exception->getMessage(),
            'partial_output' => $this->extractPartialState(),
            'failed_at' => now(),
        ]);

        Log::critical('Inference job exhausted retry budget', [
            'task_id' => $this->taskId,
            'model' => $this->targetModel,
        ]);
    }

    public function retryUntil(): \DateTime
    {
        return now()->addHours(4);
    }

    private function isRateLimitSignal(\Throwable $e): bool
    {
        $message = strtolower($e->getMessage());

        return str_contains($message, '429')
            || str_contains($message, 'rate_limit')
            || str_contains($message, 'too_many_requests');
    }

    private function extractPartialState(): ?string
    {
        // Retrieve cached chunks or streaming buffer if available
        return cache()->get("inference_partial_{$this->taskId}");
    }
}
```
Architecture Rationale:
- `$timeout = 240` sits below the supervisor's 300-second limit. This ensures Laravel can catch the timeout, log it, and trigger the `failed()` method instead of receiving an uncatchable `SIGKILL`.
- `$this->release()` is used for rate limits instead of throwing. Throwing records a failure immediately, while `release()` re-queues the job with a delay that respects the provider's recovery window. Released jobs still increment the attempts counter, which is why pairing `release()` with `retryUntil()` matters: when `retryUntil()` is defined, Laravel gives it precedence over `$tries`, so a burst of `429` releases is bounded by the time deadline rather than the attempt count.
- `retryUntil()` enforces a business deadline. Exponential backoff across five attempts can span hours. If the inference result is only valuable within a 4-hour window, this prevents wasteful retries on stale requests.
- `failed()` preserves partial state. Long-context jobs often cache intermediate chunks. Storing `partial_output` enables resume logic or manual inspection, reducing redundant API costs.
Step 4: Register Granular Rate Limiters
The RateLimited middleware requires a named limiter. Global limits work for single-tenant setups, but multi-tenant applications require scoped throttling to prevent noisy neighbors from blocking inference pipelines.
```php
// app/Providers/AppServiceProvider.php
use Illuminate\Cache\RateLimiting\Limit;
use Illuminate\Support\Facades\RateLimiter;

public function boot(): void
{
    RateLimiter::for('llm-inference-gateway', function (object $job) {
        $tenantScope = $job->tenantId ?? 'platform-wide';

        // Anthropic Tier 2: ~1,000 RPM | OpenAI Tier 3: ~5,000 RPM
        // Start conservative; adjust based on actual provider quota and cost targets.
        return Limit::perMinute(80)->by("tenant:{$tenantScope}");
    });
}
```
Architecture Rationale: Scoping by tenant isolates rate limit exhaustion. If one tenant triggers a burst, other tenants' inference jobs continue processing. The limit should align with your provider tier, but always leave headroom for retry backoff and network variance.
Pitfall Guide
1. Timeout Parity Trap
Explanation: Setting the job $timeout equal to or greater than the Horizon supervisor timeout guarantees silent termination. The OS kills the process before Laravel can execute exception handling.
Fix: Always set job $timeout to 80% of the supervisor limit. For a 300-second supervisor, use 240 seconds on the job.
2. Treating 429 as a Hard Failure
Explanation: Throwing an exception on rate limit responses consumes retry budget and triggers exponential backoff incorrectly. Rate limits are provider-side scheduling signals, not application bugs.
Fix: Use `$this->release($delay)` for `429` responses and define `retryUntil()`. Releasing re-queues the job after the provider's recovery window, and because `retryUntil()` takes precedence over `$tries`, rate-limited jobs are bounded by a time deadline rather than the attempt counter.
3. Queue Length Scaling Fallacy
Explanation: Scaling workers based on job count ignores execution duration. Three AI jobs waiting is trivial for email dispatch but catastrophic for inference.
Fix: Use autoScalingStrategy: time. Horizon will provision workers based on actual queue wait time, aligning capacity with latency requirements.
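To see what the time-based balancer sees, Horizon's wait-time estimate can be queried directly, e.g. from Tinker or a scheduled health check. This sketch assumes Horizon's `WaitTimeCalculator` (the service the dashboard itself reads); verify the class and the `connection:queue` key format against your installed Horizon version:

```php
use Laravel\Horizon\WaitTimeCalculator;

// Estimated seconds a newly dispatched job will wait on this queue.
$wait = app(WaitTimeCalculator::class)->calculate('redis:inference-batch');

if ($wait > 30) {
    // Latency is building faster than auto-scaling can absorb:
    // alert, shed load, or raise maxProcesses.
}
```

Wiring this into a scheduled command gives you an alert threshold independent of the dashboard.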
4. Silent Deployment Truncation
Explanation: Leaving stopwaitsecs at the default 10 seconds in Supervisord causes rolling deployments to kill in-flight inference calls. Users receive empty responses without error logs.
Fix: Set stopwaitsecs to supervisor_timeout + 60. Verify with a staging deployment that long-running jobs complete before the process exits.
5. State Wipe on Failure
Explanation: Standard failed() methods often reset status fields without preserving intermediate work. For expensive context assembly or chunking, this forces full recomputation.
Fix: Implement partial state caching during execution. Store intermediate results in Redis or a dedicated partial_output column. Restore them in failed() for audit or resume capabilities.
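One way to make the job's `extractPartialState()` helper useful is to write the streaming buffer to the cache as chunks arrive, so a mid-stream failure leaves something to recover. A sketch, assuming a streaming client that yields text chunks (the `$stream` shape and the one-hour TTL are illustrative assumptions):

```php
$buffer = '';

foreach ($stream as $chunk) {
    $buffer .= $chunk->text;

    // Key matches what extractPartialState() reads in the job class.
    cache()->put("inference_partial_{$taskId}", $buffer, now()->addHour());
}

// On success, drop the checkpoint so stale partials never leak into audits.
cache()->forget("inference_partial_{$taskId}");
```

Writing the buffer on every chunk is the simplest scheme; batching writes every N chunks reduces cache pressure at the cost of losing slightly more work on failure.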
6. Global Rate Limiter Bottlenecks
Explanation: Using a single global rate limiter in multi-tenant applications causes one tenant's burst to throttle all other tenants' inference pipelines.
Fix: Scope the limiter using by("tenant:{$id}"). Adjust limits per tier if you offer different SLAs.
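Tier-aware limits can live in the same limiter registration. A sketch, where the `Tenant` model and plan names are assumptions standing in for your own tenancy layer:

```php
use App\Models\Tenant;
use Illuminate\Cache\RateLimiting\Limit;
use Illuminate\Support\Facades\RateLimiter;

RateLimiter::for('llm-inference-gateway', function (object $job) {
    $plan = Tenant::find($job->tenantId)?->plan;

    // Higher tiers buy a larger slice of the provider quota; keep the sum
    // of all tenant limits below your actual provider RPM ceiling.
    return match ($plan) {
        'enterprise' => Limit::perMinute(200)->by("tenant:{$job->tenantId}"),
        'pro'        => Limit::perMinute(80)->by("tenant:{$job->tenantId}"),
        default      => Limit::perMinute(20)->by("tenant:{$job->tenantId}"),
    };
});
```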
7. Missing Idempotency Keys
Explanation: AI providers may process duplicate requests if network timeouts cause Laravel to retry. Without idempotency, you pay twice and generate conflicting outputs.
Fix: Generate a deterministic idempotency_key based on task hash and payload. Pass it to the provider API. Most modern LLM endpoints support idempotent retries.
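A deterministic key can be derived by hashing the task identifier together with a canonicalized payload. A minimal sketch (the function name and hashing scheme are illustrative, not a provider requirement):

```php
<?php

// Identical task + payload always yields the same key, so a provider
// that supports idempotent retries can deduplicate repeated requests.
function inferenceIdempotencyKey(string $taskId, array $payload): string
{
    // Sort top-level keys so semantically equal payloads hash identically.
    ksort($payload);

    return hash('sha256', $taskId . '|' . json_encode($payload));
}
```

Send the resulting key in whatever header or request field your provider documents for idempotent retries; support and naming vary by vendor, so check the provider's API reference.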
Production Bundle
Action Checklist
- Isolate AI queues in a dedicated Horizon supervisor with `autoScalingStrategy: 'time'`
- Set supervisor `timeout` to 300s and job `$timeout` to 240s to enable graceful error handling
- Configure `stopwaitsecs=360` in Supervisord to prevent deployment-time truncation
- Implement `$this->release()` for `429` responses instead of throwing exceptions
- Register tenant-scoped rate limiters in `AppServiceProvider` to prevent cross-tenant throttling
- Add `retryUntil()` to enforce business deadlines and prevent stale retries
- Preserve partial state in `failed()` methods to reduce redundant compute costs
- Attach idempotency keys to all provider API calls to prevent duplicate billing
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup / Low Volume | Single supervisor, global rate limit, 3 retries | Simplicity reduces operational overhead while validating product-market fit | Low infrastructure cost; acceptable retry waste |
| Multi-Tenant SaaS | Dedicated AI supervisor, tenant-scoped rate limits, 5 retries with backoff | Prevents noisy neighbor throttling and aligns scaling with actual latency | Moderate increase in Redis connections; reduced API waste from failed retries |
| Batch Processing / High Throughput | Time-based scaling, partial state caching, idempotency keys, 300s timeout | Handles burst uploads without blocking realtime queues; enables resume on failure | Higher worker count during peaks; significant savings from partial state reuse |
Configuration Template
```php
// config/horizon.php
return [
    'environments' => [
        'production' => [
            'supervisor-llm-pipeline' => [
                'connection' => 'redis',
                'queue' => ['inference-batch', 'inference-realtime'],
                'balance' => 'auto',
                'autoScalingStrategy' => 'time',
                'minProcesses' => 4,
                'maxProcesses' => 16,
                'balanceMaxShift' => 3,
                'balanceCooldown' => 8,
                'timeout' => 300,
                'sleep' => 5,
                'tries' => 5,
                'nice' => 0,
            ],
        ],
    ],
];
```

```ini
; /etc/supervisor/conf.d/laravel-horizon.conf
[program:horizon-worker]
process_name=%(program_name)s
command=php /var/www/app/artisan horizon
autostart=true
autorestart=true
user=www-data
redirect_stderr=true
stdout_logfile=/var/www/app/storage/logs/horizon-worker.log
stopwaitsecs=360
```
Quick Start Guide
- Install Horizon & Publish Config: Run `composer require laravel/horizon && php artisan horizon:install`. Open `config/horizon.php` and replace the default supervisor block with the AI-optimized template.
- Align Process Manager: Update your Supervisord or systemd unit file. Set `stopwaitsecs=360` and reload the service manager (`supervisorctl reread && supervisorctl update`).
- Create the Job Class: Generate a new job (`php artisan make:job ExecuteModelInference`). Implement the `$timeout`, `$tries`, `$backoff`, and `release()` pattern for rate limits. Register the `RateLimited` middleware.
- Deploy & Validate: Push to staging. Dispatch a test job with a large context window. Monitor Horizon's dashboard for queue wait time scaling. Verify that `429` responses trigger `release()` with the intended delay rather than hard failures. Confirm the `failed()` handler records partial state when a job times out.
