
Laravel Horizon in Production: Configuring AI Queue Workloads That Actually Hold

By Codcompass Team · 10 min read

Architecting Resilient LLM Pipelines in Laravel: Queue Supervisor Tuning for Long-Running Inference

Current Situation Analysis

Traditional queue architectures were engineered for deterministic, short-lived tasks. Email dispatches, image resizing, and database synchronization typically complete within milliseconds to a few seconds. Laravel Horizon inherits these assumptions by default: a 60-second execution window, three retry attempts with zero delay, and scaling logic driven by queue depth. When you introduce generative AI workloads, these defaults become operational liabilities.

LLM inference operates on fundamentally different timing characteristics. A claude-sonnet-4-6 request with a dense system prompt and extended context window frequently approaches 45 seconds before streaming begins. Batch summarization tasks routed through gemini-2.5-pro can easily exceed two minutes under concurrent load. OpenAI's gpt-4o exhibits similar variance depending on token volume and network routing. The mismatch between queue expectations and inference reality creates three critical failure patterns:

  1. Silent Process Termination: When Horizon's 60-second supervisor timeout triggers, the worker receives a SIGKILL. The operating system terminates the process immediately. No Laravel exception is caught, no failed_jobs record is created, and the job vanishes from observability. Teams report "disappearing jobs" because the failure occurs below the application layer.
  2. Rate Limit Budget Exhaustion: Provider 429 Too Many Requests responses are transient scheduling signals, not application errors. Laravel's default retry behavior attempts immediate re-queuing. Without explicit backoff configuration, even a five-attempt retry budget can be consumed by a single rate-limited request in under 15 seconds, permanently failing a job that would have succeeded after a 30-second pause.
  3. Partial State Discard: Inference pipelines often perform expensive preprocessing, chunking, or context assembly before the API call. When a job fails mid-execution, standard failure handlers wipe the database record. For long-document processing, this means discarding 80% of the work and incurring full retry costs.

These issues are routinely overlooked because developers configure AI jobs using the same patterns as notification dispatchers. The queue system is treated as a black box rather than a resource scheduler that requires workload-specific tuning.

WOW Moment: Key Findings

The operational divergence between standard task queues and AI inference pipelines becomes quantifiable when measuring execution windows, retry behavior, scaling triggers, and failure recovery.

| Configuration Dimension | Standard Queue Defaults | AI-Optimized Horizon Setup | Operational Impact |
| --- | --- | --- | --- |
| Execution Window | 60 seconds | 240–300 seconds | Prevents silent SIGKILL termination during token streaming |
| Retry Strategy | 3 attempts, 0s delay | 5 attempts, exponential backoff (30–240s) | Preserves retry budget against transient 429 rate limits |
| Scaling Signal | Queue length (job count) | Queue wait time (seconds) | Aligns worker provisioning with actual latency, not arbitrary depth |
| Failure Recovery | Full state reset | Partial state preservation + error tagging | Reduces redundant compute costs and enables resume-capable pipelines |
| Process Manager Grace | 10 seconds (stopwaitsecs) | 360 seconds | Prevents deployment-time truncation of in-flight inference calls |

This comparison reveals that AI workloads require a scheduler, not just a queue. Time-based scaling catches latency spikes before they cascade into user-facing timeouts. Exponential backoff transforms rate limits from fatal errors into manageable scheduling delays. Preserving partial state converts expensive failures into recoverable checkpoints. The architectural shift moves from "fire-and-forget" to "state-aware execution."

Core Solution

Building a production-ready AI queue pipeline requires coordinated configuration across three layers: the Horizon supervisor pool, the underlying process manager, and the job class itself. Each layer enforces boundaries that protect inference workloads from queue system defaults.

Step 1: Isolate AI Workloads in a Dedicated Supervisor Pool

Mixing AI inference with email dispatches or webhook processing creates resource contention. A single long-running LLM call can block workers needed for time-sensitive notifications. The solution is a dedicated supervisor with time-based auto-scaling and extended execution windows.

// config/horizon.php

return [
    'environments' => [
        'production' => [
            'supervisor-llm-pipeline' => [
                'connection'          => 'redis',
                'queue'               => ['inference-batch', 'inference-realtime', 'inference-async'],
                'balance'             => 'auto',
                'autoScalingStrategy' => 'time',
                'minProcesses'        => 4,
                'maxProcesses'        => 16,
                'balanceMaxShift'     => 3,
                'balanceCooldown'     => 8,
                'timeout'             => 300,
                'sleep'               => 5,
                'tries'               => 5,
                'nice'                => 0,
            ],

            'supervisor-standard' => [
                'connection'  => 'redis',
                'queue'       => ['default', 'emails', 'webhooks'],
                'balance'     => 'simple',
                'minProcesses'=> 2,
                'maxProcesses'=> 8,
                'timeout'     => 60,
                'sleep'       => 3,
                'tries'       => 3,
            ],
        ],
    ],
];

Architecture Rationale:

  • autoScalingStrategy: time measures how long jobs sit in the queue before pickup. Queue length is misleading for AI workloads: three jobs waiting at 90 seconds each creates a 4.5-minute tail latency. Time-based scaling provisions workers based on actual user wait time.
  • balanceCooldown: 8 prevents thrashing. Inference workloads often arrive in bursts (e.g., batch document uploads). A 3-second cooldown causes the auto-balancer to over-provision, then rapidly scale down, wasting Redis connections and CPU cycles.
  • timeout: 300 establishes a hard ceiling. This is not a target execution time; it is a safety net. If jobs routinely approach 120 seconds, prompt optimization or context window reduction is required.
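
With the pool defined, jobs must explicitly target its queues; anything dispatched without a queue name falls through to the standard supervisor and its 60-second ceiling. A minimal dispatch sketch, using the ExecuteModelInference job built in Step 3 (queue names match the supervisor-llm-pipeline block above):

// Route inference onto the dedicated pool. onQueue() must name a queue
// listed in the supervisor's 'queue' array, or the job runs under the
// wrong timeout profile.
ExecuteModelInference::dispatch($taskId, 'claude-sonnet-4-6', $payload)
    ->onQueue('inference-batch');

// Latency-sensitive, user-facing requests target the realtime queue.
ExecuteModelInference::dispatch($taskId, 'gpt-4o', $payload)
    ->onQueue('inference-realtime');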

Step 2: Align the Process Manager Grace Period

Horizon runs as a daemon. During deployments, the process manager (Supervisord, systemd, or PM2) sends a termination signal. If the grace period is shorter than Horizon's timeout, in-flight inference calls are killed mid-stream.

; /etc/supervisor/conf.d/laravel-horizon.conf

[program:horizon-worker]
process_name=%(program_name)s
command=php /var/www/app/artisan horizon
autostart=true
autorestart=true
user=www-data
redirect_stderr=true
stdout_logfile=/var/www/app/storage/logs/horizon-worker.log
stopwaitsecs=360

Architecture Rationale: stopwaitsecs must exceed the Horizon timeout by at least 60 seconds. This guarantees that a worker processing a 240-second inference call can complete the request, persist results, and gracefully exit before the OS forces termination. Rolling deployments will no longer truncate active API calls.
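
If Horizon runs under systemd instead of Supervisord, the equivalent knob is TimeoutStopSec, which sets how long systemd waits after SIGTERM before escalating to SIGKILL. A minimal unit sketch under the same path assumptions as the Supervisord example (adjust the php binary path to your host):

# /etc/systemd/system/horizon.service
[Unit]
Description=Laravel Horizon
After=network.target

[Service]
User=www-data
ExecStart=/usr/bin/php /var/www/app/artisan horizon
Restart=always
# Mirror stopwaitsecs: exceed the 300s Horizon timeout by a safety margin.
TimeoutStopSec=360

[Install]
WantedBy=multi-user.target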

Step 3: Design the Job Class for State Awareness and Rate Limit Resilience

The supervisor defines the outer boundary. The job class defines internal behavior. AI inference jobs require explicit timeout declaration, exponential backoff, rate limit differentiation, and partial state preservation.

<?php

namespace App\Jobs\Inference;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Queue\Middleware\RateLimited;
use Illuminate\Support\Facades\Log;
use App\Services\InferenceClient;
use App\Models\AnalysisTask;

class ExecuteModelInference implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    // 80% of the supervisor's 300s ceiling, so Laravel catches the timeout
    // itself instead of the worker receiving an uncatchable SIGKILL.
    public int $timeout = 240;
    public int $tries = 5;

    // Per-attempt delays (seconds) applied when an attempt throws.
    public array $backoff = [30, 60, 120, 180, 240];

    public function __construct(
        public readonly string $taskId,
        public readonly string $targetModel,
        public readonly array  $payload,
    ) {}

    public function middleware(): array
    {
        return [new RateLimited('llm-inference-gateway')];
    }

    public function handle(InferenceClient $client): void
    {
        $task = AnalysisTask::findOrFail($this->taskId);

        try {
            $result = $client->generate(
                model: $this->targetModel,
                payload: $this->payload,
                timeout: $this->timeout
            );

            $task->update([
                'status'        => 'completed',
                'output_text'   => $result->text,
                'input_tokens'  => $result->usage->promptTokens,
                'output_tokens' => $result->usage->completionTokens,
                'completed_at'  => now(),
            ]);

        } catch (\Throwable $exception) {
            if ($this->isRateLimitSignal($exception)) {
                // Treat 429 as a scheduling signal: re-queue with a delay
                // instead of recording an error. Released jobs still
                // increment attempts(), so retryUntil() below supplies the
                // effective retry budget.
                $delay = $this->backoff[$this->attempts() - 1] ?? 240;
                $this->release($delay);
                return;
            }

            Log::error('Inference execution failed', [
                'task_id' => $this->taskId,
                'attempt' => $this->attempts(),
                'model'   => $this->targetModel,
                'error'   => $exception->getMessage(),
            ]);

            throw $exception;
        }
    }

    public function failed(\Throwable $exception): void
    {
        AnalysisTask::where('id', $this->taskId)->update([
            'status'          => 'failed',
            'failure_reason'  => $exception->getMessage(),
            'partial_output'  => $this->extractPartialState(),
            'failed_at'       => now(),
        ]);

        Log::critical('Inference job exhausted retry budget', [
            'task_id' => $this->taskId,
            'model'   => $this->targetModel,
        ]);
    }

    public function retryUntil(): \DateTime
    {
        // When retryUntil() is defined, Laravel gives it precedence over
        // $tries, making the retry budget time-based: rate-limit releases
        // cannot exhaust a fixed attempt count before the deadline.
        return now()->addHours(4);
    }

    private function isRateLimitSignal(\Throwable $e): bool
    {
        $message = strtolower($e->getMessage());
        return str_contains($message, '429')
            || str_contains($message, 'rate_limit')
            || str_contains($message, 'too_many_requests');
    }

    private function extractPartialState(): ?string
    {
        // Retrieve cached chunks or streaming buffer if available
        return cache()->get("inference_partial_{$this->taskId}");
    }
}

Architecture Rationale:

  • $timeout = 240 sits below the supervisor's 300-second limit. This ensures Laravel can catch the timeout, log it, and trigger the failed() method instead of receiving an uncatchable SIGKILL.
  • $this->release() is used for rate limits instead of throwing. A thrown exception records a failed attempt; release() re-queues the job with a chosen delay, treating 429 as a scheduling event rather than an application failure. Released jobs still increment the attempt counter, which is why the time-based budget below matters.
  • retryUntil() enforces a business deadline. When defined, Laravel gives retryUntil() precedence over $tries, so rate-limit releases cannot exhaust a fixed attempt count. And if the inference result is only valuable within a 4-hour window, this prevents wasteful retries on stale requests.
  • failed() preserves partial state. Long context jobs often cache intermediate chunks. Storing partial_output enables resume logic or manual inspection, reducing redundant API costs.
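
The preserved partial_output also opens a resume path. A hedged sketch, assuming hypothetical model and payload columns on AnalysisTask and a resume_from payload key your InferenceClient understands (none of these are defined above):

// Re-dispatch a failed task, seeding the request with the preserved
// partial output so the model continues rather than starting over.
$task = AnalysisTask::findOrFail($taskId);

if ($task->status === 'failed' && $task->partial_output !== null) {
    ExecuteModelInference::dispatch(
        $task->id,
        $task->model, // hypothetical column
        array_merge($task->payload ?? [], [
            'resume_from' => $task->partial_output, // hypothetical key
        ]),
    )->onQueue('inference-batch');
}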

Step 4: Register Granular Rate Limiters

The RateLimited middleware requires a named limiter. Global limits work for single-tenant setups, but multi-tenant applications require scoped throttling to prevent noisy neighbors from blocking inference pipelines.

// app/Providers/AppServiceProvider.php

use Illuminate\Cache\RateLimiting\Limit;
use Illuminate\Support\Facades\RateLimiter;

public function boot(): void
{
    RateLimiter::for('llm-inference-gateway', function (object $job) {
        $tenantScope = $job->tenantId ?? 'platform-wide';
        
        // Anthropic Tier 2: ~1,000 RPM | OpenAI Tier 3: ~5,000 RPM
        // Start conservative; adjust based on actual provider quota and cost targets.
        return Limit::perMinute(80)->by("tenant:{$tenantScope}");
    });
}

Architecture Rationale: Scoping by tenant isolates rate limit exhaustion. If one tenant triggers a burst, other tenants' inference jobs continue processing. The limit should align with your provider tier, but always leave headroom for retry backoff and network variance.
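
Where plans carry different SLAs, the limiter callback can branch on tenant tier. A sketch assuming a hypothetical tenantTier property on the dispatched job (not part of the ExecuteModelInference class above):

RateLimiter::for('llm-inference-gateway', function (object $job) {
    $tenantScope = $job->tenantId ?? 'platform-wide';

    // Hypothetical per-plan budgets; align the split with your provider quota.
    $perMinute = match ($job->tenantTier ?? 'standard') {
        'enterprise' => 300,
        'growth'     => 120,
        default      => 40,
    };

    return Limit::perMinute($perMinute)->by("tenant:{$tenantScope}");
});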

Pitfall Guide

1. Timeout Parity Trap

Explanation: Setting the job $timeout equal to or greater than the Horizon supervisor timeout guarantees silent termination. The OS kills the process before Laravel can execute exception handling. Fix: Always set job $timeout to 80% of the supervisor limit. For a 300-second supervisor, use 240 seconds on the job.

2. Treating 429 as a Hard Failure

Explanation: Throwing an exception on rate limit responses records a failed attempt and treats a provider-side scheduling signal as an application bug. Fix: Use $this->release($delay) for 429 responses so you control the delay and respect the provider's recovery window, and pair it with retryUntil() so the retry budget is time-based, since released jobs still increment the attempt counter.

3. Queue Length Scaling Fallacy

Explanation: Scaling workers based on job count ignores execution duration. Three AI jobs waiting is trivial for email dispatch but catastrophic for inference. Fix: Use autoScalingStrategy: time. Horizon will provision workers based on actual queue wait time, aligning capacity with latency requirements.

4. Silent Deployment Truncation

Explanation: Leaving stopwaitsecs at the default 10 seconds in Supervisord causes rolling deployments to kill in-flight inference calls. Users receive empty responses without error logs. Fix: Set stopwaitsecs to supervisor_timeout + 60. Verify with a staging deployment that long-running jobs complete before the process exits.

5. State Wipe on Failure

Explanation: Standard failed() methods often reset status fields without preserving intermediate work. For expensive context assembly or chunking, this forces full recomputation. Fix: Implement partial state caching during execution. Store intermediate results in Redis or a dedicated partial_output column. Restore them in failed() for audit or resume capabilities.
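
Note that the job class above only reads the partial-state cache in extractPartialState(); something must write it during execution. A minimal write-side sketch, assuming your InferenceClient exposes a hypothetical onChunk streaming callback:

// Inside handle(): accumulate streamed chunks into the cache key that
// extractPartialState() later reads, so a timeout preserves partial work.
$buffer = '';

$result = $client->generate(
    model: $this->targetModel,
    payload: $this->payload,
    timeout: $this->timeout,
    onChunk: function (string $chunk) use (&$buffer): void { // hypothetical parameter
        $buffer .= $chunk;
        cache()->put("inference_partial_{$this->taskId}", $buffer, now()->addHours(6));
    },
);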

6. Global Rate Limiter Bottlenecks

Explanation: Using a single global rate limiter in multi-tenant applications causes one tenant's burst to throttle all other tenants' inference pipelines. Fix: Scope the limiter using by("tenant:{$id}"). Adjust limits per tier if you offer different SLAs.

7. Missing Idempotency Keys

Explanation: AI providers may process duplicate requests if network timeouts cause Laravel to retry. Without idempotency, you pay twice and generate conflicting outputs. Fix: Generate a deterministic idempotency key from the task ID, model, and payload hash. Pass it to the provider API where idempotency headers are supported, or use it to deduplicate dispatches on your own side, as in the sketch below.
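
A minimal key-generation sketch, assuming a hypothetical idempotencyKey parameter on InferenceClient::generate() (header names and support vary by provider):

// Same task + model + payload yields the same key across retries, so a
// provider or gateway honoring idempotency keys will not bill the call twice.
$idempotencyKey = hash(
    'sha256',
    $this->taskId . '|' . $this->targetModel . '|' . json_encode($this->payload)
);

$result = $client->generate(
    model: $this->targetModel,
    payload: $this->payload,
    timeout: $this->timeout,
    idempotencyKey: $idempotencyKey, // hypothetical parameter
);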

Production Bundle

Action Checklist

  • Isolate AI queues in a dedicated Horizon supervisor with autoScalingStrategy: time
  • Set supervisor timeout to 300s and job $timeout to 240s to enable graceful error handling
  • Configure stopwaitsecs=360 in Supervisord to prevent deployment-time truncation
  • Implement $this->release() for 429 responses instead of throwing exceptions
  • Register tenant-scoped rate limiters in AppServiceProvider to prevent cross-tenant throttling
  • Add retryUntil() to enforce business deadlines and prevent stale retries
  • Preserve partial state in failed() methods to reduce redundant compute costs
  • Attach idempotency keys to all provider API calls to prevent duplicate billing

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| Startup / Low Volume | Single supervisor, global rate limit, 3 retries | Simplicity reduces operational overhead while validating product-market fit | Low infrastructure cost; acceptable retry waste |
| Multi-Tenant SaaS | Dedicated AI supervisor, tenant-scoped rate limits, 5 retries with backoff | Prevents noisy neighbor throttling and aligns scaling with actual latency | Moderate increase in Redis connections; reduced API waste from failed retries |
| Batch Processing / High Throughput | Time-based scaling, partial state caching, idempotency keys, 300s timeout | Handles burst uploads without blocking realtime queues; enables resume on failure | Higher worker count during peaks; significant savings from partial state reuse |

Configuration Template

// config/horizon.php
return [
    'environments' => [
        'production' => [
            'supervisor-llm-pipeline' => [
                'connection'          => 'redis',
                'queue'               => ['inference-batch', 'inference-realtime'],
                'balance'             => 'auto',
                'autoScalingStrategy' => 'time',
                'minProcesses'        => 4,
                'maxProcesses'        => 16,
                'balanceMaxShift'     => 3,
                'balanceCooldown'     => 8,
                'timeout'             => 300,
                'sleep'               => 5,
                'tries'               => 5,
                'nice'                => 0,
            ],
        ],
    ],
];

; /etc/supervisor/conf.d/laravel-horizon.conf
[program:horizon-worker]
process_name=%(program_name)s
command=php /var/www/app/artisan horizon
autostart=true
autorestart=true
user=www-data
redirect_stderr=true
stdout_logfile=/var/www/app/storage/logs/horizon-worker.log
stopwaitsecs=360

Quick Start Guide

  1. Install Horizon & Publish Config: Run composer require laravel/horizon && php artisan horizon:install. Open config/horizon.php and replace the default supervisor block with the AI-optimized template.
  2. Align Process Manager: Update your Supervisord or systemd unit file. Set stopwaitsecs=360 and reload the service manager (supervisorctl reread && supervisorctl update).
  3. Create the Job Class: Generate a new job (php artisan make:job ExecuteModelInference). Implement the $timeout, $tries, $backoff, and release() pattern for rate limits. Register the RateLimited middleware.
  4. Deploy & Validate: Push to staging. Dispatch a test job with a large context window. Monitor Horizon's dashboard for queue wait time scaling. Verify that 429 responses trigger release() with the expected backoff delay. Confirm that on timeout the failed() handler persists partial state and a failed_jobs record is created.
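
The 429 path in step 4 can also be pinned down in a unit test. Recent Laravel releases (10.37+) ship queue-interaction fakes; a sketch, assuming $client is a test double whose generate() throws an exception containing "429":

// tests/Feature/ExecuteModelInferenceTest.php (sketch)
$job = new ExecuteModelInference($taskId, 'gpt-4o', $payload);
$job->withFakeQueueInteractions();

$job->handle($client); // stubbed client throws a 429-style exception

$job->assertReleased(delay: 30); // first step of the backoff ladder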