Streaming LLM Tokens to 10K Concurrent Users
Architecting High-Concurrency LLM Streams: Backpressure, Memory Budgeting, and Zero-Downtime SSE Delivery
Current Situation Analysis
Large language models generate output asynchronously, typically emitting tokens at 20–80 millisecond intervals. When you expose this generation process to end users via Server-Sent Events (SSE), each client maintains a long-lived HTTP connection that remains open until generation completes or times out. This architectural shift from request-response to persistent streaming introduces a fundamental mismatch: the producer (LLM inference engine) operates at a fixed, predictable cadence, while consumers (browsers, mobile apps, edge proxies) exhibit highly variable network conditions and processing speeds.
A common approach to proxying these streams treats SSE like a standard REST endpoint. Developers route tokens through unbounded queues or shared buffers, assuming the underlying TCP stack will naturally throttle delivery. This assumption collapses under concurrency. A single stalled mobile connection on a congested cellular network can accumulate tens of thousands of undelivered tokens. At an average token payload of ~40 bytes, 50,000 buffered tokens consume 2MB of heap space. Multiply this across hundreds of slow clients, and the application crosses into out-of-memory (OOM) territory long before CPU or network limits are reached.
This problem is frequently overlooked because local testing masks the issue. Development environments feature low-latency loops and fast consumers, allowing unbounded buffers to drain instantly. In production, however, network jitter, client-side JavaScript event loop blocking, and proxy timeouts create persistent backlogs. Without explicit backpressure mechanisms, the JVM heap becomes a leaky bucket. Garbage collection pauses increase as allocation rates spike, eventually triggering container orchestrators to kill the process. The result is a cascading failure where one slow client indirectly terminates thousands of healthy streams.
Empirical data from production deployments confirms the ceiling. On a standard 4GB container, JVM overhead, metaspace, and GC safety margins consume roughly 1.5GB, leaving ~2.5GB for application heap. Each SSE connection holds approximately 13KB of resident memory (coroutine stack, channel buffer, HTTP response frame, and routing metadata). The resident footprint alone is not the binding constraint; the per-token churn of short-lived strings and response frames drives allocation rates and GC pause times up with connection count, and production measurements put the practical maximum near ~12,000 concurrent streams on this profile. Targeting 8,000–10,000 connections preserves headroom for traffic bursts, GC compaction, and safe rolling deployments. Exceeding this threshold without architectural discipline guarantees instability.
WOW Moment: Key Findings
The difference between a fragile streaming proxy and a production-grade delivery system boils down to buffer topology and backpressure enforcement. The following comparison isolates the architectural patterns that determine system resilience under load.
| Architecture Pattern | Memory Trajectory | Backpressure Behavior | Failure Surface |
|---|---|---|---|
| Unbounded Fan-Out | Linear growth per stalled client | None | Cascading OOM, total service outage |
| Single Shared Queue | Fixed allocation | Head-of-line blocking | Global stall, all clients timeout |
| Per-Client Bounded Channel | Predictable ceiling | Immediate rejection | Isolated disconnect, SLA preserved |
This finding matters because it shifts the failure model from catastrophic to controlled. Bounded channels per subscriber transform a network anomaly into a graceful degradation event. When a client cannot consume at the generation rate, the buffer fills, the publisher detects the rejection, and the connection terminates cleanly. The remaining 9,999 streams continue uninterrupted. This isolation is the foundation of horizontal scaling: you can predict memory consumption, set accurate container limits, and deploy rolling updates without triggering mass disconnects.
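The rejection behavior in the third row is observable in a few lines. A minimal sketch using kotlinx.coroutines, with a deliberately tiny capacity of 4 to make the rejection visible:

```kotlin
import kotlinx.coroutines.channels.Channel

// With no consumer draining the channel, writes past the capacity fail
// immediately instead of growing the heap
fun main() {
    val bounded = Channel<String>(capacity = 4)
    repeat(6) { i ->
        val result = bounded.trySend("token-$i")
        println("token-$i accepted=${result.isSuccess}") // token-4 and token-5 print false
    }
}
```

The failed `trySend` result is the backpressure signal: the producer learns about the slow consumer at the moment of the write, not after the heap has already absorbed the backlog.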
Core Solution
Building a resilient LLM streaming proxy requires three coordinated mechanisms: non-blocking fan-out, explicit memory budgeting, and cooperative lifecycle management. The implementation leverages Kotlin coroutines and bounded channels to enforce backpressure at the application layer, decoupling the inference engine's output rate from client network conditions.
Step 1: Design the Fan-Out Topology
Each SSE connection receives a dedicated bounded channel. The upstream coroutine consumes tokens from the LLM provider and attempts to distribute them to all active subscribers. Using trySend ensures the publisher never blocks. If a subscriber's buffer is full, the send operation fails immediately, allowing the system to apply a backpressure policy without stalling the generation pipeline.
```kotlin
import kotlinx.coroutines.channels.Channel
import java.util.concurrent.CopyOnWriteArrayList

class TokenRelay(private val maxBuffer: Int = 64) {
    // CopyOnWriteArrayList tolerates concurrent subscribe/publish without explicit locks
    private val subscribers = CopyOnWriteArrayList<Channel<String>>()

    fun subscribe(): Channel<String> {
        val clientBuffer = Channel<String>(capacity = maxBuffer)
        subscribers.add(clientBuffer)
        return clientBuffer
    }

    fun unsubscribe(clientBuffer: Channel<String>) {
        subscribers.remove(clientBuffer)
        clientBuffer.close()
    }

    // trySend never suspends, so publish runs inline on the producer's dispatcher;
    // launching a coroutine per token would allow deliveries to reorder
    fun publish(token: String) {
        val stalled = subscribers.filter { it.trySend(token).isFailure }
        stalled.forEach { subscriber ->
            subscriber.close(DeliveryBackpressureException("Buffer full"))
            subscribers.remove(subscriber)
        }
    }

    // Used by the drain endpoint: pushes a sentinel payload that clients
    // interpret as "finish up and reconnect"
    fun broadcastReconnectEvent(reason: String) {
        publish("""{"event":"reconnect","reason":"$reason"}""")
    }

    fun shutdown() {
        subscribers.forEach { it.close() }
        subscribers.clear()
    }
}

class DeliveryBackpressureException(message: String) : Exception(message)
```
Architecture Rationale:

- Bounded capacity (32–128 slots): Matches typical network round-trip times. A 64-slot buffer holds ~2.5KB per connection, balancing memory efficiency with tolerance for micro-bursts.
- Non-blocking distribution: `trySend` prevents the upstream coroutine from waiting on slow consumers. The relay loop continues processing new tokens from the LLM regardless of client state.
- Automatic cleanup: Failed deliveries trigger channel closure and remove the subscriber from the active list, preventing memory leaks and zombie references.
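Feeding the relay from the inference side is a single collector loop. A minimal sketch, where `inferenceTokens` is a hypothetical Flow standing in for whatever token stream your LLM client exposes; only `relay.publish()` comes from the class above:

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.launch

// `inferenceTokens` is a placeholder for your LLM client's token stream
fun CoroutineScope.startProducer(relay: TokenRelay, inferenceTokens: Flow<String>) =
    launch(Dispatchers.IO) {
        inferenceTokens.collect { token ->
            relay.publish(token) // non-blocking: slow clients never stall this loop
        }
    }
```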
Step 2: Integrate with the HTTP Layer
The SSE endpoint reads from the assigned channel and writes directly to the HTTP response stream. Structured concurrency ties the subscription lifecycle to the request scope, ensuring coroutines cancel when the client disconnects or the server initiates a drain.
```kotlin
import io.ktor.http.*
import io.ktor.server.application.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import kotlinx.coroutines.flow.*

fun Application.configureStreamingRoutes(relay: TokenRelay) {
    routing {
        get("/stream/generation") {
            val clientChannel = relay.subscribe()
            call.response.headers.append(HttpHeaders.CacheControl, "no-cache")
            try {
                // respondTextWriter opens the response exactly once; the content
                // type is passed here because Ktor rejects setting it via headers
                call.respondTextWriter(contentType = ContentType.Text.EventStream) {
                    try {
                        clientChannel.consumeAsFlow().collect { token ->
                            write("data: $token\n\n")
                            flush() // deliver each token immediately, no batching
                        }
                    } catch (e: DeliveryBackpressureException) {
                        // The relay closed this channel because its buffer filled;
                        // tell the client why before the stream terminates
                        write("event: error\ndata: {\"reason\":\"backpressure\"}\n\n")
                        flush()
                    }
                }
            } finally {
                relay.unsubscribe(clientChannel)
            }
        }
    }
}
```
Why this structure works:

- `consumeAsFlow()` provides backpressure-aware iteration. The coroutine suspends when the channel is empty and resumes on new tokens.
- Explicit `flush()` calls ensure tokens reach the client immediately, respecting the 20–80ms generation cadence.
- The `finally` block guarantees cleanup, preventing orphaned channels from accumulating during client drops or network interruptions.
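For reference, a slow client on this route would see frames like these on the wire before its channel closes (token text illustrative; each SSE frame is terminated by a blank line):

```
data: Once

data:  upon

data:  a

event: error
data: {"reason":"backpressure"}
```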
Step 3: Enforce the Memory Budget
Predictable scaling requires accounting for every byte allocated per connection. The following breakdown reflects production measurements on a 4GB container running OpenJDK 17 with G1GC:
| Component | Allocation | Rationale |
|---|---|---|
| Coroutine Stack | 1–2 KB | Lightweight execution context, auto-scaled by runtime |
| Channel Buffer (64 slots) | ~2.5 KB | Fixed capacity, prevents unbounded growth |
| HTTP Response Frame | ~8 KB | Ktor/Netty internal write buffer |
| Routing Metadata | ~1 KB | Headers, correlation IDs, TLS session state |
| Total Per Connection | ~13 KB | Deterministic ceiling |
At 10,000 concurrent streams, total memory consumption sits near 130MB. This leaves ~2.3GB for JVM heap, GC overhead, and application logic. The math confirms that vertical scaling has a hard ceiling. Beyond 10K connections, horizontal pod autoscaling becomes mandatory. Increasing buffer sizes to "improve reliability" directly reduces concurrency capacity and increases GC pressure. The correct lever is pod replication, not buffer inflation.
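The budget arithmetic is small enough to encode as a guardrail in a capacity test. A minimal sketch using the figures from the table above:

```kotlin
// Per-connection figures from the budgeting table, in KB
fun main() {
    val perConnectionKb = 1.5 + 2.5 + 8.0 + 1.0 // stack + channel + frame + metadata = 13 KB
    val connections = 10_000
    val totalMb = perConnectionKb * connections / 1024
    println("Steady-state streaming footprint: %.0f MB".format(totalMb)) // prints ≈ 127 MB
}
```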
Pitfall Guide
1. Unbounded Queue Accumulation
Explanation: Using Channel(Channel.UNLIMITED) or ArrayList for token buffering allows memory to grow linearly with client latency. A single stalled connection can consume megabytes, triggering OOM kills.
Fix: Enforce bounded capacity (32–128). Treat buffer full as a signal to disconnect, not a reason to expand memory.
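The difference is a single constructor argument:

```kotlin
import kotlinx.coroutines.channels.Channel

// Bounded: trySend fails once 64 tokens are queued, so backpressure is visible
val safe = Channel<String>(capacity = 64)

// Unbounded: trySend always succeeds, and memory grows with the slowest client
val risky = Channel<String>(Channel.UNLIMITED)
```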
2. Blocking the Upstream Publisher
Explanation: Calling send() instead of trySend() suspends the relay coroutine until a slow client drains its buffer. This stalls token ingestion from the LLM, increasing time-to-first-token for all users.
Fix: Always use non-blocking send operations. Decouple ingestion rate from delivery rate.
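The two call sites look almost identical, which is why this pitfall is easy to miss. A contrast sketch, reusing `DeliveryBackpressureException` from the relay above:

```kotlin
import kotlinx.coroutines.channels.Channel

suspend fun deliver(subscriber: Channel<String>, token: String) {
    // BAD: suspends here until the slow client makes room, stalling ingestion
    // subscriber.send(token)

    // GOOD: returns immediately; a failed result is the disconnect signal
    if (subscriber.trySend(token).isFailure) {
        subscriber.close(DeliveryBackpressureException("Buffer full"))
    }
}
```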
3. Ignoring Pod Lifecycle Signals
Explanation: Kubernetes sends SIGTERM during rolling updates. Without a drain handler, the process terminates immediately, resetting TCP connections and leaving clients with broken streams and no retry guidance.
Fix: Implement a pre-stop hook that stops accepting new connections, broadcasts a reconnect event, and waits for a configurable deadline before force-closing.
4. Over-Provisioning Channel Capacity
Explanation: Setting channel capacity to 1024+ slots "just in case" inflates the per-connection buffer from ~2.5 KB to ~40 KB, multiplying per-connection memory several times over and cutting the achievable connection count on the same hardware by roughly an order of magnitude.
Fix: Size buffers based on network RTT and token frequency. 64 slots covers ~3–5 seconds of generation (see the sizing sketch below), which is sufficient for transient jitter.
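The sizing arithmetic, as a runnable rule of thumb:

```kotlin
// slots × token interval = how long a stalled client can lag before disconnect
fun jitterToleranceSeconds(slots: Int, tokenIntervalMs: Int): Double =
    slots * tokenIntervalMs / 1000.0

fun main() {
    println(jitterToleranceSeconds(64, 50))   // 3.2 s at a 50 ms cadence
    println(jitterToleranceSeconds(64, 80))   // 5.12 s at an 80 ms cadence
    println(jitterToleranceSeconds(1024, 50)) // 51.2 s — 16x the memory for jitter that never lasts this long
}
```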
5. Leaking Coroutines on Client Drop
Explanation: Failing to cancel the subscription coroutine when the HTTP connection closes leaves dangling channels in the relay list. Over time, the subscriber list grows, consuming memory and CPU during fan-out iterations.
Fix: Tie subscription lifecycle to request scope. Use try-finally blocks or structured concurrency to guarantee cleanup.
6. Hardcoding Drain Deadlines Without SLA Alignment
Explanation: Using a fixed 30-second drain window without considering average generation duration causes premature termination of long-running completions.
Fix: Align drain deadlines with your longest expected generation time plus network overhead. Expose the deadline as a configurable parameter and monitor completion rates.
7. Missing Backpressure Metrics
Explanation: Operating without visibility into buffer saturation means you cannot detect slow-client trends or capacity limits until production incidents occur.
Fix: Instrument channel rejection rates, active subscriber counts, and memory allocation per pod. Alert on rejection rate spikes exceeding 5% over a 5-minute window (a minimal instrumentation sketch follows below).
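The sketch below uses plain atomics; in production you would register equivalent counters and gauges with Micrometer or a Prometheus client. The names are illustrative, not an existing API:

```kotlin
import java.util.concurrent.atomic.AtomicLong

class RelayMetrics {
    val tokensPublished = AtomicLong()
    val deliveriesRejected = AtomicLong()
    val activeSubscribers = AtomicLong()

    // Lifetime ratio for illustration; alerting should use a windowed rate
    // (e.g., rejections over 5 minutes) computed by the metrics backend
    fun rejectionRate(): Double {
        val published = tokensPublished.get()
        return if (published == 0L) 0.0 else deliveriesRejected.get().toDouble() / published
    }
}
```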
Production Bundle
Action Checklist
- Define bounded channel capacity (32–128) based on target RTT and token frequency
- Replace all blocking `send()` calls with `trySend()` in the fan-out loop
- Implement structured concurrency scopes tied to HTTP request lifecycles
- Add a pre-stop drain handler that broadcasts reconnect events and enforces a deadline
- Configure JVM heap limits to reserve 30–40% for GC and metaspace overhead
- Instrument rejection rates, active connections, and per-pod memory allocation
- Validate horizontal scaling thresholds before reaching 80% of theoretical connection ceiling
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| < 5K concurrent streams | Single pod, 4GB RAM | Memory budget supports load with GC headroom | Baseline infrastructure cost |
| 5K–10K concurrent streams | Single pod, 4GB RAM + connection draining | Predictable memory, isolated backpressure | No additional compute cost |
| > 10K concurrent streams | Horizontal pod autoscaling (HPA) | Vertical ceiling reached; buffer inflation degrades GC | Increased pod count, higher cloud spend |
| High mobile/edge traffic | Smaller buffers (32 slots) + aggressive disconnect | Network volatility requires faster failure detection | Lower memory per connection, higher reconnect rate |
| Enterprise SLA (99.95%) | Dedicated relay pods + circuit breakers | Isolate streaming load from core API services | Architectural complexity, dedicated resource allocation |
Configuration Template
```yaml
# Kubernetes Deployment Snippet
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-stream-relay
spec:
  replicas: 3
  template:
    spec:
      # Pod-level field: must sit beside `containers`, not inside one
      terminationGracePeriodSeconds: 45
      containers:
        - name: relay-service
          image: myregistry/llm-relay:latest
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "4Gi"
              cpu: "2"
          lifecycle:
            preStop:
              httpGet:
                path: /internal/drain
                port: 8080
          env:
            - name: CHANNEL_CAPACITY
              value: "64"
            - name: DRAIN_DEADLINE_MS
              value: "30000"
            - name: JVM_OPTS
              value: "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xms2g -Xmx3g"
```
```kotlin
// Ktor Drain Endpoint Implementation
import io.ktor.server.application.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import kotlinx.coroutines.delay

fun Application.configureDrainEndpoint(relay: TokenRelay) {
    routing {
        get("/internal/drain") {
            // The kubelet blocks on this preStop request, so respond only after
            // the drain window elapses; responding first would trigger SIGTERM early
            relay.broadcastReconnectEvent(reason = "rolling_update")
            delay(30_000) // DRAIN_DEADLINE_MS; keep below terminationGracePeriodSeconds (45s)
            relay.shutdown()
            call.respondText("Drained")
        }
    }
}
```
Quick Start Guide

- Initialize the relay: Instantiate `TokenRelay` with a bounded capacity of 64. Register it as a singleton in your application context.
- Wire the SSE route: Expose `/stream/generation` so that it subscribes to the relay, sets the `text/event-stream` content type, and iterates over the channel using `consumeAsFlow()`.
- Connect the LLM producer: Pass incoming tokens from your inference client into `relay.publish(token)`. Ensure the producer runs on a dedicated IO dispatcher.
- Deploy with lifecycle hooks: Apply the Kubernetes pre-stop configuration to enable graceful draining. Set `terminationGracePeriodSeconds` to exceed your drain deadline.
- Validate under load: Use a synthetic client generator to open 8,000 concurrent connections (a minimal load-client sketch follows below). Monitor memory allocation, rejection rates, and GC pause times. Adjust buffer capacity or pod count if rejection rates exceed 2%.
