tokens are rejected immediately rather than blocking the upstream generator.
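import kotlinx.coroutines.channels.BufferOverflow
import kotlinx.coroutines.channels.Channel

enum class BackpressurePolicy { DROP_OLDEST, CLOSE_CLIENT }

class ClientSink(
    private val capacity: Int = 64,
    // Not private: the dispatcher reads the policy when a push is rejected.
    val policy: BackpressurePolicy = BackpressurePolicy.CLOSE_CLIENT
) {
    // DROP_OLDEST evicts the oldest buffered token on overflow, so trySend keeps succeeding.
    // Under CLOSE_CLIENT a full buffer makes trySend fail and the dispatcher closes the sink.
    private val buffer = Channel<String>(
        capacity,
        if (policy == BackpressurePolicy.DROP_OLDEST) BufferOverflow.DROP_OLDEST else BufferOverflow.SUSPEND
    )

    // trySend never suspends, so push does not need the suspend modifier.
    fun push(token: String): Boolean = buffer.trySend(token).isSuccess

    suspend fun drainTo(writer: suspend (String) -> Unit) {
        for (token in buffer) {
            writer(token)
        }
    }
    fun close() = buffer.close()
}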
The trySend method is critical. Unlike send, which suspends until space is available, trySend returns immediately. This prevents the upstream LLM consumer from being blocked by a single slow client. The BackpressurePolicy enum allows you to define SLA-driven behavior: either drop the oldest token to keep the connection alive, or terminate the connection to preserve system stability.
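The difference between rejecting and dropping is easiest to see on a tiny channel. The following is a minimal sketch, assuming kotlinx.coroutines is on the classpath; the helper names offerAll, drainAll, rejectingChannel, and droppingChannel are hypothetical and exist only for illustration.

```kotlin
import kotlinx.coroutines.channels.BufferOverflow
import kotlinx.coroutines.channels.Channel

// Hypothetical helper: tries to enqueue every token without suspending
// and reports which attempts were accepted.
fun offerAll(channel: Channel<String>, tokens: List<String>): List<Boolean> =
    tokens.map { channel.trySend(it).isSuccess }

// Hypothetical helper: removes whatever is currently buffered, without suspending.
fun drainAll(channel: Channel<String>): List<String> {
    val drained = mutableListOf<String>()
    while (true) {
        val next = channel.tryReceive().getOrNull() ?: break
        drained += next
    }
    return drained
}

// Default overflow behaviour (SUSPEND): trySend fails once the buffer is full.
fun rejectingChannel(capacity: Int): Channel<String> = Channel(capacity)

// DROP_OLDEST: trySend always succeeds and the oldest buffered token is evicted.
fun droppingChannel(capacity: Int): Channel<String> =
    Channel(capacity, BufferOverflow.DROP_OLDEST)
```

With a capacity of 2 and the tokens a, b, c, the rejecting channel accepts the first two and refuses the third, so its buffer holds a and b. The dropping channel accepts all three and ends up holding b and c.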
Step 2: Implement Non-Blocking Fan-Out
The dispatcher consumes tokens from the LLM provider and fans them out to all active sinks. The fan-out logic must be decoupled from the HTTP response lifecycle to avoid blocking the generator.
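import java.util.concurrent.ConcurrentHashMap

class TokenDispatcher {
    // Not private: the shutdown path reads activeSinks.values to drain every live sink.
    val activeSinks = ConcurrentHashMap<String, ClientSink>()

    fun register(clientId: String): ClientSink {
        val sink = ClientSink()
        activeSinks[clientId] = sink
        return sink
    }

    fun deregister(clientId: String) {
        activeSinks.remove(clientId)?.close()
    }

    // Offers the token to every sink. push relies on trySend, so one slow client cannot stall the loop.
    suspend fun broadcast(token: String) {
        activeSinks.values.forEach { sink ->
            val delivered = sink.push(token)
            if (!delivered) {
                handleBackpressure(sink)
            }
        }
    }

    private fun handleBackpressure(sink: ClientSink) {
        when (sink.policy) {
            // Closing ends the sink's drainTo loop; the request handler then deregisters it.
            BackpressurePolicy.CLOSE_CLIENT -> sink.close()
            // Keep the connection. Eviction only happens if the sink's channel uses BufferOverflow.DROP_OLDEST.
            BackpressurePolicy.DROP_OLDEST -> { }
        }
    }
}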
This design ensures that the LLM stream continues at its native pace. Failed deliveries are handled synchronously within the broadcast loop, but because trySend is non-blocking, the loop never stalls. The ConcurrentHashMap provides thread-safe access without requiring explicit locks, which is essential when thousands of coroutines register and deregister simultaneously.
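One detail the dispatcher leaves open is what happens to a sink after it is closed for backpressure: it stays in the map until the HTTP handler deregisters it. The sketch below shows one possible variant that evicts the sink in the same pass. It uses raw channels instead of ClientSink to stay self-contained, and EvictingFanOut with its methods is a hypothetical name, not part of the design above.

```kotlin
import kotlinx.coroutines.channels.Channel
import java.util.concurrent.ConcurrentHashMap

class EvictingFanOut(private val capacity: Int = 64) {
    private val sinks = ConcurrentHashMap<String, Channel<String>>()

    fun register(clientId: String): Channel<String> {
        val channel = Channel<String>(capacity)
        sinks[clientId] = channel
        return channel
    }

    // Offers the token to every sink without suspending. A sink that rejects it
    // is closed and removed; the ids of the evicted clients are returned.
    fun broadcast(token: String): List<String> {
        val evicted = mutableListOf<String>()
        // ConcurrentHashMap iterators are weakly consistent, so removing while iterating is safe.
        for ((clientId, channel) in sinks) {
            if (!channel.trySend(token).isSuccess) {
                channel.close()
                sinks.remove(clientId, channel)
                evicted += clientId
            }
        }
        return evicted
    }

    fun activeCount(): Int = sinks.size
}
```

Returning the evicted ids gives the caller a natural place to increment a backpressure rejection metric, which is one of the signals worth monitoring in production.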
Step 3: Calculate Real Concurrency Limits
Memory budgeting dictates your actual ceiling. On a standard 4GB container, the JVM reserves approximately 30–40% for metaspace, thread stacks, and GC overhead. This leaves roughly 2.5GB for application heap. The per-connection footprint breaks down as follows:
- Coroutine context & stack: ~1.5 KB
- Bounded channel (64 slots × ~40 bytes per string reference): ~2.5 KB
- HTTP response writer & TCP buffer: ~8 KB
- Connection metadata & routing state: ~1 KB
- Total per connection: ~13 KB
At 10,000 concurrent streams, total overhead is approximately 130 MB. This leaves ample headroom for JVM operations, token string allocation, and burst traffic. The practical limit on a 4GB instance is 10,000–12,000 connections. Attempting to exceed this by increasing buffer sizes or neglecting GC tuning will trigger compaction pauses and OOM kills. Horizontal scaling is the only sustainable path beyond this threshold.
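The arithmetic behind these figures can be kept next to the code so it is easy to rerun when a buffer size changes. This is a rough sketch only: the per-component numbers are the estimates listed above, the type and function names are hypothetical, and decimal units (1 MB = 1000 KB) are assumed to match the rounding in the text.

```kotlin
// Per-connection footprint in KB, using the estimates from the breakdown above.
data class ConnectionFootprint(
    val coroutineKb: Double = 1.5,
    val channelKb: Double = 2.5,
    val ioBuffersKb: Double = 8.0,
    val metadataKb: Double = 1.0
) {
    val totalKb: Double get() = coroutineKb + channelKb + ioBuffersKb + metadataKb
}

// Total connection overhead in MB (decimal units, 1 MB = 1000 KB).
fun connectionOverheadMb(
    connections: Int,
    footprint: ConnectionFootprint = ConnectionFootprint()
): Double = connections * footprint.totalKb / 1000.0

// Heap left for token strings, bursts, and GC headroom once connection overhead is subtracted.
fun remainingHeapMb(usableHeapMb: Double, connections: Int): Double =
    usableHeapMb - connectionOverheadMb(connections)
```

For 10,000 connections and a 2,500 MB usable heap, connectionOverheadMb gives 130 MB and remainingHeapMb gives 2,370 MB. That remainder is what token strings and traffic bursts have to fit into.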
Step 4: Orchestrate Graceful Termination
Kubernetes rolling deployments send SIGTERM to pods, which abruptly closes TCP connections if not handled. An SSE client receiving a reset connection has no way to distinguish between a network failure and a deployment. The solution is a structured drain coordinator that signals clients to reconnect before forcing closure.
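import java.time.Duration
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch

class DrainCoordinator(private val defaultTimeout: Duration = Duration.ofSeconds(30)) {
    suspend fun executeDrain(sinks: Collection<ClientSink>) {
        try {
            // coroutineScope returns only after its child finishes, so no separate join is needed.
            coroutineScope {
                launch {
                    // A sink that rejects the push (full or closed buffer) misses the notice.
                    sinks.forEach { it.push("event: reconnect\ndata: {\"reason\":\"deployment\"}\n\n") }
                    delay(defaultTimeout.toMillis())
                }
            }
        } finally {
            // Runs even if the drain is cancelled, so no sink is left open.
            sinks.forEach { it.close() }
        }
    }
}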
The coordinator pushes an application-defined reconnect event into every active buffer. SSE itself defines no such event, so clients must listen for it explicitly and then open a new connection, which should land on a healthy pod. Wrapping the wait in coroutineScope keeps the drain within structured concurrency: if the caller is cancelled before the timeout elapses, the child coroutines are cancelled with it instead of lingering as zombie work. A hard kill of the pod bypasses all of this, so the drain has to finish inside the termination grace period.
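Because the same buffer carries both plain tokens and the reconnect notice, whatever writes frames to the socket has to emit the notice as a named event rather than wrapping it in another data: line. The following is a minimal sketch of a frame builder; sseFrame is a hypothetical helper name, and the field names (event, retry, data) and the blank-line terminator are the ones used by the SSE wire format.

```kotlin
// Builds a single SSE frame. Optional fields are emitted only when provided.
fun sseFrame(data: String, event: String? = null, retryMs: Long? = null): String = buildString {
    if (event != null) append("event: ").append(event).append('\n')
    if (retryMs != null) append("retry: ").append(retryMs).append('\n')
    // Each line of a multi-line payload needs its own "data:" field,
    // otherwise the embedded newline would terminate the field early.
    data.split("\n").forEach { line -> append("data: ").append(line).append('\n') }
    // A blank line dispatches the event on the client.
    append('\n')
}
```

Calling sseFrame with the deployment payload and event = "reconnect" produces the same text the coordinator pushes. Splitting on newlines also guards against a generated token that happens to contain a line break.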
Pitfall Guide
1. Unbounded Buffer Accumulation
Explanation: Using Channel(Channel.UNLIMITED) or ArrayList for token storage allows slow clients to accumulate tens of thousands of tokens. At ~40 bytes per token, a single stalled connection can consume 2–5 MB of heap. Multiply this across hundreds of mobile clients, and the container exhausts memory within minutes.
Fix: Enforce hard capacity limits (32–128 slots) on all client channels. Treat buffer overflow as a backpressure signal, not an error condition.
2. Head-of-Line Blocking via Shared Queues
Explanation: Routing all tokens through a single shared channel forces every client to wait for the slowest consumer. The LLM generator suspends until the queue drains, increasing time-to-first-token (TTFT) for all users.
Fix: Fan-out to isolated per-client channels. Decouple the generator from consumer drain rates using non-blocking trySend or offer operations.
3. Ignoring JVM Heap Overhead
Explanation: Developers often calculate capacity based on total container memory (e.g., 4GB / 13KB ≈ 300K connections). This ignores metaspace, JIT compilation caches, thread stacks, and GC reservation. With 30–40% set aside for those, the usable heap is typically 60–70% of total memory.
Fix: Budget against usable heap, not total RAM. Reserve 30–40% for JVM internals. Use -XX:MaxDirectMemorySize and -XX:ReservedCodeCacheSize to cap non-heap allocations.
4. Abrupt Pod Termination
Explanation: If the process exits on SIGTERM without draining, its sockets close abruptly. SSE clients see a connection reset or a truncated stream and surface a generic network error, with no hint that an immediate retry would succeed. Users experience broken streams during routine deployments.
Fix: Implement a preStop lifecycle hook that triggers the drain coordinator. Inject a retry field in SSE events to standardize client-side reconnection behavior.
5. Leaked Coroutine Contexts
Explanation: Failing to tie SSE coroutines to the request lifecycle creates zombie tasks that continue consuming CPU and memory after client disconnect. Over time, this degrades throughput and increases GC pressure.
Fix: Wrap each connection handler in coroutineScope { } tied to the HTTP request. Cancel the scope explicitly on client disconnect or timeout. Use invokeOnCompletion to deregister sinks from the dispatcher.
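The cleanup half of this fix can be sketched in a few lines. The example assumes kotlinx.coroutines; launchClient is a hypothetical helper, the registry set stands in for the dispatcher's sink map, and awaitCancellation stands in for the real streaming loop.

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Job
import kotlinx.coroutines.awaitCancellation
import kotlinx.coroutines.launch

// Registers a client, launches its streaming coroutine in the given scope, and removes
// the registration whenever that coroutine ends: normally, by failure, or by cancellation.
fun CoroutineScope.launchClient(clientId: String, registry: MutableSet<String>): Job {
    registry += clientId
    val job = launch {
        // Stand-in for the streaming loop: suspends until the coroutine is cancelled.
        awaitCancellation()
    }
    // Completion handlers run on every terminal state, so the registry cannot leak entries.
    job.invokeOnCompletion { registry -= clientId }
    return job
}
```

Because the job is a child of the calling scope, cancelling the scope that represents the HTTP request cancels the stream and triggers the deregistration in one step. The registry should be a thread-safe set, since the completion handler may run on a different thread.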
6. Vertical Scaling Fallacy
Explanation: Increasing container size to 8GB or 16GB to support more connections delays the inevitable. Memory fragmentation, GC pause times, and network socket limits scale non-linearly. A single large pod becomes a blast radius for outages.
Fix: Design for horizontal scaling from day one. Use connection-count-based autoscaling metrics. Keep pods small, predictable, and easily replaceable.
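To make connection-count-based autoscaling concrete, the sketch below reproduces the core scaling rule used by the Kubernetes Horizontal Pod Autoscaler, which multiplies the current replica count by the ratio of observed to target metric and rounds up. It deliberately ignores the autoscaler's tolerance band, stabilization windows, and min/max replica bounds. The function name is hypothetical, and the metric (average open SSE connections per pod) is assumed to be exported by the service as a custom metric.

```kotlin
import kotlin.math.ceil

// desired = ceil(currentReplicas * currentMetric / targetMetric)
fun desiredReplicas(
    currentReplicas: Int,
    avgConnectionsPerPod: Double,
    targetConnectionsPerPod: Double
): Int {
    require(currentReplicas > 0) { "currentReplicas must be positive" }
    require(targetConnectionsPerPod > 0.0) { "targetConnectionsPerPod must be positive" }
    return ceil(currentReplicas * avgConnectionsPerPod / targetConnectionsPerPod).toInt()
}
```

With 3 pods averaging 9,000 connections each and a target of 6,000 per pod, the rule asks for 5 pods. Setting the target well below the per-pod ceiling leaves room for reconnect storms during deployments.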
7. Missing Client Retry Logic
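Explanation: Assuming every client reconnects on its own leads to silent failures. The browser EventSource API retries automatically, but general-purpose HTTP clients and hand-rolled stream readers do not, and no client reacts to the custom reconnect event unless it is written to.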
Fix: Always include retry: 3000 in the initial SSE stream. Document the reconnect event format. Provide client SDKs that handle exponential backoff and state resumption.
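For clients that do not get reconnection for free, the retry hint and the backoff schedule are the two pieces worth writing down. This is a minimal sketch with hypothetical function names; the 3,000 ms base mirrors the retry value above, while the 60,000 ms cap is an assumption chosen for illustration. Real clients usually add random jitter on top, which is omitted here to keep the output deterministic.

```kotlin
// Extracts the value of an SSE "retry:" line, or null if the line is not a valid retry field.
// Only ASCII digits are accepted, and a single leading space after the colon is skipped.
fun parseRetryMs(line: String): Long? {
    if (!line.startsWith("retry:")) return null
    val value = line.removePrefix("retry:").removePrefix(" ")
    return if (value.isNotEmpty() && value.all { it in '0'..'9' }) value.toLongOrNull() else null
}

// Delay before reconnect attempt N (0-based): base * 2^N, capped at maxMs.
fun backoffDelayMs(attempt: Int, baseMs: Long = 3_000, maxMs: Long = 60_000): Long {
    require(attempt >= 0) { "attempt must not be negative" }
    // Cap the shift so large attempt counts cannot overflow a Long.
    val factor = 1L shl minOf(attempt, 20)
    return minOf(baseMs * factor, maxMs)
}
```

A client would typically feed the parsed retry value in as baseMs and reset the attempt counter after the first successfully received event.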
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| < 5K concurrent streams | Single pod, bounded channels | Simpler ops, predictable memory | Low infrastructure cost |
| 5K–10K concurrent streams | Multi-pod cluster, connection-based HPA | Isolates blast radius, enables zero-downtime deploys | Moderate compute cost, higher network egress |
| > 10K concurrent streams | Regional edge proxies + central LLM gateway | Reduces latency, offloads fan-out to edge | High architectural complexity, increased CDN/proxy costs |
| Mobile/Unreliable networks | DROP_OLDEST policy + aggressive retry | Maintains connection, tolerates packet loss | Slightly higher token generation cost due to retries |
| Enterprise/Compliance | CLOSE_CLIENT policy + audit logging | Strict SLA enforcement, clear failure trails | Lower concurrency ceiling, higher support overhead |
Configuration Template
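// ktor-application.conf
ktor {
    deployment {
        port = 8080
        watch = []
    }
    application {
        modules = [ com.example.StreamingAppKt.module ]
        // The two keys below do not appear to be built-in Ktor settings:
        // treat them as application-defined and enforce them in the module.
        maxConnections = 10000
        connectionIdleTimeout = 300000
    }
}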
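// Application Module
fun Application.module() {
    val dispatcher = TokenDispatcher()
    val drainCoordinator = DrainCoordinator(Duration.ofSeconds(30))

    install(CallLogging) {
        level = Level.INFO
    }

    routing {
        get("/stream/{clientId}") {
            val clientId = call.parameters["clientId"]
                ?: return@get call.respondText("Missing ID", status = HttpStatusCode.BadRequest)
            val sink = dispatcher.register(clientId)
            call.respondBytesWriter(contentType = ContentType.Text.EventStream) {
                // ByteWriteChannel has no write(String); writeStringUtf8 is imported from io.ktor.utils.io.
                writeStringUtf8("retry: 3000\n\n")
                try {
                    sink.drainTo { token ->
                        // Frames pushed by the drain coordinator are already SSE-formatted; plain tokens are wrapped.
                        // Assumes generated tokens never start with "event:".
                        writeStringUtf8(if (token.startsWith("event:")) token else "data: $token\n\n")
                        flush()
                    }
                } finally {
                    dispatcher.deregister(clientId)
                }
            }
        }
    }

    // JVM shutdown hook: runs when the pod receives SIGTERM, after any Kubernetes preStop hook has finished
    Runtime.getRuntime().addShutdownHook(Thread {
        runBlocking {
            drainCoordinator.executeDrain(dispatcher.activeSinks.values)
        }
    })
}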
Quick Start Guide
- Initialize the dispatcher and drain coordinator in your application startup. Configure channel capacity based on your token loss tolerance (start with 64).
- Wire the SSE endpoint to register a new ClientSink per request, stream tokens via drainTo, and deregister on completion or disconnect.
- Attach a shutdown hook that triggers the drain coordinator. Ensure your Kubernetes deployment manifest includes a preStop sleep or HTTP call to allow the drain window to complete, and set terminationGracePeriodSeconds above the drain timeout (the Kubernetes default of 30 seconds equals the coordinator's default, leaving no margin before SIGKILL).
- Configure client-side retry logic to respect the retry field and handle event: reconnect payloads. Implement exponential backoff for reconnection attempts.
- Deploy and monitor connection count, heap usage, and backpressure rejection rates. Adjust autoscaling thresholds based on observed drain times and token throughput.