Decoupling Stateful Execution Loops: A Stateless Worker Pattern for High-Concurrency Event Processing

Current Situation Analysis

Real-time event processing systems—whether powering live game sessions, collaborative editing tools, or auction platforms—frequently begin with a coroutine-per-request execution model. This approach is attractive during early development: it abstracts asynchronous I/O into synchronous-looking code, requires minimal infrastructure, and allows non-engineering teams to script business logic quickly. However, this model carries a hidden scaling tax that only surfaces under sustained concurrency.

The fundamental misunderstanding lies in assuming that lightweight coroutines scale linearly with user count. In practice, coroutines are not free. Each suspended coroutine consumes stack memory, holds file descriptors, and participates in the runtime scheduler's run queue. When I/O latency increases—even marginally—the scheduler must manage thousands of suspended contexts simultaneously. This triggers context-switch storms, CPU steal time spikes, and allocator fragmentation. Teams typically misdiagnose these symptoms as database or cache bottlenecks, spending cycles tuning connection pools, disabling durability flags, or adding read replicas. The actual constraint is the execution boundary itself.

Empirical evidence from production deployments consistently shows this pattern. At approximately 300 concurrent sessions, Redis latency can jump from sub-millisecond to tens of milliseconds due to connection pool saturation. PostgreSQL autovacuum cycles, triggered by high-churn write patterns, introduce freeze windows of 400–600ms on INSERT operations. The combined effect degrades P99 latency past acceptable SLAs (often >1.5s). As concurrency scales to 3,000 sessions, runtime memory consumption can balloon to 8GB RSS, causing the memory allocator to stall and the scheduler to thrash. The system doesn't fail because the database is slow; it fails because the runtime is drowning in suspended state.

WOW Moment: Key Findings

The breakthrough occurs when you shift from a persistent, stateful execution model to an ephemeral, stateless worker pattern. By decoupling the I/O wait from the CPU scheduler and moving persistence to an async event stream, you eliminate context-switch overhead and bound memory usage per request.

Execution Model	P99 Latency	Context Switches/sec	Memory Footprint (3k sessions)	DB Write Pattern	Scaling Granularity
Coroutine-Per-Session	1.8 s	~20,000	8 GB RSS	Synchronous, fsync-bound	Coarse (process-level)
Stateless Short-Lived Worker	72 ms	<500	<120 MB RSS	Async, batched via event stream	Fine (pod-level, scale-to-zero)

This finding matters because it transforms a vertically constrained runtime into a horizontally elastic compute layer. The stateless worker model ensures that I/O latency never blocks CPU scheduling, memory usage remains predictable, and infrastructure costs align directly with active request volume rather than peak concurrency. It also enables eventual consistency patterns that are invisible to the client but drastically reduce write amplification.

Core Solution

The architecture replaces long-running coroutines with ephemeral worker processes that execute a single business logic cycle and terminate. The implementation spans five coordinated layers: request routing, worker execution, event ingestion, state materialization, and dynamic scaling.

Step 1: Define the Execution Boundary

Instead of spawning a coroutine that lives for the duration of a session, the system spawns a short-lived process per event. The process receives only the session identifier, loads compiled bytecode, executes the logic, emits an event, and exits. This eliminates scheduler contention and bounds memory to the lifecycle of a single request.

Step 2: Implement the Worker Runtime

Workers run in isolated containers optimized for cold-start speed. The runtime uses LuaJIT 2.1 paired with musl libc to minimize binary size. Business scripts are pre-compiled to bytecode during CI/CD, eliminating JIT compilation overhead at runtime.

Worker Script (TypeScript orchestrator + Lua worker logic)

// orchestrator.ts
import { spawn } from 'child_process';
import { createHash } from 'crypto';

export class SessionOrchestrator {
  private readonly workerBinary: string = '/opt/bin/session_proc';
  private readonly maxRetries: number = 2;

  async dispatch(sessionId: string, payload: Record<string, unknown>): Promise<void> {
    const idempotencyKey = createHash('sha256').update(`${sessionId}:${Date.now()}`).digest('hex').slice(0, 12);
    
    for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
      await this.executeWorker(sessionId, payload, idempotencyKey);
    }
  }

  private executeWorker(sessionId: string, payload: Record<string, unknown>, key: string): Promise<void> {
    return new Promise((resolve, reject) => {
      const proc = spawn(this.workerBinary, [sessionId, JSON.stringify(payload), key]);
      
      proc.on('close', (code) => {
        if (code === 0) resolve();
        else reject(new Error(`Worker exited with code ${code}`));
      });
      
      proc.on('error', reject);
    });
  }
}

-- session_proc.lua (compiled to bytecode)
local ffi = require("ffi")
local redis = require("resty.redis")
local http = require("resty.http")

local session_id = arg[1]
local payload = cjson.decode(arg[2])
local idempotency_key = arg[3]

-- Atomic state check via Redis
local red = redis:new()
red:set_timeout(50)
red:connect("redis.internal", 6379)
local result = red:evalsha(
  "return redis.call('SETNX', KEYS[1], ARGV[1])",
  1,
  "proc:" .. session_id .. ":lock",
  idempotency_key
)

if result == 0 then
  os.exit(0) -- Already processed
end

-- Execute business logic
local reward = calculate_reward(payload.difficulty, payload.player_level)

-- Emit event via HTTP fire-and-forget
local httpc = http.new()
httpc:request_uri(
  "http://kafka-proxy.internal:8080/topics/session_events",
  {
    method = "POST",
    body = cjson.encode({
      session_id = session_id,
      event_type = "reward_issued",
      payload = reward,
      ts = ngx.now()
    }),
    keepalive_timeout = 1000
  }
)

os.exit(0)

Architecture Rationale:

Fork-exec over threads: Eliminates shared memory contention and garbage collection pauses.
Pre-compiled bytecode: Removes JIT warmup latency, ensuring consistent cold-start times (<10ms).
Idempotency via Redis SETNX: Guarantees exactly-once semantics despite at-least-once delivery guarantees from the proxy.

Step 3: Route Traffic with Consistent Hashing

A reverse proxy (Envoy) sits between clients and workers. It hashes the session identifier to route requests to the same backend pod, maintaining logical affinity without server-side state. If a pod terminates mid-request, the proxy retries on a healthy node. The retry logic relies on the idempotency key to prevent duplicate processing.

Step 4: Decouple Persistence with an Event Stream

Synchronous database writes are replaced by a fire-and-forget HTTP POST to a Kafka REST proxy (Confluent 7.5). The proxy buffers writes for 10ms before flushing, achieving high throughput with minimal broker backpressure. A separate aggregator service consumes the stream and materializes the session state table every 15 minutes. This shifts the write pattern from high-churn synchronous inserts to batched, append-only operations, eliminating autovacuum freeze windows.

Step 5: Autoscale Dynamically

KEDA monitors the envoy_session_requests_per_second metric exported by the proxy. The Horizontal Pod Autoscaler scales worker pods based on a target of 500 RPS per pod, polling every 15 seconds. When traffic drops, pods scale to zero within 45 seconds, eliminating idle compute costs.

Pitfall Guide

1. Optimizing Storage Before Execution

Explanation: Teams often tune PostgreSQL fsync, adjust Redis pool sizes, or add connection proxies while the runtime scheduler is already thrashing. Storage optimizations cannot compensate for context-switch overhead or allocator stalls. Fix: Profile the scheduler run queue and context switch rate (vmstat 1, pidstat -w) before touching database configuration. If context switches exceed 5k/sec under load, refactor the execution model first.

2. Ignoring Context Switch Tax in Coroutine Runtimes

Explanation: Coroutines are cheap to create but expensive to suspend when I/O latency increases. Each suspension triggers a kernel context switch. At 3,000 sessions with 6ms I/O wait, you can easily exceed 20k switches/sec, saturating CPU steal time. Fix: Bound I/O wait per execution cycle. If latency exceeds 2ms, switch to a process-per-request model or use an async event loop with non-blocking I/O instead of coroutines.

3. Forcing Synchronous Writes in High-Churn Systems

Explanation: Writing to a relational database on every event creates write amplification, triggers autovacuum, and locks tables. Disabling fsync reduces durability guarantees and risks data loss during failovers. Fix: Decouple writes using an event stream. Accept eventual consistency for non-critical state. Materialize aggregates asynchronously to keep the primary write path under 15ms.

4. Misconfiguring Proxy Retry Policies for Stateless Workers

Explanation: Stateless workers exit after execution. If the proxy retries a request without idempotency checks, duplicate events corrupt state. Conversely, aggressive timeouts cause premature retries. Fix: Implement idempotency keys at the worker level. Configure the proxy with exponential backoff and a maximum retry window that aligns with worker lifecycle duration (<100ms).

5. Overlooking Bytecode Compilation Overhead at Cold Start

Explanation: Interpreting scripts on every worker spawn adds 5–15ms of latency. At scale, this compounds and violates P99 SLAs. Fix: Pre-compile all business logic to bytecode during CI/CD. Validate bytecode integrity via hash checks at startup. Never ship source code to production workers.

6. Assuming Eventual Consistency is Transparent to Clients

Explanation: Clients expect immediate feedback. If the materialized view updates every 15 minutes, users may see stale progress indicators. Fix: Return a deterministic acknowledgment immediately (e.g., event_queued: true). Use optimistic UI updates on the client side. Provide a separate polling endpoint for real-time state if required.

Production Bundle

Action Checklist

Profile scheduler metrics: Verify context switch rate and CPU steal time before optimizing storage.
Compile business logic to bytecode: Eliminate runtime interpretation overhead and standardize cold starts.
Implement idempotency keys: Use atomic Redis operations to prevent duplicate processing during proxy retries.
Decouple writes to an event stream: Replace synchronous DB inserts with async HTTP/Kafka ingestion.
Configure consistent hashing: Maintain session affinity at the proxy layer without server-side state.
Tune KEDA polling intervals: Align scaler checks with traffic patterns to avoid thrashing during spikes.
Validate consumer lag: Monitor aggregator throughput to ensure materialized views stay within acceptable freshness windows.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
<500 concurrent sessions, low I/O latency	Coroutine-per-request	Simpler codebase, lower infra overhead	Low (single process)
500–5,000 sessions, high I/O variance	Stateless short-lived workers	Eliminates scheduler thrashing, bounds memory	Medium (container orchestration)
>5,000 sessions, bursty traffic	Serverless functions (e.g., AWS Lambda)	Native scale-to-zero, no pod management	High (per-invocation pricing)
Strict ACID requirements per event	Synchronous DB writes + connection pooling	Guarantees immediate consistency	High (DB scaling, licensing)
Eventual consistency acceptable	Async event stream + materialized view	Maximizes write throughput, reduces DB load	Low-Medium (stream infra + aggregator)

Configuration Template

# keda-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: session-worker-scaler
spec:
  scaleTargetRef:
    name: session-worker-deployment
  pollingInterval: 15
  cooldownPeriod: 30
  minReplicaCount: 0
  maxReplicaCount: 1500
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: envoy_session_requests_per_second
        threshold: "500"
        query: sum(rate(envoy_http_downstream_rq_total{envoy_http_conn_manager_prefix="session_ingress"}[1m]))

---
# envoy-config.yaml (snippet)
static_resources:
  listeners:
    - name: session_listener
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: workers
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/process" }
                          route:
                            cluster: worker_pool
                            hash_policy:
                              - header: { header_name: "x-session-id" }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config: {}

Quick Start Guide

Containerize the worker: Build a multi-stage Docker image using luajit/luajit:2.1-alpine as the base. Pre-compile Lua scripts to bytecode, copy them into /opt/bin/, and set the entrypoint to execute the binary with arguments.
Deploy the proxy: Install Envoy with consistent hashing enabled on the session header. Configure upstream health checks with a 2-second interval and 2 retry attempts.
Initialize the event stream: Provision a Kafka topic with 6 partitions. Deploy the Confluent REST proxy and configure a 10ms write buffer. Verify ingestion via curl -X POST with a test payload.
Attach the scaler: Apply the KEDA ScaledObject manifest. Point it to your Prometheus instance. Validate scaling by generating synthetic traffic with wrk -t4 -c300 -d30s http://envoy:8080/process.
Monitor convergence: Track P99 latency, context switch rate, and Kafka consumer lag. Adjust the KEDA threshold and proxy retry window until P99 stabilizes below 100ms under load.

The Gamedev Server That Broke at 300 Concurrent Hunters and How We Fixed It