The Gamedev Server That Broke at 300 Concurrent Hunters and How We Fixed It
Decoupling Stateful Execution Loops: A Stateless Worker Pattern for High-Concurrency Event Processing
Current Situation Analysis
Real-time event processing systems, whether powering live game sessions, collaborative editing tools, or auction platforms, frequently begin with a coroutine-per-request execution model. This approach is attractive during early development: it abstracts asynchronous I/O into synchronous-looking code, requires minimal infrastructure, and allows non-engineering teams to script business logic quickly. However, this model carries a hidden scaling tax that only surfaces under sustained concurrency.
The fundamental misunderstanding lies in assuming that lightweight coroutines scale linearly with user count. In practice, coroutines are not free. Each suspended coroutine consumes stack memory, holds file descriptors, and participates in the runtime scheduler's run queue. When I/O latency increases, even marginally, the scheduler must manage thousands of suspended contexts simultaneously. This triggers context-switch storms, CPU steal time spikes, and allocator fragmentation. Teams typically misdiagnose these symptoms as database or cache bottlenecks, spending cycles tuning connection pools, disabling durability flags, or adding read replicas. The actual constraint is the execution boundary itself.
Empirical evidence from production deployments consistently shows this pattern. At approximately 300 concurrent sessions, Redis latency can jump from sub-millisecond to tens of milliseconds due to connection pool saturation. PostgreSQL autovacuum cycles, triggered by high-churn write patterns, introduce freeze windows of 400 to 600 ms on INSERT operations. The combined effect degrades P99 latency past acceptable SLAs (often >1.5s). As concurrency scales to 3,000 sessions, runtime memory consumption can balloon to 8GB RSS, causing the memory allocator to stall and the scheduler to thrash. The system doesn't fail because the database is slow; it fails because the runtime is drowning in suspended state.
WOW Moment: Key Findings
The breakthrough occurs when you shift from a persistent, stateful execution model to an ephemeral, stateless worker pattern. By decoupling the I/O wait from the CPU scheduler and moving persistence to an async event stream, you eliminate context-switch overhead and bound memory usage per request.
| Execution Model | P99 Latency | Context Switches/sec | Memory Footprint (3k sessions) | DB Write Pattern | Scaling Granularity |
|---|---|---|---|---|---|
| Coroutine-Per-Session | 1.8 s | ~20,000 | 8 GB RSS | Synchronous, fsync-bound | Coarse (process-level) |
| Stateless Short-Lived Worker | 72 ms | <500 | <120 MB RSS | Async, batched via event stream | Fine (pod-level, scale-to-zero) |
This finding matters because it transforms a vertically constrained runtime into a horizontally elastic compute layer. The stateless worker model ensures that I/O latency never blocks CPU scheduling, memory usage remains predictable, and infrastructure costs align directly with active request volume rather than peak concurrency. It also enables eventual consistency patterns that are invisible to the client but drastically reduce write amplification.
Core Solution
The architecture replaces long-running coroutines with ephemeral worker processes that execute a single business logic cycle and terminate. The implementation spans five coordinated layers: request routing, worker execution, event ingestion, state materialization, and dynamic scaling.
Step 1: Define the Execution Boundary
Instead of spawning a coroutine that lives for the duration of a session, the system spawns a short-lived process per event. The process receives only the session identifier, loads compiled bytecode, executes the logic, emits an event, and exits. This eliminates scheduler contention and bounds memory to the lifecycle of a single request.
Step 2: Implement the Worker Runtime
Workers run in isolated containers optimized for cold-start speed. The runtime uses LuaJIT 2.1 paired with musl libc to minimize binary size. Business scripts are pre-compiled to bytecode during CI/CD, so each worker skips source parsing and bytecode generation at startup.
Worker Script (TypeScript orchestrator + Lua worker logic)
// orchestrator.ts
import { spawn } from 'child_process';
import { createHash } from 'crypto';

export class SessionOrchestrator {
  private readonly workerBinary: string = '/opt/bin/session_proc';
  private readonly maxRetries: number = 2;

  async dispatch(sessionId: string, payload: Record<string, unknown>): Promise<void> {
    // One key per dispatch, reused across retries so the worker can deduplicate.
    const idempotencyKey = createHash('sha256').update(`${sessionId}:${Date.now()}`).digest('hex').slice(0, 12);
    let lastError: unknown;
    for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
      try {
        await this.executeWorker(sessionId, payload, idempotencyKey);
        return; // Success: stop retrying.
      } catch (err) {
        lastError = err; // Remember the failure and try again.
      }
    }
    throw lastError; // All attempts failed.
  }

  // Spawns one short-lived worker process and resolves when it exits with code 0.
  private executeWorker(sessionId: string, payload: Record<string, unknown>, key: string): Promise<void> {
    return new Promise((resolve, reject) => {
      const proc = spawn(this.workerBinary, [sessionId, JSON.stringify(payload), key]);
      proc.on('close', (code) => {
        if (code === 0) resolve();
        else reject(new Error(`Worker exited with code ${code}`));
      });
      proc.on('error', reject);
    });
  }
}
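-- session_proc.lua (compiled to bytecode)
-- Note: resty.redis, resty.http, and ngx.now() assume an OpenResty runtime
-- (for example the resty CLI); plain LuaJIT does not provide the ngx API.
local cjson = require("cjson")
local redis = require("resty.redis")
local http = require("resty.http")

local session_id = arg[1]
local payload = cjson.decode(arg[2])
local idempotency_key = arg[3]

-- Atomic state check via Redis
local red = redis:new()
red:set_timeout(50) -- milliseconds
local ok, err = red:connect("redis.internal", 6379)
if not ok then
  io.stderr:write("redis connect failed: ", tostring(err), "\n")
  os.exit(1) -- Non-zero exit lets the orchestrator retry
end

-- eval (not evalsha) because the script body is passed inline.
-- The key includes the idempotency key so that only retries of the same
-- dispatch are skipped, not every later event for the session.
local result, eval_err = red:eval(
  "return redis.call('SETNX', KEYS[1], ARGV[1])",
  1,
  "proc:" .. session_id .. ":" .. idempotency_key,
  idempotency_key
)
if not result then
  io.stderr:write("redis eval failed: ", tostring(eval_err), "\n")
  os.exit(1)
end
if result == 0 then
  os.exit(0) -- Already processed
end

-- Execute business logic
-- calculate_reward is assumed to be provided by the compiled business scripts.
local reward = calculate_reward(payload.difficulty, payload.player_level)

-- Emit event via HTTP fire-and-forget (the response is intentionally ignored)
local httpc = http.new()
httpc:request_uri(
  "http://kafka-proxy.internal:8080/topics/session_events",
  {
    method = "POST",
    body = cjson.encode({
      session_id = session_id,
      event_type = "reward_issued",
      payload = reward,
      ts = ngx.now()
    }),
    keepalive_timeout = 1000
  }
)
os.exit(0)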
Architecture Rationale:
- Fork-exec over threads: Eliminates shared memory contention and garbage collection pauses.
- Pre-compiled bytecode: Skips source parsing and bytecode generation at startup, ensuring consistent cold-start times (<10ms).
- Idempotency via Redis SETNX: Provides effectively-once processing on top of the at-least-once delivery guarantees from the proxy, because a retried request with the same key is detected and skipped.
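The idempotency point can be illustrated with a minimal sketch. It is not the article's worker code: `DedupeStore`, `processOnce`, the 60-second TTL, and the `eventId` parameter are hypothetical names and values, and the in-memory `Map` only stands in for a Redis set-if-absent operation. The sketch assumes the event identifier is stable across retries of the same request.

```typescript
// In-memory stand-in for a Redis set-if-absent with expiry (assumption: the
// real store would be Redis; this Map only mimics the semantics).
class DedupeStore {
  private readonly entries = new Map<string, number>(); // key -> expiry time (ms)

  // Returns true only for the first caller that claims `key` within the TTL.
  claim(key: string, ttlMs: number, nowMs: number): boolean {
    const expiry = this.entries.get(key);
    if (expiry !== undefined && expiry > nowMs) return false;
    this.entries.set(key, nowMs + ttlMs);
    return true;
  }
}

// Hypothetical handler: runs `work` at most once per (sessionId, eventId).
function processOnce(
  store: DedupeStore,
  sessionId: string,
  eventId: string,
  nowMs: number,
  work: () => void,
): boolean {
  const key = `proc:${sessionId}:${eventId}`;
  if (!store.claim(key, 60000, nowMs)) return false; // duplicate delivery
  work();
  return true;
}

// Usage: a proxy retry re-delivers the same event, but the work runs once.
const store = new DedupeStore();
let rewardsIssued = 0;
const issueReward = (): void => { rewardsIssued += 1; };
processOnce(store, "s-42", "evt-1", 0, issueReward);  // first delivery
processOnce(store, "s-42", "evt-1", 5, issueReward);  // retry, skipped
processOnce(store, "s-42", "evt-2", 10, issueReward); // new event, processed
```

Scoping the key to the event rather than to the whole session is a design choice in this sketch: it lets later events for the same session through while still rejecting duplicates.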
Step 3: Route Traffic with Consistent Hashing
A reverse proxy (Envoy) sits between clients and workers. It hashes the session identifier to route requests to the same backend pod, maintaining logical affinity without server-side state. If a pod terminates mid-request, the proxy retries on a healthy node. The retry logic relies on the idempotency key to prevent duplicate processing.
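To make the routing idea concrete, here is a minimal consistent-hash ring sketch. It is illustrative only and is not how Envoy implements its hashing load balancers; `HashRing`, the FNV-1a hash, the pod names, and the 100 virtual nodes per pod are all assumptions chosen for brevity.

```typescript
// FNV-1a 32-bit hash; any stable hash would do, this one is short.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

class HashRing {
  private ring: Array<{ point: number; node: string }> = [];
  private readonly vnodes: number;

  constructor(nodes: string[], vnodes: number = 100) {
    this.vnodes = vnodes;
    for (const node of nodes) this.add(node);
  }

  // Places `vnodes` virtual points for the node on the ring.
  add(node: string): void {
    for (let i = 0; i < this.vnodes; i++) {
      this.ring.push({ point: fnv1a(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  remove(node: string): void {
    this.ring = this.ring.filter((entry) => entry.node !== node);
  }

  // Finds the first virtual point at or after the key's hash, wrapping around.
  lookup(sessionId: string): string {
    if (this.ring.length === 0) throw new Error("no nodes in ring");
    const h = fnv1a(sessionId);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ring[mid].point < h) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].node;
  }
}

// Usage: the same session ID always resolves to the same pod.
const ring = new HashRing(["pod-a", "pod-b", "pod-c"]);
const assignedPod = ring.lookup("session-123");
```

The property that matters for retries is that removing a pod only remaps the sessions that were assigned to it; every other session keeps its pod.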
Step 4: Decouple Persistence with an Event Stream
Synchronous database writes are replaced by a fire-and-forget HTTP POST to a Kafka REST proxy (Confluent 7.5). The proxy buffers writes for 10ms before flushing, achieving high throughput with minimal broker backpressure. A separate aggregator service consumes the stream and materializes the session state table every 15 minutes. This shifts the write pattern from high-churn synchronous inserts to batched, append-only operations, eliminating autovacuum freeze windows.
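The aggregator's job can be sketched as a fold over the append-only event stream. This is a simplified illustration, not the article's aggregator: the `SessionState` fields are hypothetical, the event shape mirrors the worker's emitted fields, and `payload` is assumed to be a numeric reward amount.

```typescript
// Event shape mirroring the fields the worker emits (payload assumed numeric).
interface SessionEvent {
  session_id: string;
  event_type: string;
  payload: number;
  ts: number;
}

// Hypothetical materialized row per session.
interface SessionState {
  totalReward: number;
  eventCount: number;
  lastEventTs: number;
}

// Folds one batch of events into a copy of the prior materialized state.
function materialize(
  events: SessionEvent[],
  prior: Map<string, SessionState> = new Map<string, SessionState>(),
): Map<string, SessionState> {
  const next = new Map<string, SessionState>(prior);
  for (const e of events) {
    if (e.event_type !== "reward_issued") continue; // ignore unrelated events
    const cur = next.get(e.session_id) ?? { totalReward: 0, eventCount: 0, lastEventTs: 0 };
    next.set(e.session_id, {
      totalReward: cur.totalReward + e.payload,
      eventCount: cur.eventCount + 1,
      lastEventTs: Math.max(cur.lastEventTs, e.ts),
    });
  }
  return next;
}

// Usage: two batches applied in sequence, as successive runs would do.
const firstRun = materialize([
  { session_id: "s1", event_type: "reward_issued", payload: 10, ts: 100 },
  { session_id: "s1", event_type: "reward_issued", payload: 5, ts: 120 },
  { session_id: "s2", event_type: "reward_issued", payload: 7, ts: 110 },
]);
const secondRun = materialize(
  [{ session_id: "s1", event_type: "reward_issued", payload: 3, ts: 900 }],
  firstRun,
);
```

Because each run produces a new map from the previous one plus a batch, the database sees one write per session per run instead of one write per event.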
Step 5: Autoscale Dynamically
KEDA monitors the envoy_session_requests_per_second metric exported by the proxy. The Horizontal Pod Autoscaler scales worker pods based on a target of 500 RPS per pod, polling every 15 seconds. When traffic drops, pods scale to zero within 45 seconds, eliminating idle compute costs.
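The scaling target translates into simple arithmetic, sketched below. This is a simplification under stated assumptions: `desiredReplicas` is a hypothetical helper, and the real autoscaler also applies tolerances, stabilization windows, and cooldowns that are ignored here. The 500 RPS target and the 0 to 1500 replica bounds come from the configuration in this article.

```typescript
// Rough replica count for a per-pod RPS target, clamped to configured bounds.
function desiredReplicas(
  totalRps: number,
  targetRpsPerPod: number,
  minReplicas: number,
  maxReplicas: number,
): number {
  if (targetRpsPerPod <= 0) throw new Error("target must be positive");
  const raw = Math.ceil(totalRps / targetRpsPerPod);
  return Math.min(maxReplicas, Math.max(minReplicas, raw));
}

// Usage with the article's numbers: 500 RPS per pod, 0 to 1500 replicas.
const idle = desiredReplicas(0, 500, 0, 1500);
const moderate = desiredReplicas(1200, 500, 0, 1500);
const capped = desiredReplicas(1000000, 500, 0, 1500);
```

Rounding up means a single request per second is enough to keep one pod alive, which is why scale-to-zero only happens once traffic stops entirely.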
Pitfall Guide
1. Optimizing Storage Before Execution
Explanation: Teams often tune PostgreSQL fsync, adjust Redis pool sizes, or add connection proxies while the runtime scheduler is already thrashing. Storage optimizations cannot compensate for context-switch overhead or allocator stalls.
Fix: Profile the scheduler run queue and context switch rate (vmstat 1, pidstat -w) before touching database configuration. If context switches exceed 5k/sec under load, refactor the execution model first.
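As a rough sketch of that check, the system-wide context switch rate can be derived from two samples of the `ctxt` counter in Linux's /proc/stat. The function names are hypothetical, the samples are passed in as strings so the logic stays independent of how they are read, and the 5,000 per second threshold is the one quoted above.

```typescript
// Extracts the cumulative context switch counter from /proc/stat contents.
function parseCtxt(procStat: string): number {
  const match = /^ctxt\s+(\d+)$/m.exec(procStat);
  if (!match) throw new Error("no ctxt line found");
  return Number(match[1]);
}

// Rate between two samples taken `intervalSec` seconds apart.
function switchesPerSecond(before: string, after: string, intervalSec: number): number {
  if (intervalSec <= 0) throw new Error("interval must be positive");
  return (parseCtxt(after) - parseCtxt(before)) / intervalSec;
}

const CONTEXT_SWITCH_THRESHOLD = 5000; // per second, from the guidance above

function exceedsThreshold(rate: number): boolean {
  return rate > CONTEXT_SWITCH_THRESHOLD;
}

// Usage with two hand-written samples taken 2 seconds apart.
const sampleA = "cpu  100 0 50 1000\nctxt 1000000\nbtime 1700000000\n";
const sampleB = "cpu  110 0 55 1190\nctxt 1012000\nbtime 1700000000\n";
const rate = switchesPerSecond(sampleA, sampleB, 2);
const needsRefactor = exceedsThreshold(rate);
```

This measures the whole host, so on a shared node it is only a first signal; per-process figures from pidstat -w narrow it down.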
2. Ignoring Context Switch Tax in Coroutine Runtimes
Explanation: Coroutines are cheap to create but expensive to suspend when I/O latency increases. Each suspension triggers a kernel context switch. At 3,000 sessions with 6ms I/O wait, you can easily exceed 20k switches/sec, saturating CPU steal time.
Fix: Bound I/O wait per execution cycle. If latency exceeds 2ms, switch to a process-per-request model or use an async event loop with non-blocking I/O instead of coroutines.
3. Forcing Synchronous Writes in High-Churn Systems
Explanation: Writing to a relational database on every event creates write amplification, triggers autovacuum, and locks tables. Disabling fsync reduces durability guarantees and risks data loss during failovers.
Fix: Decouple writes using an event stream. Accept eventual consistency for non-critical state. Materialize aggregates asynchronously to keep the primary write path under 15ms.
4. Misconfiguring Proxy Retry Policies for Stateless Workers
Explanation: Stateless workers exit after execution. If the proxy retries a request without idempotency checks, duplicate events corrupt state. Conversely, aggressive timeouts cause premature retries.
Fix: Implement idempotency keys at the worker level. Configure the proxy with exponential backoff and a maximum retry window that aligns with worker lifecycle duration (<100ms).
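One way to reason about a bounded retry window is to compute the backoff schedule up front. The sketch below is an illustration only and does not reproduce any proxy's actual retry algorithm (real proxies typically add jitter); `backoffSchedule`, the 10 ms base delay, and the doubling factor are assumptions, while the 100 ms window is the figure from the fix above.

```typescript
// Returns the retry delays (ms) that fit inside `maxWindowMs` in total.
function backoffSchedule(
  baseMs: number,
  factor: number,
  maxWindowMs: number,
  maxRetries: number,
): number[] {
  const delays: number[] = [];
  let elapsed = 0;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const delay = baseMs * Math.pow(factor, attempt);
    if (elapsed + delay > maxWindowMs) break; // next retry would overrun the window
    delays.push(delay);
    elapsed += delay;
  }
  return delays;
}

// Usage: 10 ms base, doubling each time, capped by a 100 ms window.
const schedule = backoffSchedule(10, 2, 100, 5);
```

With these illustrative values only three retries fit (10, 20, and 40 ms), so allowing more attempts than that would have no effect within the window.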
5. Overlooking Bytecode Compilation Overhead at Cold Start
Explanation: Interpreting scripts on every worker spawn adds 5 to 15 ms of latency. At scale, this compounds and violates P99 SLAs.
Fix: Pre-compile all business logic to bytecode during CI/CD. Validate bytecode integrity via hash checks at startup. Never ship source code to production workers.
6. Assuming Eventual Consistency is Transparent to Clients
Explanation: Clients expect immediate feedback. If the materialized view updates every 15 minutes, users may see stale progress indicators.
Fix: Return a deterministic acknowledgment immediately (e.g., event_queued: true). Use optimistic UI updates on the client side. Provide a separate polling endpoint for real-time state if required.
Production Bundle
Action Checklist
- Profile scheduler metrics: Verify context switch rate and CPU steal time before optimizing storage.
- Compile business logic to bytecode: Eliminate runtime interpretation overhead and standardize cold starts.
- Implement idempotency keys: Use atomic Redis operations to prevent duplicate processing during proxy retries.
- Decouple writes to an event stream: Replace synchronous DB inserts with async HTTP/Kafka ingestion.
- Configure consistent hashing: Maintain session affinity at the proxy layer without server-side state.
- Tune KEDA polling intervals: Align scaler checks with traffic patterns to avoid thrashing during spikes.
- Validate consumer lag: Monitor aggregator throughput to ensure materialized views stay within acceptable freshness windows.
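For the consumer lag item, the check reduces to comparing each partition's end offset with the aggregator's committed offset. The sketch below assumes those offsets have already been fetched by some monitoring tool; `PartitionOffsets`, `totalLag`, `isWithinFreshness`, and the lag budget are hypothetical.

```typescript
// Offsets for one partition, as reported by whatever monitoring is in place.
interface PartitionOffsets {
  partition: number;
  logEndOffset: number;    // latest offset written to the partition
  committedOffset: number; // last offset the aggregator has committed
}

// Sum of unprocessed messages across all partitions.
function totalLag(offsets: PartitionOffsets[]): number {
  return offsets.reduce(
    (sum, p) => sum + Math.max(0, p.logEndOffset - p.committedOffset),
    0,
  );
}

// True when the backlog is within an agreed budget (the budget is an assumption).
function isWithinFreshness(offsets: PartitionOffsets[], maxLag: number): boolean {
  return totalLag(offsets) <= maxLag;
}

// Usage with made-up offsets for three partitions.
const offsets: PartitionOffsets[] = [
  { partition: 0, logEndOffset: 1500, committedOffset: 1400 },
  { partition: 1, logEndOffset: 900, committedOffset: 900 },
  { partition: 2, logEndOffset: 2000, committedOffset: 1750 },
];
const lag = totalLag(offsets);
const fresh = isWithinFreshness(offsets, 1000);
```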
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| <500 concurrent sessions, low I/O latency | Coroutine-per-request | Simpler codebase, lower infra overhead | Low (single process) |
| 500-5,000 sessions, high I/O variance | Stateless short-lived workers | Eliminates scheduler thrashing, bounds memory | Medium (container orchestration) |
| >5,000 sessions, bursty traffic | Serverless functions (e.g., AWS Lambda) | Native scale-to-zero, no pod management | High (per-invocation pricing) |
| Strict ACID requirements per event | Synchronous DB writes + connection pooling | Guarantees immediate consistency | High (DB scaling, licensing) |
| Eventual consistency acceptable | Async event stream + materialized view | Maximizes write throughput, reduces DB load | Low-Medium (stream infra + aggregator) |
Configuration Template
# keda-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: session-worker-scaler
spec:
  scaleTargetRef:
    name: session-worker-deployment
  pollingInterval: 15
  cooldownPeriod: 30
  minReplicaCount: 0
  maxReplicaCount: 1500
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: envoy_session_requests_per_second
        threshold: "500"
        query: sum(rate(envoy_http_downstream_rq_total{envoy_http_conn_manager_prefix="session_ingress"}[1m]))
---
# envoy-config.yaml (snippet)
static_resources:
  listeners:
    - name: session_listener
      address:
        socket_address: { address: 0.0.0.0, port_value: 8080 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: workers
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/process" }
                          route:
                            cluster: worker_pool
                            hash_policy:
                              - header: { header_name: "x-session-id" }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config: {}
Quick Start Guide
- Containerize the worker: Build a multi-stage Docker image using luajit/luajit:2.1-alpine as the base. Pre-compile Lua scripts to bytecode, copy them into /opt/bin/, and set the entrypoint to execute the binary with arguments.
- Deploy the proxy: Install Envoy with consistent hashing enabled on the session header. Configure upstream health checks with a 2-second interval and 2 retry attempts.
- Initialize the event stream: Provision a Kafka topic with 6 partitions. Deploy the Confluent REST proxy and configure a 10ms write buffer. Verify ingestion via curl -X POST with a test payload.
- Attach the scaler: Apply the KEDA ScaledObject manifest. Point it to your Prometheus instance. Validate scaling by generating synthetic traffic with wrk -t4 -c300 -d30s http://envoy:8080/process.
- Monitor convergence: Track P99 latency, context switch rate, and Kafka consumer lag. Adjust the KEDA threshold and proxy retry window until P99 stabilizes below 100ms under load.
