Stackless Coroutines in C++ for Games: I Rewrote Our AI System Using Them and Here's What I Learned
Deterministic Flow Control in C++20: Replacing Callback Chains with Stackless Coroutines
Current Situation Analysis
Modern game architectures and simulation engines frequently encounter a structural bottleneck: sequential logic that must pause, wait for external conditions, and resume without blocking the main thread. Traditional implementations force developers into one of two patterns, both carrying significant technical debt.
The first pattern relies on explicit state machines. Every behavior is decomposed into discrete states, transitions are managed via enums, and progress is driven by a central update loop polling conditions. While memory-efficient, this approach fractures execution flow. A single AI sequence like patrol → detect_threat → engage → retreat scatters its logic across state definitions, transition tables, and callback registrations. Debugging requires reconstructing execution paths mentally or through extensive logging, which dramatically increases mean time to resolution (MTTR) for complex behavioral bugs.
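To make the fragmentation concrete, here is a minimal sketch of the enum-driven pattern. The GuardFsm type, its fields, and the transition thresholds are all hypothetical and chosen only for illustration. Note that the patrol, detect, engage, retreat order is never written down in one place; it exists only implicitly across the switch cases.

```cpp
// Hypothetical states for the patrol -> detect_threat -> engage -> retreat sequence.
enum class GuardState { Patrol, DetectThreat, Engage, Retreat, Done };

struct GuardFsm {
    GuardState state = GuardState::Patrol;
    float timer = 0.0f;           // must be a member so it survives between Update() calls
    bool threat_visible = false;  // inputs normally fed by perception systems
    int health = 100;

    // Called once per frame by the central update loop.
    void Update(float dt) {
        switch (state) {
        case GuardState::Patrol:
            if (threat_visible) { state = GuardState::DetectThreat; timer = 0.0f; }
            break;
        case GuardState::DetectThreat:
            timer += dt;  // reaction delay before engaging
            if (timer >= 0.5f) state = GuardState::Engage;
            break;
        case GuardState::Engage:
            if (health < 30) state = GuardState::Retreat;
            break;
        case GuardState::Retreat:
            state = GuardState::Done;  // retreat resolves instantly in this sketch
            break;
        case GuardState::Done:
            break;
        }
    }
};
```

Every piece of data that must outlive a frame (here, timer) has to be promoted to a member by hand, which is exactly the bookkeeping a coroutine frame automates.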
The second pattern uses callback chains or event-driven architectures. Each step registers a continuation function that fires when a condition is met. This improves readability slightly but introduces severe maintainability issues. Context must be manually threaded through lambda captures or external structs. Error handling becomes fragmented, and stack traces lose their linear relationship to source code. In production environments with hundreds of concurrent entities, callback networks become impossible to trace deterministically.
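A small sketch shows how context threading looks in practice. EventQueue and PatrolContext are hypothetical stand-ins for an engine event system and per-entity state; they are not taken from any real engine. Each step of the sequence is a nested lambda, and every lambda must capture the context explicitly.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical deferred-event queue standing in for an engine event system.
struct EventQueue {
    std::queue<std::function<void()>> pending;

    void Post(std::function<void()> fn) { pending.push(std::move(fn)); }

    // Runs queued callbacks, including any that are posted while draining.
    void Drain() {
        while (!pending.empty()) {
            auto fn = std::move(pending.front());
            pending.pop();
            fn();
        }
    }
};

// Context that every continuation must carry along by hand.
struct PatrolContext {
    std::vector<std::string> log;
};

inline void StartPatrol(EventQueue& q, PatrolContext& ctx) {
    ctx.log.push_back("walk");
    q.Post([&q, &ctx] {          // continuation 1: arrived at waypoint
        ctx.log.push_back("scan");
        q.Post([&ctx] {          // continuation 2: scan finished
            ctx.log.push_back("idle");
        });
    });
}
```

The reference captures are only safe while q and ctx outlive the queued callbacks, which is one of the lifetime hazards this pattern pushes onto the programmer.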
Many teams attempt to solve this with stackful coroutines (fibers). Fibers provide native call stacks, preserving local variables and call frames across suspension points. However, they require substantial memory reservations. A safe minimum of 64KB per fiber is standard to prevent stack overflows during deep utility calls. When scaling to hundreds of concurrent entities, memory consumption quickly exceeds budgets on constrained platforms. The resulting need for fiber pooling and recycling reintroduces the complexity developers were trying to eliminate.
C++20 stackless coroutines resolve this architectural tension. The compiler mechanically transforms coroutine functions into heap-allocated state machines. Only variables that survive suspension are stored in a coroutine frame, typically consuming 200–500 bytes per task. Execution remains linear in source code, suspension points are explicit, and memory overhead scales predictably. The trade-off is a steeper initial learning curve around promise types, awaitable interfaces, and compiler-specific behavior. Understanding these mechanics is essential before deploying coroutines in performance-critical systems.
WOW Moment: Key Findings
The architectural shift from callback/state-machine patterns to stackless coroutines yields measurable improvements in memory efficiency, debugging velocity, and platform scalability. The following comparison isolates the critical metrics observed when migrating concurrent behavioral systems.
| Approach | Memory Overhead | Debugging Complexity | Platform Scalability | Implementation Friction |
|---|---|---|---|---|
| State Machines | ~0KB per task | High (scattered enums/callbacks) | Excellent | Medium (boilerplate) |
| Stackful Fibers | ~64KB per task | Low (native stack traces) | Poor (RAM limits) | High (pool management) |
| C++20 Stackless | ~300B per frame | Low (linear syntax) | Excellent | Medium (promise setup) |
This data demonstrates that stackless coroutines occupy a unique optimization space. They preserve the linear readability of fiber-based approaches while maintaining the memory footprint of state machines. The elimination of callback registration reduces cognitive load during debugging, and the predictable frame size enables safe deployment across memory-constrained hardware. The primary cost shifts from runtime overhead to compile-time configuration and promise type design, which is a one-time engineering investment.
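As a rough worked example of the memory column, the arithmetic below assumes a hypothetical scene with 1,000 concurrent tasks. The per-task figures are the approximations from the table; the task count is an assumption chosen for illustration.

```cpp
#include <cstddef>

constexpr std::size_t kFiberStackBytes = 64 * 1024;  // ~64KB reserved per fiber (table value)
constexpr std::size_t kCoroFrameBytes  = 300;        // ~300B per coroutine frame (table value)
constexpr std::size_t kTaskCount       = 1000;       // assumed number of concurrent tasks

constexpr std::size_t TotalBytes(std::size_t per_task, std::size_t task_count) {
    return per_task * task_count;
}

// Fibers: 65,536,000 bytes, i.e. 62.5 MiB reserved up front.
static_assert(TotalBytes(kFiberStackBytes, kTaskCount) == 65'536'000);
// Stackless frames: 300,000 bytes, i.e. roughly 293 KiB.
static_assert(TotalBytes(kCoroFrameBytes, kTaskCount) == 300'000);
```

Under these assumptions the fiber approach reserves a little over 200 times more memory than the coroutine frames.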
Core Solution
Implementing stackless coroutines requires understanding three components that the compiler wires together: the promise type, the awaitable interface, and the coroutine handle. The first two are written by you; the handle is supplied by the standard library. The following implementation demonstrates a task type designed for sequential AI behavior, with explicit memory management and an awaitable written in the handle-returning style that symmetric transfer relies on.
Step 1: Define the Promise Type with Custom Allocation
The promise type controls coroutine lifecycle, return value handling, and frame allocation. Custom operator new and operator delete prevent heap fragmentation in hot paths.
```cpp
#include <coroutine>
#include <cstddef>
#include <exception>

#include "PromiseAllocator.h"  // declares Engine_MemoryPool (see Configuration Template)

struct AI_BehaviorTask {
    struct promise_type {
        // Route frame creation to the engine memory pool. The requested size covers
        // the whole coroutine frame, so ask for default-new alignment rather than
        // alignof(promise_type), which is 1 for an empty promise.
        static void* operator new(std::size_t size) {
            return Engine_MemoryPool::Allocate(size, alignof(std::max_align_t));
        }
        static void operator delete(void* ptr) noexcept {
            Engine_MemoryPool::Deallocate(ptr);
        }

        // Constructs the wrapper object returned to the caller
        AI_BehaviorTask get_return_object() noexcept {
            return AI_BehaviorTask{
                std::coroutine_handle<promise_type>::from_promise(*this)
            };
        }

        // Delay execution until caller explicitly resumes
        std::suspend_always initial_suspend() noexcept { return {}; }
        // Preserve frame after completion for explicit cleanup
        std::suspend_always final_suspend() noexcept { return {}; }

        // Terminate on unhandled exceptions to prevent silent corruption
        void unhandled_exception() noexcept { std::terminate(); }

        // Task returns void; no value storage required
        void return_void() noexcept {}
    };

    std::coroutine_handle<promise_type> handle;

    explicit AI_BehaviorTask(std::coroutine_handle<promise_type> h) noexcept
        : handle(h) {}
    ~AI_BehaviorTask() noexcept {
        if (handle) handle.destroy();
    }

    // Move-only: exactly one wrapper owns (and destroys) the frame
    AI_BehaviorTask(const AI_BehaviorTask&) = delete;
    AI_BehaviorTask& operator=(const AI_BehaviorTask&) = delete;
    AI_BehaviorTask(AI_BehaviorTask&& other) noexcept
        : handle(other.handle) { other.handle = nullptr; }

    void resume() { if (handle && !handle.done()) handle.resume(); }
    bool is_complete() const noexcept { return !handle || handle.done(); }
};
```
Architecture Rationale:
- initial_suspend() returns suspend_always to prevent immediate execution. This allows the caller to store the task handle before the first co_await triggers.
- final_suspend() returns suspend_always to prevent automatic frame destruction. This gives the engine explicit control over cleanup, which is critical when tasks are managed by external systems (e.g., AI managers or object pools).
- Custom allocators route frame creation through a pre-allocated memory pool, eliminating std::malloc overhead and preventing heap fragmentation during level loads or wave spawns.
Step 2: Implement the Awaitable Interface
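Awaitables define suspension points. The following timer awaitable uses the handle-returning form of await_suspend, which is the signature that makes symmetric transfer possible. Symmetric transfer is what prevents stack overflow when thousands of coroutines resume one another in a chain.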
```cpp
#include <coroutine>

// Engine_TimerSystem is the engine's timer service. It is assumed to invoke
// the registered callback on the update thread once the duration has elapsed.
struct AwaitableTimer {
    float duration_seconds;

    explicit AwaitableTimer(float seconds) : duration_seconds(seconds) {}

    // Fast path: skip suspension if duration is zero or negative
    bool await_ready() const noexcept {
        return duration_seconds <= 0.0f;
    }

    // Handle-returning form of await_suspend (the signature used for symmetric transfer)
    std::coroutine_handle<> await_suspend(
        std::coroutine_handle<> awaiting_coro) noexcept {
        // Register with engine timer system
        Engine_TimerSystem::Register(duration_seconds, [awaiting_coro]() {
            awaiting_coro.resume();
        });
        // No other coroutine is ready to run: hand control back to the resumer
        return std::noop_coroutine();
    }

    void await_resume() const noexcept {}
};
```
Architecture Rationale:
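- await_ready() provides a fast path for zero-duration waits, avoiding unnecessary suspension overhead.
- await_suspend() returns std::coroutine_handle<> instead of void. When the returned handle refers to another coroutine, the compiler resumes it as a tail call (symmetric transfer) rather than through a nested resume() call. Without this, a chain of 10,000 coroutines resuming one another could exhaust the call stack.
- std::noop_coroutine() terminates the symmetric transfer chain safely, returning control to the caller of resume() (here, the scheduler) without unwinding through intermediate frames. In this timer, where no other coroutine is ready to run, returning it behaves the same as a void await_suspend; the stack savings appear once an awaitable returns a real continuation handle.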
Step 3: Compose Sequential Behavior
The coroutine function now reads as linear code while executing asynchronously.
AI_BehaviorTask GuardPatrolSequence(AI_Unit& unit, const WaypointList& route) {
for (const auto& waypoint : route.nodes) {
unit.SetNavigationTarget(waypoint.position);
unit.PlayAnimation("Walk");
// Suspend until navigation completes
co_await unit.AwaitNavigationComplete();
// Pause at waypoint
co_await AwaitableTimer(1.5f);
unit.PlayAnimation("Scan");
co_await AwaitableTimer(0.8f);
}
unit.PlayAnimation("Idle");
co_return;
}
Integration Pattern:
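The game loop or AI manager maintains a collection of active tasks. It calls resume() once to start each task; after that, a task is resumed by whatever it is awaiting (for example, Engine_TimerSystem firing its callback during the frame update) rather than unconditionally every frame, because resuming a task early would skip its pending wait. Each frame, the manager detects completed tasks via is_complete() and removes them from the active set. Because every resume happens on the update thread, coroutine logic never blocks the frame and update ordering stays deterministic.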
Pitfall Guide
1. Unbounded Heap Fragmentation
Explanation: Default coroutine frame allocation uses operator new, which fragments the heap when spawning hundreds of tasks per second. This manifests as gradual performance degradation and eventual allocation failures.
Fix: Implement operator new and operator delete inside the promise type. Route allocations through a linear allocator, slab pool, or engine-specific memory manager.
2. Stack Overflow in Coroutine Chains
Explanation: When await_suspend returns void, resuming the next coroutine performs a standard function call. Chaining many coroutines recursively exhausts the call stack.
Fix: Return std::coroutine_handle<> from await_suspend and terminate chains with std::noop_coroutine(). Verify compiler support for symmetric transfer (Clang 11+, GCC 11+, MSVC 17.x+).
3. Burying co_await in Helper Functions
Explanation: co_await cannot appear inside non-coroutine helper functions. Attempting to do so causes compilation errors or forces the helper to become a coroutine, which may not align with the design.
Fix: Extract suspension logic into awaitable structs. Pass awaitables to helpers, or convert the helper itself into a coroutine if it requires suspension points.
4. Ignoring Compiler Version Pinning
Explanation: C++20 coroutine support evolved significantly between 2021 and 2023. Older compilers may compile code without warnings but exhibit undefined behavior at runtime, particularly around promise type lifetimes and symmetric transfer.
Fix: Enforce minimum versions in build configuration. Test a trivial co_return coroutine against new SDK drops before deploying to production targets.
5. Premature Frame Destruction
Explanation: Returning suspend_never from final_suspend() causes immediate frame destruction upon completion. If external systems hold references to the coroutine handle, this results in use-after-free crashes.
Fix: Return suspend_always from final_suspend() and explicitly call handle.destroy() when the task is fully processed by the manager.
6. Exception Safety Gaps
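Explanation: An exception that escapes a coroutine body does not unwind into the code that called resume() the way it would from an ordinary function. The compiler catches it and calls the promise's unhandled_exception(). If that function simply discards the exception, the failure is swallowed silently and the task appears to finish normally; if it terminates, the process ends abruptly.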
Fix: Implement unhandled_exception() in the promise type. Route exceptions to a centralized error handler or call std::terminate() to fail fast during development.
7. Engine Build System Conflicts
Explanation: Monolithic build systems (e.g., Unreal Engine's unity builds) may aggregate coroutine headers with conflicting promise type definitions or allocator overrides, causing ODR violations.
Fix: Isolate coroutine implementations in separate translation units. Compile coroutine-heavy modules outside unity builds. Use thin wrapper headers to expose only the task type to engine code.
Production Bundle
Action Checklist
- Pin compiler versions: MSVC 19.28+, GCC 10+, or Clang 12+
- Configure CMake with CMAKE_CXX_STANDARD_REQUIRED ON and cxx_std_20
- Implement custom operator new/delete in promise type for pool allocation
- Verify symmetric transfer support on target platform SDK
- Return suspend_always from final_suspend() and manage handle lifecycle explicitly
- Implement unhandled_exception() to prevent silent corruption
- Integrate coroutine resume calls into the main update loop or dedicated scheduler
- Add static assertions to validate promise type layout and alignment
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| AI Behavior Sequences | C++20 Stackless Coroutines | Linear syntax reduces debugging time; predictable memory footprint | Medium initial setup, low runtime cost |
| High-Frequency Physics Callbacks | State Machines | Deterministic update ordering; zero allocation overhead | Low memory, high boilerplate |
| Cutscene/Scripting Systems | C++20 Stackless Coroutines | Natural fit for sequential timing and event waiting | Medium setup, excellent maintainability |
| Network Packet Processing | Callback Chains / Async I/O | Event-driven architecture aligns with socket readiness | Low memory, high complexity at scale |
| Memory-Constrained Embedded | State Machines | Zero dynamic allocation; strict memory bounds | Low runtime cost, high development time |
Configuration Template
```cmake
# CMakeLists.txt
cmake_minimum_required(VERSION 3.20)
project(GameEngine LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)

add_executable(GameEngine src/main.cpp src/ai_system.cpp)
target_compile_features(GameEngine PRIVATE cxx_std_20)

# Project-level feature flags read by engine code. These only record the
# decision; they do not themselves verify compiler support.
target_compile_definitions(GameEngine PRIVATE
    COROUTINE_SYMMETRIC_TRANSFER_SUPPORTED=1
    COROUTINE_POOL_ALLOCATOR_ENABLED=1
)
```
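```cpp
// PromiseAllocator.h
#pragma once

#include <coroutine>
#include <cstddef>

// Engine-provided pool; the definitions live in the engine's memory module.
struct Engine_MemoryPool {
    static void* Allocate(std::size_t size, std::size_t alignment) noexcept;
    static void Deallocate(void* ptr) noexcept;
};

// Inherit from this in a promise type to route coroutine frames to the pool.
template<typename Promise>
struct CoroutineAllocatorMixin {
    static void* operator new(std::size_t size) {
        // The frame holds more than the promise, so request default-new alignment
        // instead of alignof(Promise). Allocate is assumed never to return nullptr.
        return Engine_MemoryPool::Allocate(size, alignof(std::max_align_t));
    }
    static void operator delete(void* ptr) noexcept {
        Engine_MemoryPool::Deallocate(ptr);
    }
};
```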
Quick Start Guide
- Verify Compiler Support: Run g++ -x c++ -std=c++20 -E -dM - < /dev/null | grep __cpp_impl_coroutine or the equivalent for your compiler. Expect a value ≥ 201902L. The older __cpp_coroutines macro (201703L) indicates only Coroutines TS support.
- Create Minimal Task: Implement a promise type with get_return_object(), initial_suspend(), final_suspend(), return_void(), and unhandled_exception(). Return suspend_always from both suspend functions.
- Define Awaitable: Create a struct with await_ready(), await_suspend(), and await_resume(). Have await_suspend return a std::coroutine_handle<> (std::noop_coroutine() when nothing else should run) so the awaitable can take part in symmetric transfer.
- Integrate Scheduler: Maintain a std::vector<std::coroutine_handle<>> in your update loop. Call resume() on each handle that is ready to continue, remove completed tasks, and destroy handles explicitly.
- Validate Memory: Profile frame allocation counts and sizes. Confirm that custom allocators are invoked and heap fragmentation remains stable under load.