Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

By Codcompass Team · 8 min read

Vectorized Imperfect-Information Environments: Scaling Multi-Agent RL with JAX

Current Situation Analysis

Training reinforcement learning agents for complex, multi-player, imperfect-information games remains one of the most computationally demanding tasks in modern AI research. Games like Riichi Mahjong feature high-dimensional state spaces, stochastic tile distributions, and hidden information across four concurrent agents. These characteristics closely mirror real-world decision systems in finance, logistics, and strategic planning, where outcomes depend on partial observability and probabilistic transitions.

Despite this alignment, the industry has historically sidestepped end-to-end training for such environments. Most research pipelines default to supervised learning from human play logs. The rationale is straightforward: collecting human trajectories is cheaper than building scalable simulators, and imitation learning converges faster on narrow tasks. However, this approach caps agent capability. Policies trained on human data inherit human biases, struggle with out-of-distribution states, and cannot discover novel strategies that deviate from established play patterns. True generalization requires tabula rasa training: agents that learn exclusively through self-play and environmental feedback.

The bottleneck is infrastructure. Traditional game simulators are written in Python or C++ with sequential execution models. They mutate internal state, rely on global random number generators, and execute one game instance at a time. When scaled to millions of episodes, CPU-bound simulators become the primary constraint, forcing researchers to either reduce environment complexity or accept prohibitively long training cycles.

Modern GPU architectures can resolve this constraint, but only when environments are designed for massive parallelization. Fully vectorized implementations remove per-step Python interpreter overhead, keep state immutable, and run batched operations across many game instances at once on the GPU. Benchmarks across eight NVIDIA A100 GPUs demonstrate that properly vectorized environments can sustain throughputs of up to 2 million steps per second under standard (no-red) rules, and 1 million steps per second when red tile variants are introduced. This throughput shift transforms what was once a weeks-long training cycle into a matter of hours, enabling rigorous experimentation with self-play architectures, reward shaping, and multi-agent coordination strategies.
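To make the contrast with mutable, globally seeded simulators concrete, here is a minimal sketch of a vectorized environment in JAX. It is not Mahjax's API: `ToyState`, `init`, and `step` are hypothetical names for a toy game that only draws tiles from a shuffled wall. The point is the pattern: state is an immutable pytree, randomness comes from an explicit PRNG key, and `jax.vmap` turns the single-game functions into batched ones.

```python
from typing import NamedTuple

import jax
import jax.numpy as jnp

NUM_TILES = 136  # 34 tile types x 4 copies in a Riichi Mahjong set


class ToyState(NamedTuple):
    """Hypothetical immutable state for a toy draw-only game."""
    wall: jnp.ndarray       # (NUM_TILES,) shuffled tile ids
    next_idx: jnp.ndarray   # scalar: position of the next tile to draw
    last_draw: jnp.ndarray  # scalar: id of the most recently drawn tile


def init(key):
    # Randomness comes from an explicit key, not a global generator.
    wall = jax.random.permutation(key, NUM_TILES)
    return ToyState(wall=wall, next_idx=jnp.int32(0), last_draw=jnp.int32(-1))


def step(state):
    # Pure function: returns a new state instead of mutating the old one.
    tile = state.wall[state.next_idx]
    return state._replace(next_idx=state.next_idx + 1, last_draw=tile)


# vmap lifts the single-game functions to a batch; jit compiles them.
batched_init = jax.jit(jax.vmap(init))
batched_step = jax.jit(jax.vmap(step))

keys = jax.random.split(jax.random.PRNGKey(0), 1024)
states = batched_init(keys)    # 1024 independent games
states = batched_step(states)  # one draw in every game at once
```

Because `step` never mutates its input, the same function works for one game or thousands; the batch dimension is added entirely by `jax.vmap`.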

WOW Moment: Key Findings

The transition from sequential CPU simulators to GPU-vectorized JAX environments fundamentally alters the feasibility of tabula rasa reinforcement learning. The following comparison highlights the operational shift:

| Approach | Throughput (steps/sec) | Training Paradigm | State Space Handling | Hardware Utilization |
| --- | --- | --- | --- | --- |
| CPU-Sequential Simulator | ~10k–50k | Supervised/Human Logs | Manual/Heuristic | Low (<15% GPU) |
| JAX-Vectorized Engine | 1M–2M | Tabula Rasa/Self-Play | Native Tensor Operations | High (>85% GPU) |

This finding matters because it decouples environment complexity from training feasibility. When throughput scales linearly with GPU count and batch size, researchers can afford to maintain full game state fidelity, implement accurate reward functions, and run large-scale rollouts without sacrificing convergence speed. The ability to process millions of stochastic transitions per second enables stable policy gradient estimation, reduces variance in multi-agent credit assignment, and makes self-play training economically viable for academic and industrial labs alike.
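Throughput figures like these are typically measured as batch size times steps per rollout divided by wall-clock time. The sketch below shows one way to take that measurement in JAX, under stated assumptions: `toy_step` is a hypothetical stand-in for a real environment step (it only samples one of the 34 tile types), and `measure_steps_per_sec` is an illustrative helper, not part of any library. The numbers it produces say nothing about Mahjax itself.

```python
import time

import jax
import jax.numpy as jnp


def toy_step(carry, _):
    # Hypothetical stand-in for an environment step: sample a tile type.
    key, total = carry
    key, sub = jax.random.split(key)
    tile = jax.random.randint(sub, (), 0, 34)
    return (key, total + tile), None


def rollout(key, num_steps):
    # lax.scan keeps the whole rollout inside one compiled program.
    (_, total), _ = jax.lax.scan(
        toy_step, (key, jnp.int32(0)), None, length=num_steps
    )
    return total


def measure_steps_per_sec(batch_size, num_steps):
    run = jax.jit(jax.vmap(lambda k: rollout(k, num_steps)))
    keys = jax.random.split(jax.random.PRNGKey(0), batch_size)
    run(keys).block_until_ready()  # warm-up call triggers compilation
    start = time.perf_counter()
    run(keys).block_until_ready()  # wait for the device to finish
    elapsed = time.perf_counter() - start
    return batch_size * num_steps / elapsed


steps_per_sec = measure_steps_per_sec(batch_size=256, num_steps=100)
print(f"{steps_per_sec:,.0f} steps/sec")
```

The warm-up call and `block_until_ready()` matter: JAX dispatches work asynchronously, so timing without them would measure compilation or dispatch rather than execution.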

Core Solution

Building a GPU-accelerated environment requires abandoning traditional object-oriented state management in favor of pure functional programming and explicit tensor threading. The architecture rests on three pillars: immutable stat
