Building an Open-Source Text-to-30s-Cinematic-Reel Pipeline on a Single AMD MI300X
Single-GPU Video Synthesis: A Sequential Architecture for Multi-Model Pipelines
Current Situation Analysis
Modern generative video stacks are architecturally fragmented. Developers typically chain together a language model for scripting, a diffusion model for keyframes, an image-to-video transformer for motion, and separate audio engines for voice and music. The industry standard response to VRAM constraints has been horizontal scaling: distributing each component across multiple GPUs or cloud instances. This approach introduces severe operational friction. Network latency between nodes, synchronization overhead, and complex dependency management turn creative iteration into infrastructure management.
The problem is overlooked because most frameworks assume parallel execution is mandatory. Engineers rarely consider that high-bandwidth memory (HBM) architectures can support sequential model execution with aggressive memory reclamation. When models are loaded, executed, and fully evicted in a tightly controlled pipeline, a single accelerator with sufficient memory capacity can outperform distributed clusters for batch-oriented creative workloads.
Data from recent production deployments on the AMD Instinct MI300X demonstrates this clearly. The card's 192 GB HBM3 memory pool allows a 35B Mixture-of-Experts director, a 4B diffusion keyframe generator, a 14B image-to-video MoE, a 3.5B music synthesizer, and an 82M text-to-speech engine to share the same silicon sequentially. By implementing strict memory lifecycle management and dual-role model loading, end-to-end generation for a 30-second cinematic reel dropped from 25.9 minutes to 10.4 minutes per 720p clip. Every component in this stack operates under permissive licensing (Apache 2.0 or MIT), removing commercial deployment barriers that plague proprietary alternatives.
WOW Moment: Key Findings
The architectural shift from distributed parallelism to sequential HBM-optimized execution yields measurable advantages across infrastructure, development velocity, and output consistency.
| Approach | VRAM Overhead | Setup Complexity | End-to-End Latency (720p) | Licensing Flexibility |
|---|---|---|---|---|
| Multi-GPU Distributed | High (replicated caches, NCCL overhead) | Complex (node sync, network routing, dependency hell) | 25–30 min | Fragmented (mixed commercial/open) |
| Single MI300X Sequential | Minimal (strict eviction, dual-role sharing) | Low (single process tree, unified config) | 10.4 min | Unified (Apache 2.0/MIT) |
This finding matters because it decouples creative iteration from infrastructure provisioning. Developers can prototype, debug, and deploy high-fidelity video pipelines on a single workstation or cloud instance without negotiating cross-node communication or managing distributed training artifacts. The sequential approach also naturally enforces deterministic execution order, making failure isolation and quality gating significantly easier to implement.
Core Solution
Building a reliable single-GPU cinematic pipeline requires treating memory as a finite resource and models as interchangeable workers. The architecture follows a strict linear progression with explicit lifecycle boundaries.
1. Prompt Decomposition & Scene Planning
A 35B Mixture-of-Experts model acts as the narrative director. It parses a single input sentence into a shot list, character descriptions, camera directions, and locale metadata. To prevent memory fragmentation, this component runs in an isolated subprocess. When the subprocess terminates, the operating system reclaims the entire memory footprint, guaranteeing a clean slate for the next phase.
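A minimal sketch of that boundary, assuming a hypothetical director_worker.py entry point that loads the 35B checkpoint, writes the shot list as JSON, and exits:
import json
import os
import subprocess
import tempfile
from pathlib import Path

def plan_shots(prompt: str) -> dict:
    """Run the narrative director in an isolated process; all memory it held is
    returned to the OS the moment the child exits."""
    fd, out_path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    # director_worker.py (hypothetical) loads the 35B MoE checkpoint and expands the
    # prompt into shots, characters, camera directions, and locale metadata.
    subprocess.run(
        ["python", "director_worker.py", "--prompt", prompt, "--out", out_path],
        check=True,
    )
    return json.loads(Path(out_path).read_text())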
2. Keyframe Generation & Identity Anchoring
Instead of training per-character LoRA adapters (which requires dataset curation, ~90 minutes of compute per character, and introduces style drift), the pipeline leverages reference editing. A master portrait is generated once per character. Subsequent keyframes condition on this reference using FLUX.2 klein's built-in attention routing. This preserves facial structure, clothing, and proportions across shots without fine-tuning. The 4B diffusion model generates keyframes in under a second each, replacing slower alternatives that previously bottlenecked the pipeline.
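A sketch of the identity-anchoring flow; it reuses the flux_model.generate(reference, text) call from the pipeline code below, and the per-character portrait cache plus the None reference on the first pass are assumptions:
def build_keyframes(flux_model, characters: dict, shots: list) -> list:
    """Generate one master portrait per character, then condition every keyframe on it."""
    portraits = {}
    for name, description in characters.items():
        # Master portrait generated once, with no reference image on the first pass.
        portraits[name] = flux_model.generate(None, description)
    keyframes = []
    for shot in shots:
        # Reference editing routes attention to the portrait, preserving face,
        # clothing, and proportions without any per-character fine-tuning.
        keyframes.append(flux_model.generate(portraits[shot["character"]], shot["text"]))
    return keyframes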
3. Motion Synthesis & Flow Control
Image-to-video conversion uses a 14B MoE transformer. Motion intensity is controlled via flow shift parameters: 5.0 for hero shots requiring subtle, cinematic movement, and 8.0 for background or establishing shots where broader motion is acceptable. Early experiments with 12.0 introduced plastic skin artifacts and temporal inconsistency, establishing a hard upper bound for production use.
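A small helper capturing that policy; the shot-type key mirrors the schema used in the pipeline code later in this article:
def select_flow_shift(shot_type: str) -> float:
    """Map shot type to motion intensity; values above 8.0 produced artifacts in testing."""
    FLOW_SHIFT_HERO = 5.0    # subtle, cinematic movement for hero shots
    FLOW_SHIFT_BROLL = 8.0   # broader motion for background and establishing shots
    FLOW_SHIFT_MAX = 8.0     # 12.0 caused plastic skin and temporal inconsistency
    shift = FLOW_SHIFT_HERO if shot_type == "hero" else FLOW_SHIFT_BROLL
    return min(shift, FLOW_SHIFT_MAX)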
4. Quality Gating & Automated Retry
A vision critic evaluates every rendered clip before it advances to audio layering. The same 35B checkpoint used for direction is reloaded with a diagnostic system prompt, saving approximately 70 GB of VRAM that would otherwise be consumed by a separate model instance. The critic scores each clip across four axes (composition, motion coherence, character consistency, lighting fidelity) on a 1–10 scale. Clips scoring below 7 trigger an automated retry loop (maximum 3 attempts).
The critic returns one of ten enumerated failure labels:
- CHARACTER_DRIFT → strengthens reference conditioning
- EXTRAS_INVADE_FRAME → tightens prompt negative constraints
- CAMERA_IGNORED → simplifies motion verbs in the prompt
- WALKING_BACKWARDS → enforces temporal direction flags
- OBJECT_MORPHING → reduces flow shift by 1.0
- HAND_FINGER_ARTIFACT → adds explicit anatomy constraints
- WARDROBE_DRIFT → re-injects clothing descriptors
- NEON_GLOW_LEAK → adjusts lighting temperature
- STYLIZED_AI_LOOK → increases realism weighting
- RANDOM_INTIMACY → resets character proximity parameters
This routing adds ~30% wall time but dramatically reduces manual revision cycles.
5. Audio Layering & Locale Alignment
Music generation uses a 3.5B model conditioned on scene mood and pacing. Voice-over leverages an 82M TTS engine supporting nine languages. The director automatically selects the narration language based on locale metadata (e.g., Tokyo → Japanese, Paris → French). Per-shot WAV files are aligned to clip start offsets using ffmpeg's adelay filter, then mixed with the music bed at -18 LUFS to maintain broadcast-safe dynamic range.
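A sketch of that alignment and mix step, driving ffmpeg's adelay, amix, and loudnorm filters from Python; file names and offsets are illustrative:
import subprocess

def mix_audio(music_path: str, voice_clips: list[tuple[str, float]], out_path: str) -> None:
    """Delay each voice-over WAV to its clip start offset, mix it over the music bed,
    and normalize toward -18 LUFS for a broadcast-safe dynamic range."""
    inputs = ["-i", music_path]
    filters = []
    labels = ["[0:a]"]
    for idx, (wav_path, offset_s) in enumerate(voice_clips, start=1):
        inputs += ["-i", wav_path]
        delay_ms = int(offset_s * 1000)
        if delay_ms > 0:
            filters.append(f"[{idx}:a]adelay={delay_ms}|{delay_ms}[v{idx}]")
            labels.append(f"[v{idx}]")
        else:
            labels.append(f"[{idx}:a]")
    filters.append(f"{''.join(labels)}amix=inputs={len(labels)}:duration=longest,loudnorm=I=-18[out]")
    cmd = ["ffmpeg", "-y", *inputs, "-filter_complex", ";".join(filters), "-map", "[out]", out_path]
    subprocess.run(cmd, check=True)

# Example: music bed plus two narration clips starting at 0 s and 12.5 s.
# mix_audio("music.wav", [("shot_01_vo.wav", 0.0), ("shot_02_vo.wav", 12.5)], "reel_audio.wav")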
Implementation Architecture
import subprocess
import torch
import gc
from pathlib import Path
from typing import Dict, List, Optional
class SequentialModelManager:
def __init__(self, device: str = "cuda:0"):
self.device = device
self.active_model: Optional[torch.nn.Module] = None
self.model_registry: Dict[str, Path] = {}
def load_model(self, model_id: str, loader_fn):
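        # Evict whatever is currently resident before loading the next stage.
        # Note: callers must not keep their own references to the previous model,
        # or the HBM it occupies will never actually be released.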
self._evict_current()
self.active_model = loader_fn(model_id, self.device)
return self.active_model
def _evict_current(self):
if self.active_model is not None:
del self.active_model
self.active_model = None
gc.collect()
torch.cuda.empty_cache()
class VisionGate:
FAILURE_THRESHOLDS = {
"CHARACTER_DRIFT": "reference_strength",
"CAMERA_IGNORED": "motion_complexity",
"OBJECT_MORPHING": "flow_shift",
"HAND_FINGER_ARTIFACT": "anatomy_weight",
"WARDROBE_DRIFT": "clothing_descriptor",
"EXTRAS_INVADE_FRAME": "negative_prompt",
"WALKING_BACKWARDS": "temporal_direction",
"NEON_GLOW_LEAK": "lighting_temp",
"STYLIZED_AI_LOOK": "realism_weight",
"RANDOM_INTIMACY": "proximity_constraint"
}
def evaluate(self, clip_tensor: torch.Tensor, critic_model) -> Dict:
scores = critic_model.score_clip(clip_tensor, axes=["composition", "motion", "identity", "lighting"])
overall = sum(scores.values()) / len(scores)
if overall < 7.0:
failure_label = critic_model.diagnose_failure(clip_tensor)
return {"pass": False, "score": overall, "failure": failure_label}
return {"pass": True, "score": overall}
class PipelineEngine:
def __init__(self, config: Dict):
self.mem_mgr = SequentialModelManager()
self.vision_gate = VisionGate()
self.config = config
self.output_dir = Path(config["output_path"])
def execute_shot(self, shot_prompt: Dict) -> Path:
# Phase 1: Keyframes
        flux_model = self.mem_mgr.load_model("FLUX2_KLEIN", self._load_flux)
        keyframe = flux_model.generate(shot_prompt["reference"], shot_prompt["text"])
        del flux_model  # drop the local reference so the next load_model call can actually free it
# Phase 2: Motion
wan_model = self.mem_mgr.load_model("WAN2_2_I2V", self._load_wan)
        # Store the flow shift on the prompt so retry adjustments (e.g. OBJECT_MORPHING) take effect
        flow_shift = shot_prompt.setdefault("flow_shift", 5.0 if shot_prompt["type"] == "hero" else 8.0)
        raw_clip = wan_model.synthesize(keyframe, flow_shift=flow_shift)
# Phase 3: Quality Gate
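        # wan_model stays referenced through the retry loop so failed clips can be
        # re-synthesized without a reload; the 14B motion model and the 35B critic
        # fit alongside each other in the 192 GB HBM pool.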
critic_model = self.mem_mgr.load_model("QWEN35_VISION", self._load_critic)
attempts = 0
while attempts < 3:
result = self.vision_gate.evaluate(raw_clip, critic_model)
if result["pass"]:
break
            shot_prompt = self._apply_retry_strategy(shot_prompt, result["failure"])
            # Re-synthesize with the adjusted parameters (e.g. a reduced flow shift)
            raw_clip = wan_model.synthesize(keyframe, flow_shift=shot_prompt["flow_shift"])
            attempts += 1
        del wan_model, critic_model  # release both stages before the audio phase begins
# Phase 4: Audio & Assembly
audio_path = self._compose_audio(shot_prompt)
final_clip = self._mux_video_audio(raw_clip, audio_path)
return final_clip
def _apply_retry_strategy(self, prompt: Dict, failure: str) -> Dict:
strategy = VisionGate.FAILURE_THRESHOLDS.get(failure)
if strategy == "flow_shift":
prompt["flow_shift"] = max(3.0, prompt.get("flow_shift", 5.0) - 1.0)
elif strategy == "reference_strength":
prompt["ref_weight"] = min(1.0, prompt.get("ref_weight", 0.7) + 0.15)
return prompt
The architecture prioritizes memory safety over parallelism. Each model is loaded, executed, and explicitly destroyed before the next phase begins. The vision critic shares weights with the director, eliminating redundant VRAM allocation. Retry logic is deterministic and parameter-driven, avoiding black-box regeneration.
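A minimal driver showing how the pieces compose, assuming the loader and audio helper methods referenced above are implemented; the shot fields follow the schema execute_shot expects:
if __name__ == "__main__":
    config = {"output_path": "renders/"}
    engine = PipelineEngine(config)
    shot = {
        "type": "hero",
        "text": "A lone astronaut walks through a neon-lit Tokyo alley at dusk",
        "reference": "characters/astronaut_master.png",
    }
    clip_path = engine.execute_shot(shot)
    print(f"Rendered clip written to {clip_path}")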
Pitfall Guide
1. FP8 Cross-Attention Segfaults
Explanation: Quantizing the image-to-video transformer to FP8 using AITER triggers a segmentation fault during cross-attention computation (M=512, K=4096, N=5120). The kernel fails to handle the specific tensor layout in the full pipeline graph.
Fix: Stick to BF16 for the motion synthesis phase. The performance penalty is negligible compared to pipeline crashes, and BF16 maintains temporal consistency across frames.
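A sketch of pinning the motion stage to BF16, assuming a diffusers-style WanImageToVideoPipeline loader; the checkpoint identifier is illustrative:
import torch
from diffusers import WanImageToVideoPipeline  # assumed loader for the 14B I2V model

# FP8 (AITER) segfaulted in cross-attention; BF16 is stable and preserves
# temporal consistency across frames.
pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-I2V-A14B-Diffusers",  # illustrative checkpoint id
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda:0")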
2. Full-Graph Compilation Instability
Explanation: Enabling torch.compile(mode="max-autotune", fullgraph=True) on dual-expert MoE transformers causes Dynamo tracing errors. The dynamic routing between experts breaks static graph assumptions.
Fix: Compile only the second transformer block (torch.compile(transformer_2)). This yields a 1.2× speedup while avoiding graph capture failures. Leave the first block in eager mode.
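A sketch of the selective compilation on a pipeline object exposing both transformer blocks; nothing beyond torch.compile itself is assumed here:
import torch

def compile_motion_transformer(pipe):
    """Compile only the second MoE transformer block; the first stays in eager mode
    so dynamic expert routing does not break Dynamo graph capture."""
    pipe.transformer_2 = torch.compile(pipe.transformer_2)  # ~1.2x speedup observed
    return pipe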
3. Tensor Rank Mismatch with channels_last
Explanation: The image-to-video transformer operates on rank-5 tensors (batch, frames, channels, height, width). PyTorch's channels_last memory format only supports rank-4 tensors. Forcing the format causes silent dimension misalignment.
Fix: Disable channels_last entirely for video transformers. Use contiguous memory layouts and rely on ROCm's optimized convolution kernels instead.
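A short guard illustrating the rank constraint: channels_last is a rank-4 memory format, so rank-5 video tensors are kept contiguous instead:
import torch

def ensure_video_layout(latents: torch.Tensor) -> torch.Tensor:
    """Video tensors are rank-5 (batch, frames, channels, height, width);
    channels_last only applies to rank-4 tensors, so fall back to contiguous."""
    if latents.dim() == 4:
        return latents.to(memory_format=torch.channels_last)
    return latents.contiguous()  # rank-5: plain contiguous layout for ROCm kernels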
4. Calibration Counter Drift in MagCache
Explanation: MagCache's calibration counter fails to trigger on dual-transformer schedules because the diffusers 0.38 implementation expects a single forward pass. The counter never increments, disabling cache optimization.
Fix: Bypass MagCache for this architecture. Use ParaAttention FBCache (0.05 threshold) instead, which provides a lossless 2.0× speedup without calibration dependencies.
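A sketch of enabling first-block caching, assuming para-attn's diffusers adapter entry point; the 0.05 threshold matches the value used here:
from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe  # assumed API

def enable_fbcache(pipe):
    """Skip redundant transformer work when consecutive denoising steps produce
    near-identical first-block residuals; 0.05 was effectively lossless in this pipeline."""
    apply_cache_on_pipe(pipe, residual_diff_threshold=0.05)
    return pipe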
5. Identity Drift via LoRA Over-Engineering
Explanation: Training per-character LoRA adapters introduces style contamination and requires dataset curation. The additional compute time (~90 minutes per character) delays iteration without guaranteeing consistency across shots.
Fix: Use reference editing with FLUX.2 klein. Inject a master portrait into the attention layers for every keyframe. Identity remains stable across shots with zero training overhead.
6. VRAM Fragmentation from Incomplete Cache Clearing
Explanation: Calling del model without explicit garbage collection and CUDA cache eviction leaves fragmented memory pools. Subsequent model loads fail with OOM errors despite sufficient total VRAM.
Fix: Always pair model deletion with gc.collect() and torch.cuda.empty_cache(). Run the director in a subprocess to guarantee OS-level memory reclamation on exit.
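A sketch pairing the eviction calls with the allocator flag referenced in the configuration template; whether your ROCm build of PyTorch reads PYTORCH_CUDA_ALLOC_CONF or a HIP-specific variable should be verified against its documentation:
import gc
import os
import torch

# Must be set before the first device allocation; expandable segments let the
# allocator grow mappings instead of fragmenting the pool between large loads.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

model = torch.nn.Linear(1024, 1024).to("cuda:0")  # stand-in for a multi-billion-parameter stage
# ... run the phase ...
del model                  # drop every Python reference to the module
gc.collect()               # break lingering reference cycles
torch.cuda.empty_cache()   # return cached blocks so the next load sees contiguous VRAM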
7. Critic Threshold Misconfiguration
Explanation: Setting the quality gate threshold too high (e.g., 8.5) causes infinite retry loops. Setting it too low (e.g., 5.0) allows artifacts to propagate to audio mixing.
Fix: Use 7.0 as the baseline threshold. Tune per-project based on the target distribution platform. Log all failure labels to identify systematic prompt weaknesses.
Production Bundle
Action Checklist
- Verify HBM capacity: Ensure ≥192 GB available for sequential 35B/14B/4B model stacking
- Isolate director execution: Run narrative planning in a subprocess to guarantee memory reclamation
- Enable dual-role loading: Reuse the 35B checkpoint for both direction and vision criticism
- Configure flow shifts: Set 5.0 for hero shots, 8.0 for background; never exceed 8.0
- Disable FP8 quantization: Use BF16 for motion synthesis to prevent cross-attention segfaults
- Compile selectively: Apply torch.compile only to the second transformer block
- Set critic threshold: Use 7.0 baseline with max 3 retry attempts per shot
- Mix audio at -18 LUFS: Ensure broadcast-safe dynamic range before final export
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Rapid prototyping (<10 clips) | Single MI300X sequential | Zero infrastructure setup, deterministic execution | Low (single instance) |
| High-volume batch (>100 clips) | Multi-GPU distributed | Parallelism reduces wall time despite sync overhead | High (cluster provisioning) |
| Commercial deployment | Single MI300X + Apache 2.0 models | Unified licensing, no royalty tracking, predictable scaling | Medium (hardware amortization) |
| Real-time interactive generation | Not recommended | Sequential loading introduces 2–4 s phase transitions | N/A |
Configuration Template
pipeline:
device: "cuda:0"
precision: "bf16"
output_resolution: "720p"
max_retries: 3
quality_threshold: 7.0
models:
director:
checkpoint: "Qwen/Qwen3.5-35B-MoE"
role: "narrative_planner"
subprocess: true
critic:
checkpoint: "Qwen/Qwen3.5-35B-MoE"
role: "vision_auditor"
prompt_template: "diagnostic_system.txt"
keyframe:
checkpoint: "FLUX.2/klein-4B"
method: "reference_editing"
ref_weight: 0.75
motion:
checkpoint: "Wan2.2/14B-I2V"
flow_shift_hero: 5.0
flow_shift_broll: 8.0
compile_target: "transformer_2"
music:
checkpoint: "ACE-Step/3.5B"
mood_mapping: "auto"
voice:
checkpoint: "Kokoro/82M"
languages: ["en", "ja", "fr", "hi", "es", "de", "zh", "ko", "pt"]
lufs_target: -18
optimization:
paraattention_fbcache: 0.05
  rocm_flags:
- "hipBLASLt=1"
- "expandable_segments=1"
- "MIOpen_Fast=1"
memory_management:
gc_collect: true
cuda_empty_cache: true
subprocess_isolation: true
Quick Start Guide
- Provision hardware: Allocate an AMD Instinct MI300X instance with ≥192 GB HBM3 and ROCm 6.2+ drivers installed.
- Install dependencies: Pull the pipeline repository, create a Python 3.11 virtual environment, and install PyTorch 2.4+ with ROCm wheels.
- Configure paths: Update the YAML template with your checkpoint directories, output folder, and preferred critic threshold.
- Run first generation: Execute python run_pipeline.py --input "A lone astronaut walks through a neon-lit Tokyo alley at dusk" and monitor VRAM usage with rocm-smi. Expect 10–12 minutes for a complete 30-second reel.
