AI/ML · 2026-05-13 · 87 min read

Two engines for AI slide decks: HTML output vs gpt-image-2 (and how we solved CJK rendering)

By 汪小春

Dual-Rendering Architectures for Multilingual AI Presentations: Balancing Fidelity and Editability

Current Situation Analysis

AI-powered presentation generators have rapidly matured, but they share a critical blind spot: multilingual typography. Most production pipelines follow a linear pattern. An LLM extracts or generates structured content, a template engine applies styling, and the output is delivered as HTML or a document format. This workflow performs predictably for Latin-script languages because text rendering relies on well-established font stacks, CSS layout engines, and predictable glyph metrics.

When the pipeline encounters CJK (Chinese, Japanese, Korean) character sets, the architecture fractures. The failure manifests in two distinct ways depending on the rendering strategy:

  1. DOM/HTML Rendering: Browsers resolve missing glyphs through system font fallback chains. When a presentation template specifies a custom sans-serif or display font, that font rarely contains comprehensive CJK coverage. The browser silently substitutes missing characters with a fallback font (often Noto Sans CJK or a system default). The result is typographic fragmentation: headlines render in the designed typeface, while body text or CJK phrases shift to a visually disjointed fallback. Consistency breaks, line heights drift, and export quality degrades.
  2. Generative Image Rendering: Skipping the DOM entirely and prompting an image model to render slides introduces a different failure mode. Historically, diffusion and transformer-based image models hallucinate non-Latin scripts. They produce visually plausible character shapes that fail Unicode validation, mix radicals incorrectly, or generate entirely fictional glyphs. The output looks like text but carries no semantic or typographic accuracy.

This gap is frequently overlooked because product teams optimize for English-first UX, assuming that CSS font-family declarations or prompt engineering will bridge the multilingual divide. In practice, neither approach scales. Font fallbacks cannot guarantee visual consistency across operating systems, and general-purpose image models lack the character-level precision required for professional CJK typography. The technical debt compounds when teams force a single rendering pipeline to handle both editable text and high-fidelity visual output.
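Detecting when a slide actually needs the CJK-safe path is cheap to do up front. A minimal sketch, assuming nothing beyond the JavaScript regex engine; the `containsCjk` name and the chosen Unicode blocks are illustrative, not from a specific library:

```typescript
// Rough CJK detection via common Unicode blocks: Han ideographs,
// Hiragana, Katakana, and Hangul syllables. Illustrative, not
// exhaustive (extension blocks and half-width forms are omitted).
const CJK_PATTERN = /[\u4E00-\u9FFF\u3040-\u309F\u30A0-\u30FF\uAC00-\uD7AF]/;

function containsCjk(text: string): boolean {
  return CJK_PATTERN.test(text);
}
```

A router can run this over slide titles and body text to catch CJK content even when the upstream language tag says `en` or is missing.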

WOW Moment: Key Findings

The architectural breakthrough comes from recognizing that no single rendering engine satisfies both editability and typographic fidelity across language boundaries. Running parallel rendering paths and routing content dynamically resolves the trade-off. The following comparison illustrates why a unified engine fails and why dual-path routing becomes necessary for production-grade multilingual tools.

| Approach | Render Latency | CJK Glyph Accuracy | Post-Generation Editability | Export Format Flexibility |
| --- | --- | --- | --- | --- |
| HTML/CSS Path | ~200-400ms per slide | Acceptable but inconsistent (font fallback fragmentation) | Full text editing, live DOM manipulation | Lightweight, vector-based, PPTX text layers preserved |
| gpt-image-2 Path | ~1.5-2.0s per slide (~5x slower) | High precision, consistent typography, no fallback gaps | None (static raster output) | Image-per-slide PPTX, limited vector editing |

This data reveals a fundamental constraint: speed and editability trade directly against typographic accuracy and visual freedom. The HTML path enables rapid iteration and downstream editing but sacrifices CJK consistency. The gpt-image-2 path (released April 2026) delivers production-ready CJK rendering and unlimited visual composition but locks content into static images and introduces significant latency. Neither engine alone covers the full product spectrum. Routing content to the appropriate renderer based on language, complexity, and user intent becomes the only viable architecture for multilingual AI slide generation.

Core Solution

The dual-engine architecture replaces the monolithic rendering pipeline with a content-aware router. The system ingests LLM-extracted slide specifications, evaluates metadata (language, layout complexity, editability requirements), and dispatches each slide to the appropriate rendering engine. Below is the implementation strategy, followed by production-grade TypeScript code.

Architecture Overview

  1. Content Extraction: An LLM parses source material (articles, PDFs, transcripts) and outputs a structured JSON schema containing slide titles, body text, language tags, and layout hints.
  2. Routing Decision Engine: A lightweight classifier evaluates each slide's metadata. Criteria include detected language, presence of CJK characters, required visual complexity, and user preference flags.
  3. Parallel Render Dispatch: Slides are batched and sent to either the DOM renderer or the image synthesis API. The router maintains a unified slide index to preserve deck ordering.
  4. Assembly & Export: Rendered outputs are merged into a single presentation container. HTML slides remain editable; image slides are embedded as high-resolution assets. Export handlers normalize the mixed format into PPTX or PDF.

Implementation (TypeScript)

The following implementation demonstrates a production-ready router with dual renderers. The renderers themselves are simplified stand-ins: swap in your real template engine and image API client for production use.

// types.ts
export interface SlideSpec {
  id: string;
  title: string;
  content: string;
  language: 'en' | 'zh' | 'ja' | 'ko' | 'mixed';
  requiresEditability: boolean;
  layoutHint: 'text-heavy' | 'visual-heavy' | 'chart';
}

export interface RenderResult {
  slideId: string;
  engine: 'dom' | 'image';
  output: string; // HTML string or base64 image data
  metadata: Record<string, unknown>;
}

export type RenderEngine = (spec: SlideSpec) => Promise<RenderResult>;

// router.ts
import { SlideSpec, RenderResult, RenderEngine } from './types';

export class PresentationRouter {
  private domRenderer: RenderEngine;
  private imageRenderer: RenderEngine;
  private defaultEngine: 'dom' | 'image';

  constructor(domEngine: RenderEngine, imageEngine: RenderEngine, defaultEngine: 'dom' | 'image' = 'dom') {
    this.domRenderer = domEngine;
    this.imageRenderer = imageEngine;
    this.defaultEngine = defaultEngine;
  }

  public async routeAndRender(slides: SlideSpec[]): Promise<RenderResult[]> {
    const tasks = slides.map(async (slide) => {
      const selectedEngine = this.selectEngine(slide);
      const executor = selectedEngine === 'dom' ? this.domRenderer : this.imageRenderer;
      return executor(slide);
    });

    return Promise.all(tasks);
  }

  private selectEngine(slide: SlideSpec): 'dom' | 'image' {
    // Priority 1: CJK content routes to the image engine for glyph fidelity
    if (slide.language === 'zh' || slide.language === 'ja' || slide.language === 'ko') {
      return 'image';
    }
    // Priority 2: Visual-heavy layouts benefit from image synthesis
    if (slide.layoutHint === 'visual-heavy' || slide.layoutHint === 'chart') {
      return 'image';
    }
    // Priority 3: Slides flagged for post-generation editing stay on the DOM path
    if (slide.requiresEditability) {
      return 'dom';
    }
    // Priority 4: Fall back to the configured default engine
    return this.defaultEngine;
  }
}

// engines.ts
import { SlideSpec, RenderResult, RenderEngine } from './types';

export const createDomRenderer = (): RenderEngine => {
  return async (spec: SlideSpec): Promise<RenderResult> => {
    // Simulates Slidev/Reveal.js template compilation.
    // Production code should escape title/content before interpolating into markup.
    const html = `
      <section data-id="${spec.id}">
        <h1>${spec.title}</h1>
        <p>${spec.content}</p>
      </section>
    `;
    return {
      slideId: spec.id,
      engine: 'dom',
      output: html,
      metadata: { renderTime: Date.now(), editable: true }
    };
  };
};

export const createImageRenderer = (apiKey: string): RenderEngine => {
  return async (spec: SlideSpec): Promise<RenderResult> => {
    // Simulates gpt-image-2 API call with structured prompt
    const prompt = `Slide layout: ${spec.layoutHint}. Title: "${spec.title}". Content: "${spec.content}". Style: professional presentation.`;
    
    // In production, this calls OpenAI's image generation endpoint
    const base64Image = await simulateImageGeneration(prompt);
    
    return {
      slideId: spec.id,
      engine: 'image',
      output: base64Image,
      metadata: { renderTime: Date.now(), editable: false, model: 'gpt-image-2' }
    };
  };
};

// Utility for simulation
async function simulateImageGeneration(prompt: string): Promise<string> {
  await new Promise(res => setTimeout(res, 1500)); // ~5x latency simulation
  return `data:image/png;base64,${Buffer.from(prompt).toString('base64')}`;
}
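Wiring this together looks like the snippet below. So that it runs standalone, the slide type and a condensed version of the routing policy are inlined here rather than imported from the modules above; the sample deck contents are invented for illustration.

```typescript
interface SlideSpec {
  id: string;
  title: string;
  content: string;
  language: 'en' | 'zh' | 'ja' | 'ko' | 'mixed';
  requiresEditability: boolean;
  layoutHint: 'text-heavy' | 'visual-heavy' | 'chart';
}

// Condensed routing policy: CJK first, then visual complexity, then default.
function selectEngine(slide: SlideSpec, defaultEngine: 'dom' | 'image' = 'dom'): 'dom' | 'image' {
  if (['zh', 'ja', 'ko'].includes(slide.language)) return 'image';
  if (slide.layoutHint === 'visual-heavy' || slide.layoutHint === 'chart') return 'image';
  return defaultEngine;
}

const deck: SlideSpec[] = [
  { id: 's1', title: 'Agenda', content: 'Intro, demo, Q&A', language: 'en', requiresEditability: true, layoutHint: 'text-heavy' },
  { id: 's2', title: '季度回顾', content: '收入与增长', language: 'zh', requiresEditability: false, layoutHint: 'text-heavy' },
  { id: 's3', title: 'Revenue', content: 'Q3 trend', language: 'en', requiresEditability: false, layoutHint: 'chart' },
];

// s1 stays editable on the DOM path; s2 (CJK) and s3 (chart) go to the image engine.
const plan = deck.map((s) => ({ id: s.id, engine: selectEngine(s) }));
```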

Architecture Decisions & Rationale

  • Per-Slide Routing Over Per-Deck Selection: Deck-level engine selection forces the user to choose one engine for an entire deck, and this granularity fails in practice. A single deck often contains mixed requirements: an agenda slide needs live editing, while a CJK headline or complex chart demands typographic precision. Routing at the slide level preserves editability where it matters and enforces fidelity where it's required.
  • Metadata-Driven Classification: Language detection alone is insufficient. Layout hints, content density, and explicit user flags provide a more robust routing signal. The classifier weights CJK presence highest, followed by visual complexity, then falls back to user preference.
  • Parallel Execution: Rendering engines operate independently. The router batches slides and executes them concurrently, reducing total deck generation time. DOM slides resolve quickly; image slides queue asynchronously. A unified promise resolver maintains deck order regardless of individual render latency.
  • Export Normalization: Mixed-format decks require careful PPTX composition. Text slides export as editable PowerPoint shapes; image slides export as embedded raster assets. The export handler must preserve slide indices, apply consistent master layouts, and inject accessibility metadata for image-based content.

Pitfall Guide

1. Assuming CSS Font Fallback Solves CJK Consistency

Explanation: Declaring font-family: 'CustomSans', 'Noto Sans CJK', sans-serif; does not guarantee visual harmony. Browsers apply fallbacks at the character level, causing mid-line font switches. Line heights, letter spacing, and baseline alignment shift, breaking layout precision. Fix: Preload a comprehensive CJK font stack via @font-face with font-display: swap. Alternatively, route CJK-heavy slides to the image engine to bypass DOM font resolution entirely.
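One way to make preloading systematic is to generate the preload tags from the same config that drives the font stack. A sketch under assumed conventions: the font file paths are placeholders, and `buildFontPreloadTags` is a hypothetical helper, not a library API.

```typescript
// Emits <link rel="preload"> tags for a CJK-capable font stack so the
// browser fetches full-coverage fonts before first paint, instead of
// discovering them mid-render via fallback. URLs are placeholders:
// point them at your own hosted WOFF2 files.
const CJK_FONT_FILES = [
  '/fonts/NotoSansSC-Regular.woff2',
  '/fonts/NotoSansJP-Regular.woff2',
  '/fonts/NotoSansKR-Regular.woff2',
];

function buildFontPreloadTags(files: string[] = CJK_FONT_FILES): string {
  return files
    .map((href) => `<link rel="preload" href="${href}" as="font" type="font/woff2" crossorigin>`)
    .join('\n');
}
```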

2. Hardcoding Engine Selection at the Deck Level

Explanation: Forcing users to pick one renderer for the entire presentation ignores the heterogeneous nature of slide content. Text-heavy slides lose editability; image-heavy slides become impossible to correct post-generation. Fix: Implement slide-level routing with explicit override flags. Allow users to lock specific slides to a renderer while leaving others on auto-detection.

3. Ignoring Latency Disparity in Synchronous UI Flows

Explanation: The image synthesis path runs approximately 5x slower than DOM rendering. Blocking the UI thread or waiting for all slides to render before displaying a preview creates a degraded user experience. Fix: Stream renders incrementally. Display DOM slides immediately, then progressively replace placeholders with image outputs as they resolve. Implement skeleton loaders and optimistic UI updates.
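The streaming pattern can be sketched in a few lines: dispatch everything at once, surface each slide the moment it resolves, but still return results in deck order for assembly. The `onSlide` callback and `Rendered` shape here are illustrative, not part of the router API above.

```typescript
type Rendered = { slideId: string; output: string };

// Dispatch all renders concurrently. onSlide fires in completion order
// (fast DOM slides appear first, replacing placeholders), while the
// returned array preserves deck order regardless of individual latency,
// because Promise.all resolves in input order.
async function renderIncrementally(
  slideIds: string[],
  render: (id: string) => Promise<Rendered>,
  onSlide: (r: Rendered) => void,
): Promise<Rendered[]> {
  const tasks = slideIds.map(async (id) => {
    const result = await render(id);
    onSlide(result); // drive the UI placeholder swap here
    return result;
  });
  return Promise.all(tasks);
}
```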

4. Mishandling PPTX Export with Mixed Render Types

Explanation: Exporting a deck that contains both editable text slides and image slides requires careful binary composition. Naive implementations either rasterize the entire deck (losing editability) or fail to embed images correctly, breaking PowerPoint compatibility. Fix: Use a PPTX library that supports mixed content types. Map DOM slides to TextBody and Shape elements, and image slides to ImagePart references. Maintain a consistent slide master and apply uniform margins.

5. Over-Prompting the Image Model with Layout Instructions

Explanation: Supplying verbose CSS-like layout descriptions to gpt-image-2 degrades output quality. Image models respond better to structural cues and content hierarchy than to pixel-perfect positioning instructions. Fix: Simplify prompts to content + style + layout category. Let the model handle internal composition. Example: "Professional slide, title and two bullet points, clean white background, corporate style."
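A small prompt builder can enforce this discipline structurally: the spec is collapsed into content + style + layout category, with no room for pixel-level instructions. The exact phrasing and layout cues below are illustrative assumptions, not tuned prompts.

```typescript
interface PromptInput {
  title: string;
  bullets: string[];
  layout: 'text-heavy' | 'visual-heavy' | 'chart';
  style: string;
}

// Maps each layout category to a short structural cue and joins it with
// content and style. Deliberately omits positions, sizes, and CSS-like
// directives, letting the model handle internal composition.
function buildSlidePrompt(input: PromptInput): string {
  const layoutCue = {
    'text-heavy': 'title and bullet points',
    'visual-heavy': 'large hero visual with short caption',
    chart: 'single chart with title',
  }[input.layout];
  return `Professional slide, ${layoutCue}, ${input.style}. Title: "${input.title}". Points: ${input.bullets.join('; ')}.`;
}
```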

6. Neglecting Accessibility for Image-Rendered Slides

Explanation: Rasterized slides are invisible to screen readers and break keyboard navigation. Presentations exported as images fail WCAG compliance and limit downstream usability. Fix: Attach structured metadata to each image slide. Inject alt text, slide titles, and content summaries into the PPTX or HTML container. Provide a parallel text-only export option for accessibility compliance.
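Attaching that metadata is mechanical once the slide spec is structured. A sketch; the `SlideA11y` shape and the truncation length are arbitrary choices for illustration:

```typescript
interface SlideA11y {
  title: string;
  altText: string;
  summary: string;
}

// Derives screen-reader metadata for a rasterized slide so the PPTX or
// HTML container can carry alt text and a text summary alongside the
// image. Long content is truncated to keep alt text usable.
function buildSlideA11y(title: string, content: string, maxSummary = 160): SlideA11y {
  const summary = content.length > maxSummary ? `${content.slice(0, maxSummary - 1)}…` : content;
  return { title, summary, altText: `Slide: ${title}. ${summary}` };
}
```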

7. Failing to Implement Graceful Degradation for API Rate Limits

Explanation: Image generation APIs enforce strict rate limits and quota thresholds. A sudden spike in deck generation requests can trigger 429 errors, halting the entire pipeline. Fix: Implement exponential backoff with jitter, queue image requests, and fall back to the DOM renderer when quotas are exhausted. Cache frequently used slide templates to reduce redundant API calls.
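The retry-then-degrade behavior can be sketched as below. The `{ status: 429 }` error shape is an assumption (match it to your API client), and the retry counts mirror the config defaults shown later only by convention.

```typescript
// Retry an image render with exponential backoff plus full jitter; if the
// quota is still exhausted after maxRetries attempts, fall back to the
// DOM renderer so the deck still completes (editable, lower fidelity).
async function renderWithFallback<T>(
  imageRender: () => Promise<T>,
  domFallback: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await imageRender();
    } catch (err: any) {
      if (err?.status !== 429) throw err; // only retry rate-limit errors
      const delay = Math.random() * baseDelayMs * 2 ** attempt; // full jitter
      await new Promise((res) => setTimeout(res, delay));
    }
  }
  return domFallback();
}
```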

Production Bundle

Action Checklist

  • Implement slide-level routing logic with language, layout, and editability metadata
  • Preload comprehensive CJK font stacks and verify fallback behavior across target OS/browser combinations
  • Integrate gpt-image-2 API with retry logic, rate limit handling, and quota monitoring
  • Build incremental UI rendering to stream DOM slides while image slides process asynchronously
  • Design a PPTX export handler that preserves mixed content types and maintains slide order
  • Attach accessibility metadata to image-rendered slides and provide text-only export fallbacks
  • Establish monitoring for render latency, API error rates, and font fallback frequency
  • Conduct cross-platform typography testing with real CJK content before production rollout

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| English-only internal draft | HTML/CSS Path | Fast iteration, full editability, minimal API cost | Low (compute only) |
| Client-facing CJK presentation | gpt-image-2 Path | Guarantees typographic accuracy, professional visual consistency | Medium (~5x latency, API token cost) |
| Mixed-language deck with heavy charts | Per-Slide Routing | Preserves editability for text, enforces fidelity for visuals/CJK | Medium (balanced API + compute) |
| Accessibility-compliant export | HTML/CSS Path + Alt Metadata | Screen reader compatibility, WCAG alignment, searchable text | Low (metadata overhead only) |
| High-volume batch generation | HTML/CSS Path with CJK font preloading | Avoids API rate limits, scales horizontally, predictable latency | Low (CDN + compute) |

Configuration Template

// config/presentation.config.ts
import { PresentationRouter } from '../router';
import { createDomRenderer, createImageRenderer } from '../engines';

export const presentationConfig = {
  routing: {
    defaultEngine: 'dom' as const,
    cjkThreshold: 0.15, // Route to image if >15% CJK characters
    layoutOverrides: {
      'visual-heavy': 'image',
      'chart': 'image',
      'text-heavy': 'dom'
    },
    userOverrides: true // Allow manual slide-level engine selection
  },
  rendering: {
    dom: {
      templateEngine: 'slidev-compat',
      fontStack: ['Inter', 'Noto Sans SC', 'Noto Sans JP', 'Noto Sans KR', 'system-ui'],
      preloadFonts: true
    },
    image: {
      model: 'gpt-image-2',
      apiKey: process.env.OPENAI_API_KEY ?? '', // env vars are string | undefined; validate at startup
      maxRetries: 3,
      retryDelayMs: 1000,
      promptTemplate: 'Slide: {layout}. Title: "{title}". Content: "{content}". Style: {style}.'
    }
  },
  export: {
    format: 'pptx',
    mixedContentHandling: 'preserve',
    accessibility: {
      injectAltText: true,
      generateTextFallback: true
    }
  }
};

export const router = new PresentationRouter(
  createDomRenderer(),
  createImageRenderer(presentationConfig.rendering.image.apiKey),
  presentationConfig.routing.defaultEngine
);
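The `cjkThreshold` knob above implies a character-ratio check that the language-tag routing in `selectEngine` does not perform on its own, which matters for slides tagged `mixed`. A sketch of that check; the Unicode ranges are abbreviated and the helper names are illustrative:

```typescript
// Fraction of non-whitespace characters falling in common CJK blocks
// (Han, Hiragana, Katakana, Hangul). Mixed-language slides above the
// configured threshold route to the image engine.
const CJK_CHAR = /[\u4E00-\u9FFF\u3040-\u30FF\uAC00-\uD7AF]/;

function cjkRatio(text: string): number {
  const chars = [...text].filter((c) => !/\s/.test(c));
  if (chars.length === 0) return 0;
  return chars.filter((c) => CJK_CHAR.test(c)).length / chars.length;
}

function routeMixedSlide(text: string, threshold = 0.15): 'dom' | 'image' {
  return cjkRatio(text) > threshold ? 'image' : 'dom';
}
```

A ratio rather than a boolean avoids sending a mostly English slide with one CJK proper noun down the slow image path.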

Quick Start Guide

  1. Initialize the Router: Import the configuration and instantiate PresentationRouter with both renderers. Set the default engine based on your primary use case.
  2. Define Slide Metadata: Structure your LLM output to include language, layoutHint, and requiresEditability flags. This enables accurate routing without manual intervention.
  3. Execute Parallel Rendering: Pass your slide array to router.routeAndRender(). The system will dispatch slides concurrently, resolve outputs, and maintain deck order.
  4. Handle Export & Accessibility: Feed the render results into your PPTX or HTML export handler. Verify that image slides include alt text and that mixed content types render correctly in target applications.