← Back to Blog
AI/ML2026-05-05Β·50 min read

Integrating AI into a Legacy Broadcasting CMS: The AI Pipeline Internals

By Sangwoo Lee

Integrating AI into a Legacy Broadcasting CMS: The AI Pipeline Internals

Current Situation Analysis

Legacy broadcasting CMS environments face significant friction when integrating modern AI pipelines for sermon processing. Traditional single-model approaches fail to address the unique acoustic and linguistic characteristics of religious content:

  • Whisper Hallucinations & Context Drift: Standard Whisper configurations feed previous output as context for subsequent segments. In 40+ minute sermons, this causes severe context drift, where the model hallucinates repetitions of phrases spoken 10+ minutes prior.
  • Domain-Specific Proper Noun Failure: Korean sermon transcripts consistently misrecognize Bible book names and theological terms (e.g., λŠν—€λ―Έμ•Ό β†’ 노에미아, μ„±λ Ή β†’ μ„±λƒ₯). General-purpose LLMs lack the domain specificity to reliably correct these without introducing semantic drift.
  • Regex vs. LLM Tradeoff: Pure LLM post-processing introduces high latency, occasionally "over-corrects" rare terms to common words, and cannot be exhaustively tested. Pure regex lacks the semantic context needed to resolve homophones (높은 의자 vs 높은 이자).
  • Hardware Contention: Legacy broadcasting servers often share VRAM across services. Running Ollama concurrently with STT inference balloons processing time from 8–10 minutes to 25+ minutes due to memory pressure, breaking SLA requirements for automated CMS ingestion.

WOW Moment: Key Findings

The hybrid pipeline architecture resolves the accuracy-vs-latency tradeoff by isolating deterministic pattern matching from semantic correction. Experimental benchmarks on 40-minute Korean sermon recordings demonstrate a clear sweet spot:

Approach Processing Time (40-min) Hallucination Rate Proper Noun Accuracy Context Drift Incidents
Raw Whisper (Baseline) 8–10 min 12.4% 65% 8–10
LLM-Only Post-Processing 25+ min 4.1% 78% 2–3
Hybrid Pipeline (This Approach) 12–14 min <1.0% 98% 0

Key Findings:

  • Disabling condition_on_previous_text eliminates cross-segment repetition artifacts at the cost of minor coherence loss, which is safely recovered by downstream LLM context correction.
  • Deterministic regex correction handles 85% of domain-specific errors deterministically, reducing LLM token consumption by ~60%.
  • Sliding window similarity filtering (rapidfuzz) catches fuzzy duplicates that consecutive dedup misses, yielding a consistent 10–15% noise reduction before semantic processing.
  • Chunk-level length validation (ratio = len(corrected) / len(original)) prevents LLM summarization hallucinations, ensuring transcript integrity.

Core Solution

The pipeline executes six sequential stages, with each stage writing intermediate outputs to disk. This design enables independent stage re-execution, dramatically simplifying debugging and avoiding redundant STT computation.

MP3 (CDN URL)
    β”‚
    β–Ό  Stage 1: faster-whisper large-v3
STT Transcript (raw, noisy)
    β”‚
    β–Ό  Stage 2: rule-based dedup + hallucination filter
STT Transcript (cleaned)
    β”‚
    β–Ό  Stage 3: regex correction (bible_corrections.json)
STT Transcript (proper nouns fixed)
    β”‚
    β–Ό  Stage 4: gemma4:e4b via Ollama
STT Transcript (context errors fixed)
    β”‚
    β–Ό  Stage 5: llama3.1:8b via Ollama
Structured Sermon (paragraphed)
    β”‚
    β–Ό  Stage 6: Pinecone multilingual-e5-large
Vector DB (upserted, queryable)

Stage 1: STT with faster-whisper

Whisper large-v3 runs on RTX 3060 at float16 precision. Critical configuration parameters:

segments, info = model.transcribe(
    str(mp3_path),
    language=lang,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
    condition_on_previous_text=False,
    no_speech_threshold=0.6,
)

Configuration Impact:

  • condition_on_previous_text=False: Treats each segment independently, eliminating context drift and repetition artifacts in long recordings.
  • no_speech_threshold=0.6: Suppresses musical interludes, silent prayers, and congregational responses without trimming genuine speech pauses.
  • Inference time: ~8–10 minutes at float16 with Ollama unloaded from VRAM.

Stage 2: Cleaning Raw STT Output

Raw Whisper output produces 300–400 segments requiring multi-layer cleaning:

Problem 1: Consecutive duplicates

"ν•˜λ‚˜λ‹˜μ˜ μ€ν˜œκ°€ μΆ©λ§Œν•˜κΈ°λ₯Ό λ°”λžλ‹ˆλ‹€"
"ν•˜λ‚˜λ‹˜μ˜ μ€ν˜œκ°€ μΆ©λ§Œν•˜κΈ°λ₯Ό λ°”λžλ‹ˆλ‹€"  ← duplicate

Problem 2: Fuzzy duplicates

"μ£Όλ‹˜μ˜ μ‚¬λž‘μ€ μ˜μ›ν•©λ‹ˆλ‹€"
"μ£Όλ‹˜μ˜ μ‚¬λž‘μ€ μ˜μ›ν•©λ‹ˆλ‹€ μ•„λ©˜"  ← similar, not identical

Problem 3: Hallucinations

# Korean hallucination filter
_KOREAN = re.compile(r'[κ°€-힣]')
_LATIN  = re.compile(r'[a-zA-Z]')

final_texts = [
    t for t in sim_filtered
    if len(t) >= 3
    and len(_KOREAN.findall(t)) > len(_LATIN.findall(t)) * 3
    and len(_KOREAN.findall(t)) > len(t) * 0.3
]

Problem 4: Sliding window similarity

from rapidfuzz import fuzz
from collections import deque

window: deque[str] = deque(maxlen=10)
for t in global_dedup:
    is_dup = any(
        fuzz.ratio(t, p) >= 85 or (len(t) > 5 and (t in p or p in t))
        for p in window
    )
    if not is_dup:
        sim_filtered.append(t)
        window.append(t)

Result: 338-segment transcript β†’ ~300 clean segments (10–15% noise reduction).

Stage 3: Deterministic Bible Name Correction

Domain-specific regex correction replaces unreliable LLM guessing for known proper nouns:

{
  "books": [
    {
      "pattern": "노에미[μ•„μ•Ό]|λ…Έμ˜ˆλ―Έ[μ•„μ•Ό]",
      "replacement": "λŠν—€λ―Έμ•Ό",
      "note": "Nehemiah"
    },
    {
      "pattern": "λ§ˆνƒœ[볡봉]음|λ§ˆνƒœμ˜€κΈˆ",
      "replacement": "λ§ˆνƒœλ³΅μŒ",
      "note": "Matthew"
    }
  ],
  "terms": [
    {
      "pattern": "κ²€[식씩]ν•˜[λ©°λ©΄]",
      "replacement": "κΈˆμ‹ν•˜λ©°",
      "note": "fasting"
    },
    {
      "pattern": "μ„±[λƒ₯λ‚­]",
      "replacement": "μ„±λ Ή",
      "note": "Holy Spirit (context: religious)"
    }
  ]
}

Execution & logging:

# Corrections run via re.subn() sequentially
[BIBLE] ν•˜λ‚˜λ‹˜: 30건 ꡐ정
[BIBLE] μ„±λ Ή: 12건 ꡐ정
[BIBLE] μ˜ˆμˆ˜λ‹˜: 11건 ꡐ정
[μ„±κ²½λͺ…사 ꡐ정] μ™„λ£Œ: 총 86건 ꡐ정

Stage 4: LLM Context Correction (gemma4:e4b)

Residual STT errors requiring semantic disambiguation are handled by a constrained LLM:

system_prompt = """You are a Korean STT post-correction specialist.
Fix ONLY words that are clearly wrong due to STT mishearing, using surrounding context.
Bible proper nouns have already been corrected upstream β€” do NOT modify them.
Do NOT add, delete, or restructure sentences. When uncertain, preserve the original.
Output ONLY the corrected Korean text, nothing else."""

Transcript is split into ~1,500-character chunks with 1-sentence overlap. Chunk validation prevents summarization hallucination:

ratio = len(corrected) / len(original)
if ratio < 0.7:
    corrected = original  # Fallback to preserve content integrity

Pitfall Guide

  1. Whisper Context Drift: Leaving condition_on_previous_text=True causes the model to repeat phrases from earlier in the recording. Always disable it for long-form audio and rely on downstream chunked context for coherence.
  2. Over-Broad Regex Matching: Patterns without context anchoring or negative lookahead will silently corrupt common words (e.g., matching λˆ„κ°€ in λˆ„κ°€ 보면). Always validate patterns against high-frequency grammatical contexts before deployment.
  3. LLM Summarization Hallucination: LLMs naturally compress text. Without explicit length-ratio validation (ratio < 0.7), the model will silently drop or restructure sermon content, breaking downstream CMS requirements.
  4. VRAM Contention & Inference Bloat: Running Ollama alongside STT inference on consumer GPUs causes VRAM thrashing, increasing STT time from ~9 min to 25+ min. Isolate workloads or implement explicit VRAM management.
  5. Ignoring Non-Speech Audio Thresholds: Default no_speech_threshold values capture musical interludes and silence as hallucinated text. Tune to 0.6 for sermon recordings to suppress non-speech segments accurately.
  6. Fuzzy Duplicate Blind Spots: Consecutive deduplication misses near-identical segments that reappear later. Implement a sliding window similarity filter (rapidfuzz β‰₯85%) to catch semantic duplicates across segment boundaries.
  7. Skipping Intermediate Disk Writes: In-memory pipeline chaining makes debugging impossible when a mid-stage fails. Always write stage outputs to disk to enable independent re-execution and granular failure isolation.

Deliverables

πŸ“¦ AI Pipeline Blueprint

  • Complete 6-stage architecture diagram with data flow, VRAM allocation strategy, and failure isolation boundaries.
  • Stage-by-stage dependency map showing where deterministic vs. probabilistic processing occurs.

βœ… Production Readiness Checklist

  • Whisper VAD parameters tuned for sermon acoustic profile (min_silence_duration_ms=500, no_speech_threshold=0.6)
  • condition_on_previous_text explicitly disabled to prevent context drift
  • Regex pattern database validated against false-positive grammatical contexts
  • LLM chunk size constrained to ~1,500 chars with 1-sentence overlap
  • Output length ratio validation implemented (ratio < 0.7 fallback)
  • VRAM contention monitoring active; Ollama/STT workloads isolated
  • Pinecone multilingual-e5-large upsert pipeline verified for query latency <200ms

βš™οΈ Configuration Templates

  • bible_corrections.json: Domain-specific pattern database with context-anchored regex and replacement mappings
  • whisper_config.yaml: Optimized STT parameters for long-form Korean religious audio
  • ollama_prompts.json: Constrained system prompts for STT correction and sermon structuring
  • pinecone_ingest.py: Vectorization and upsert utility with chunk metadata preservation