is the control surface. Without explicit constraints, language models default to generic summaries. The architecture uses a conversational format to surface assumptions, enforces a single primary root cause to prevent hedging, and isolates a plain-text summary for audio rendering.

    from dataclasses import dataclass


    @dataclass
    class PostmortemPrompt:
        system_directive: str
        incident_payload: str
        output_schema: str  # informational only; compile() always emits Markdown headings

        def compile(self) -> str:
            # The prompt body is flush-left so the model sees clean headings.
            return f"""{self.system_directive}
    RULES:
    1. Simulate a technical review between two senior engineers.
    2. Focus exclusively on testing gaps, observability blind spots, and validation failures.
    3. Identify exactly one primary root cause. Rank secondary factors but do not equivocate.
    4. Provide concrete test recommendations: type, scope, failure injection method, and success criteria.
    5. Avoid marketing language, filler, or vague directives.
    REQUIRED OUTPUT STRUCTURE:
    # Incident Timeline & Blast Radius
    # Testing Gap Analysis
    # Root Cause Determination
    # Prevention & Validation Strategy
    # Recommended Test Suite
    # AudioSummary
    INCIDENT DATA:
    {self.incident_payload}
    """


    def construct_review_prompt(raw_incident: str) -> PostmortemPrompt:
        # Role framing for the model; reused as the system message in Step 2.
        directive = (
            "You are a principal test architect and production reliability engineer. "
            "Your expertise covers distributed systems debugging, chaos engineering, "
            "performance bottleneck analysis, and CI/CD pipeline validation. "
            "Prioritize evidence-based reasoning, correlate metrics with test coverage, "
            "and output strictly technical recommendations."
        )
        return PostmortemPrompt(
            system_directive=directive,
            incident_payload=raw_incident,
            output_schema="markdown",
        )

Why this works: The conversational constraint forces the model to simulate debate, which surfaces hidden assumptions (e.g., "Did anyone validate connection pool behavior under burst traffic?"). The single-cause directive makes the model commit to one primary cause instead of returning a hedged list of contributing factors. The isolated AudioSummary section gives the text-to-speech step a single block to extract, so headings and tables from the rest of the report are not read aloud.
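Because the rest of the pipeline depends on the required headings being present, it can help to check a report before parsing it. The following is a minimal sketch, not part of the original design: `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, and the tolerance for `#`, `##`, `###`, and bold markers is an assumption about how models tend to drift.

```python
import re

# Section titles copied from the REQUIRED OUTPUT STRUCTURE in the prompt.
REQUIRED_SECTIONS = [
    "Incident Timeline & Blast Radius",
    "Testing Gap Analysis",
    "Root Cause Determination",
    "Prevention & Validation Strategy",
    "Recommended Test Suite",
    "AudioSummary",
]


def missing_sections(report: str) -> list[str]:
    """Return the required titles that have no heading line in the report."""
    missing = []
    for title in REQUIRED_SECTIONS:
        # Accept 1-3 leading '#', optional bold markers, any letter case.
        pattern = rf"^#{{1,3}}\s*\*{{0,2}}{re.escape(title)}\*{{0,2}}\s*$"
        if not re.search(pattern, report, re.MULTILINE | re.IGNORECASE):
            missing.append(title)
    return missing
```

A non-empty result is a reasonable trigger for a retry or for logging the raw report, rather than passing a partial document downstream.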
Step 2: Local Inference via Ollama
Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1, so the standard SDK works against it. Ollama does not check the API key, but the SDK requires a non-empty value, hence the placeholder below. The client uses a low temperature to keep output stable across runs.

    from openai import OpenAI


    class LocalInferenceClient:
        def __init__(self, endpoint: str = "http://localhost:11434/v1", model_id: str = "llama3"):
            # Ollama ignores the key, but the SDK rejects an empty one.
            self.client = OpenAI(base_url=endpoint, api_key="local_dummy")
            self.model_id = model_id

        def generate_analysis(self, prompt: PostmortemPrompt, temperature: float = 0.3) -> str:
            # compile() already embeds the directive; it is also sent as the system message.
            messages = [
                {"role": "system", "content": prompt.system_directive},
                {"role": "user", "content": prompt.compile()},
            ]
            response = self.client.chat.completions.create(
                model=self.model_id,
                messages=messages,
                temperature=temperature,
                max_tokens=4096,
            )
            # content may be None on an empty completion, so guard before strip().
            return (response.choices[0].message.content or "").strip()

Architecture rationale: Local execution keeps incident data on the machine, which matters in compliance-heavy environments. Temperature defaults to 0.3 to reduce run-to-run variance while leaving room for technical reasoning; it does not make the output deterministic. Because the interface is OpenAI-compatible, swapping models (Mistral, Qwen, Llama 3.1) only requires changing model_id.
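To illustrate the model-swap point, the request can be assembled separately from the client. This is a sketch under assumptions: `build_chat_request` is a hypothetical helper, the 0.0 to 2.0 temperature bound follows the OpenAI chat API convention, and `qwen2.5:7b` is assumed to be the tag of a model already pulled into Ollama.

```python
from typing import Any


def build_chat_request(model_id: str, system: str, user: str,
                       temperature: float = 0.3,
                       max_tokens: int = 4096) -> dict[str, Any]:
    """Assemble keyword arguments for client.chat.completions.create()."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    return {
        "model": model_id,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


# Swapping models changes one string; the request shape stays the same.
llama_req = build_chat_request("llama3", "You are a reviewer.", "Incident: ...")
qwen_req = build_chat_request("qwen2.5:7b", "You are a reviewer.", "Incident: ...")
```

The resulting dictionary could be passed as `client.chat.completions.create(**llama_req)`.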
Step 3: Robust Output Parsing
LLM formatting is inherently inconsistent. Headings may render as #, ##, or ###, with or without emphasis markers. A naive string split fails across runs. The parser uses a multi-pattern regex with a fallback inference call.

    import re
    from typing import Optional


    class ReportParser:
        # Heading with 1-3 '#', optional bold markers, then the body up to the next heading.
        AUDIO_MARKER = r"#{1,3}\s*\*{0,2}AudioSummary\*{0,2}\s*\n+(.*?)(\n#{1,3}\s|\Z)"

        @classmethod
        def extract_audio_section(cls, raw_text: str) -> Optional[str]:
            match = re.search(cls.AUDIO_MARKER, raw_text, re.DOTALL | re.IGNORECASE)
            if match:
                # Treat an empty section as a miss so the fallback can run.
                return match.group(1).strip() or None
            return None

        @classmethod
        def fallback_summarization(cls, client: LocalInferenceClient, full_report: str) -> str:
            fallback_prompt = (
                "Condense the following technical report into a 150-200 word executive summary. "
                "Use plain prose only. No markdown, no lists, no headings. "
                "Focus on root cause, testing gaps, and prevention steps.\n\n"
                f"{full_report}"
            )
            response = client.client.chat.completions.create(
                model=client.model_id,
                messages=[{"role": "user", "content": fallback_prompt}],
                temperature=0.2,
            )
            # content may be None on an empty completion, so guard before strip().
            return (response.choices[0].message.content or "").strip()

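Why this matters: Two-stage extraction keeps audio generation working when heading format drifts: if the regex finds nothing, the fallback call produces the summary instead. The fallback uses a lower temperature and explicit constraints to keep prose length and tone consistent, although it still depends on the model being reachable.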
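The two stages can be tied together in one function. The sketch below is self-contained and takes the fallback as a plain callable so it can be exercised without a running model; `extract_audio`, `audio_text`, and the slightly stricter pattern (line-anchored heading, optional space inside "Audio Summary") are illustrative assumptions rather than the parser defined above.

```python
import re
from typing import Callable, Optional

AUDIO_PATTERN = re.compile(
    r"^#{1,3}\s*\*{0,2}Audio\s*Summary\*{0,2}[ \t]*\n+(.*?)(?=\n#{1,3}\s|\Z)",
    re.DOTALL | re.IGNORECASE | re.MULTILINE,
)


def extract_audio(report: str) -> Optional[str]:
    """Stage 1: return the AudioSummary body, or None if absent or empty."""
    match = AUDIO_PATTERN.search(report)
    if not match:
        return None
    text = match.group(1).strip()
    return text or None


def audio_text(report: str, summarize: Callable[[str], str]) -> str:
    """Stage 2: call the summarizer only when extraction finds nothing."""
    extracted = extract_audio(report)
    if extracted is not None:
        return extracted
    return summarize(report)
```

In the pipeline, `summarize` would wrap `ReportParser.fallback_summarization` with a client instance; in tests it can be a stub that returns a fixed string.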
Step 4: Neural Text-to-Speech Generation
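edge-tts is a Python client for the online text-to-speech service used by Microsoft Edge. It needs no API key or paid subscription, but the text is sent to Microsoft's servers, so this step is not local. The library is async-native: Communicate.save() writes to a file and Communicate.stream() yields audio chunks. The implementation here uses save() with a temporary file and wraps it in asyncio.run() for synchronous callers.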

    import asyncio
    import os
    import tempfile

    import edge_tts


    class AudioRenderer:
        @staticmethod
        async def _synthesize(text: str, voice_id: str) -> bytes:
            communicate = edge_tts.Communicate(text, voice=voice_id)
            # Reserve a path, then close the handle so save() can write to it.
            with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
                tmp_path = tmp.name
            try:
                await communicate.save(tmp_path)
                with open(tmp_path, "rb") as f:
                    return f.read()
            finally:
                # Always remove the temp file, even if synthesis fails.
                if os.path.exists(tmp_path):
                    os.unlink(tmp_path)

        @classmethod
        def render_to_bytes(cls, text: str, voice_id: str = "en-US-AriaNeural") -> bytes:
            # asyncio.run() raises RuntimeError if an event loop is already running.
            return asyncio.run(cls._synthesize(text, voice_id))

Production insight: The temp file approach keeps the code simple, but it is not the only option: edge-tts can also deliver audio chunks in memory through Communicate.stream(), which avoids disk I/O. With the temp-file version, cleanup runs in a finally block so files do not accumulate in high-throughput environments. Voice selection should be configurable; technical content benefits from neutral, clear enunciation (e.g., en-GB-LibbyNeural or en-US-AriaNeural).
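As an alternative to the temporary file, the audio can be accumulated in memory. `edge_tts.Communicate.stream()` is an async generator that yields dictionaries, with audio chunks carrying `"type": "audio"` and a `"data"` bytes payload. The sketch below keeps the collector independent of the library so it runs offline; `collect_audio` and `_fake_stream` are hypothetical names, and the commented wiring assumes the edge-tts package and network access.

```python
import asyncio
from typing import Any, AsyncIterator, Dict


async def collect_audio(chunks: AsyncIterator[Dict[str, Any]]) -> bytes:
    """Concatenate the payload of every chunk whose type is "audio"."""
    buffer = bytearray()
    async for chunk in chunks:
        if chunk.get("type") == "audio":
            buffer.extend(chunk["data"])
    return bytes(buffer)


# Hypothetical wiring (requires the edge-tts package and network access):
#   import edge_tts
#   stream = edge_tts.Communicate("Summary text", voice="en-US-AriaNeural").stream()
#   audio = asyncio.run(collect_audio(stream))


async def _fake_stream():
    # Stand-in for Communicate.stream(), used only to exercise the collector.
    yield {"type": "audio", "data": b"ID3"}
    yield {"type": "WordBoundary", "text": "pool"}
    yield {"type": "audio", "data": b"\x00\x01"}


demo_bytes = asyncio.run(collect_audio(_fake_stream()))
```

Non-audio chunks such as word-boundary metadata are skipped, so the result contains only audio bytes.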
Pitfall Guide
1. Unconstrained Root Cause Analysis
Explanation: Without explicit directives, models distribute probability across multiple causes, producing hedged analysis like "several factors likely contributed." This mirrors real-world RCA failure modes.
Fix: Enforce a single-primary-cause rule in the prompt (rule 3 in Step 1). Require the model to rank secondary factors and cite the evidence behind each ranking.
2. Fragile Section Extraction
Explanation: String splitting on # AudioSummary fails when the model emits ## AudioSummary, wraps the title in ** emphasis, or adds trailing whitespace. The failure is silent and produces an empty audio payload.
Fix: Match the heading with a regex that tolerates one to three # characters and optional emphasis, using the IGNORECASE and DOTALL flags. Fall back to an inference call when extraction returns None.
3. Blocking TTS in Sync Contexts
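Explanation: AudioRenderer.render_to_bytes() calls asyncio.run(), which raises RuntimeError when an event loop is already running (for example inside an async web handler). In synchronous code it blocks the calling thread until synthesis finishes, which can freeze a UI or trip request timeouts.
Fix: Use asyncio.run() only from synchronous entry points such as a CLI. Inside an async framework, await the synthesis coroutine directly or schedule it with asyncio.create_task() so the event loop stays responsive.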
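The difference between the two entry points can be sketched with a stand-in coroutine. Nothing here calls a real TTS service: `synthesize`, `render_cli`, `handle_request`, and `is_loop_running` are hypothetical names used for illustration.

```python
import asyncio


async def synthesize(text: str) -> bytes:
    """Stand-in for an async TTS call; returns fake audio bytes."""
    await asyncio.sleep(0)
    return text.encode("utf-8")


def render_cli(text: str) -> bytes:
    """Synchronous entry point: creates and owns the event loop for one call."""
    return asyncio.run(synthesize(text))


async def handle_request(text: str) -> bytes:
    """Async entry point: a loop is already running, so schedule and await."""
    task = asyncio.create_task(synthesize(text))  # runs concurrently with this handler
    # Other awaitable work could happen here while synthesis proceeds.
    return await task


def is_loop_running() -> bool:
    """True when called from inside a running event loop."""
    try:
        asyncio.get_running_loop()
        return True
    except RuntimeError:
        return False
```

A shared helper could call `is_loop_running()` to decide which path is safe, instead of calling `asyncio.run()` unconditionally.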
4. Context Window Overflow
Explanation: Pasting raw log dumps (10k+ lines) exceeds token limits, causing silent truncation or degraded reasoning quality.
Fix: Pre-filter incident data to include only timestamps, error codes, metric spikes, and deployment events. Strip stack traces and verbose debug output before prompt injection.
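A pre-filter along these lines could run before prompt construction. It is a rough sketch: the keep-patterns, the stack-frame heuristics, and the character budget (a crude proxy for tokens) are all assumptions to adapt to your own log format.

```python
import re

# Hypothetical keep-patterns; tune them to your own log format.
KEEP = re.compile(
    r"\b(ERROR|FATAL|WARN|timeout|deploy|rollback|latency|5\d\d)\b",
    re.IGNORECASE,
)
# Heuristics for Java-style and Python-style stack trace lines.
STACK_FRAME = re.compile(r'^\s+at\s|^\s*File ".*", line \d+|^Traceback')


def prefilter(raw_log: str, max_chars: int = 12_000) -> str:
    """Keep signal-bearing lines, drop stack frames, stop at the budget."""
    kept = []
    used = 0
    for line in raw_log.splitlines():
        if STACK_FRAME.search(line) or not KEEP.search(line):
            continue
        if used + len(line) + 1 > max_chars:
            kept.append("[truncated: character budget reached]")
            break
        kept.append(line)
        used += len(line) + 1
    return "\n".join(kept)
```

The explicit truncation marker makes it visible to the model, and to a human reader, that the input was cut rather than complete.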
5. Ignoring Observability Correlation
Explanation: Models trained on general text lack inherent knowledge of metric-test relationships. Output becomes generic without explicit signal mapping.
Fix: Instruct the model to correlate specific alerts (e.g., "HikariPool timeout", "Kafka consumer lag > 5000") with the corresponding test types (connection pool saturation tests, backpressure validation).
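One way to supply that mapping is to append a hint block to the incident payload. The sketch below is illustrative: `SIGNAL_TO_TEST` and `correlation_hints` are hypothetical names, the first two entries mirror the examples in this section, and the third is an invented placeholder.

```python
# Illustrative mapping; the first two entries come from the examples in the text,
# the third is a hypothetical addition.
SIGNAL_TO_TEST = {
    "HikariPool timeout": "connection pool saturation test",
    "Kafka consumer lag": "backpressure validation",
    "OOMKilled": "memory soak test",
}


def correlation_hints(incident: str) -> str:
    """Build a hint block for every known signal found in the incident text."""
    lowered = incident.lower()
    hits = [(s, t) for s, t in SIGNAL_TO_TEST.items() if s.lower() in lowered]
    if not hits:
        return ""
    lines = ["SIGNAL CORRELATION HINTS:"]
    lines += [f'- Alert "{s}" -> evaluate coverage of: {t}' for s, t in hits]
    return "\n".join(lines)
```

The returned string can be concatenated to the raw incident before it is passed to `construct_review_prompt()`.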
6. Model Quantization Mismatch
Explanation: Running 70B models on consumer hardware causes OOM errors or extreme latency. Conversely, 3B models lack technical reasoning depth.
Fix: Use 7B-8B parameter models with Q4_K_M quantization for balance. Validate reasoning quality with a known incident before production deployment.
7. Hardcoded Voice Preferences
Explanation: Defaulting to a single voice reduces accessibility and user comfort. Technical content requires clear consonant articulation.
Fix: Expose voice selection as configuration. Maintain a mapping of supported voices with phonetic clarity ratings. Provide fallback to system TTS if neural voices fail.
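Voice configuration and engine fallback could look roughly like this. The sketch assumes a small allow-list built from the two voices named in this article and treats each TTS engine as a plain callable; `resolve_voice` and `render_with_fallback` are hypothetical helpers, and the `TTS_VOICE` variable matches the configuration template.

```python
import os
from typing import Callable, Mapping, Optional, Sequence

# Voices named in this article; extend with whatever your TTS backend lists.
SUPPORTED_VOICES = ("en-US-AriaNeural", "en-GB-LibbyNeural")
DEFAULT_VOICE = "en-US-AriaNeural"


def resolve_voice(requested: Optional[str] = None,
                  env: Mapping[str, str] = os.environ) -> str:
    """Pick the explicit voice, then TTS_VOICE, then the default."""
    candidate = requested or env.get("TTS_VOICE") or DEFAULT_VOICE
    return candidate if candidate in SUPPORTED_VOICES else DEFAULT_VOICE


def render_with_fallback(text: str,
                         engines: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each engine in order and return the first successful result."""
    errors = []
    for engine in engines:
        try:
            return engine(text)
        except Exception as exc:  # broad on purpose: any engine failure triggers fallback
            errors.append(exc)
    raise RuntimeError(f"all TTS engines failed: {errors}")
```

The neural renderer would come first in `engines`, followed by a wrapper around whatever system TTS is available on the host.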
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Highly sensitive infrastructure logs | Local Ollama + Llama 3 | Zero data egress, full compliance | Hardware amortized (~$0.05/run) |
| High-volume incident processing | Cloud API + GPT-4o | Parallel processing, lower latency | ~$0.02-0.05 per 1k tokens |
| Budget-constrained teams | Local Ollama + Qwen 2.5 7B | Free inference, strong technical reasoning | Hardware only |
| Compliance-heavy (HIPAA/FedRAMP) | On-prem LLM + air-gapped TTS | Regulatory alignment, audit trail | Infrastructure + maintenance |
| Rapid prototyping | Cloud API + edge-tts | Fastest setup, no local dependencies | API + free TTS |
Configuration Template

    # config.py
    import os
    from dataclasses import dataclass


    @dataclass
    class EngineConfig:
        # LLM Settings
        ollama_endpoint: str = "http://localhost:11434/v1"
        model_id: str = "llama3"
        inference_temperature: float = 0.3
        max_output_tokens: int = 4096
        # TTS Settings
        voice_id: str = "en-US-AriaNeural"
        audio_format: str = "mp3"
        # Parsing Settings
        fallback_enabled: bool = True
        fallback_temperature: float = 0.2
        # Runtime
        log_sanitization: bool = True
        context_window_limit: int = 3500  # Reserve tokens for output


    # Usage
    config = EngineConfig(
        model_id=os.getenv("LLM_MODEL", "llama3"),
        voice_id=os.getenv("TTS_VOICE", "en-GB-LibbyNeural"),
    )

Quick Start Guide
- Install Ollama: Download and run Ollama from the official repository. Verify the service is active on localhost:11434.
- Pull the Model: Execute ollama pull llama3 to download the 8B parameter model. Ensure sufficient disk space (~4.7GB).
- Setup Environment: Create a virtual environment, install dependencies (openai, edge-tts), and configure EngineConfig with your preferred voice and model. The parser uses the standard-library re module, so no separate regex package is needed.
- Run the Pipeline: Pass a raw incident description to construct_review_prompt(), execute LocalInferenceClient.generate_analysis(), parse with ReportParser, and render audio via AudioRenderer.render_to_bytes(). Output is ready for playback or CI integration.
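The last step can be wired together as a single function. In this sketch every stage is injected as a callable so the flow can be checked without Ollama or a TTS service; `run_pipeline` is a hypothetical name, and the commented wiring shows one way the classes from the earlier steps might be plugged in.

```python
from typing import Callable, Optional, Tuple


def run_pipeline(raw_incident: str,
                 generate: Callable[[str], str],
                 extract: Callable[[str], Optional[str]],
                 summarize: Callable[[str], str],
                 synthesize: Callable[[str], bytes]) -> Tuple[str, bytes]:
    """Generate the report, pick the spoken text, and render it to audio."""
    report = generate(raw_incident)
    # Use the extracted section when present, otherwise the fallback summary.
    spoken = extract(report) or summarize(report)
    return report, synthesize(spoken)


# Hypothetical wiring with the classes from the earlier steps:
#   client = LocalInferenceClient()
#   report, audio = run_pipeline(
#       raw_incident,
#       generate=lambda raw: client.generate_analysis(construct_review_prompt(raw)),
#       extract=ReportParser.extract_audio_section,
#       summarize=lambda rep: ReportParser.fallback_summarization(client, rep),
#       synthesize=AudioRenderer.render_to_bytes,
#   )
#   with open("postmortem.mp3", "wb") as f:
#       f.write(audio)
```

Keeping the stages injectable also makes it straightforward to swap the TTS backend or the model client independently.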
The architecture transforms incident data from a retrospective artifact into a proactive testing blueprint. By constraining reasoning, preserving data locality, and automating audio delivery, teams close the feedback loop between production failures and validation strategy without compromising compliance or budget.