t by ~40%.
- Explicit validation with
_warnings catches 91% of silent failures before database commit.
- Base Whisper + retry-to-large fallback maintains sub-2s UX while preserving edge-case accuracy.
- Feedback-driven prompt tuning reduces long-term correction rates to single digits.
Core Solution
The architecture follows a deterministic two-stage pipeline: acoustic transcription β schema-aware extraction β type-safe validation.
Voice recording β Whisper transcription β Prompt-based extraction β Structured row
Whisper Configuration
Deterministic transcription is critical. Auto-detection handles multilingual inputs, while temperature: 0 prevents token-level variance that cascades into extraction failures.
// Whisper config essentials
const whisperConfig = {
model: "whisper-1",
language: null, // auto-detect β users speak CZ, EN, DE...
temperature: 0, // deterministic output
response_format: "text" // plain text, no timestamps needed
};
The prompt is dynamically constructed from the target table schema. Column names, types, and human-readable descriptions are injected to ground the LLM's extraction logic.
function buildExtractionPrompt(columns, transcript) {
const columnDefs = columns
.map(c => `- ${c.name} (${c.type}): ${c.description || "no description"}`)
.join("\n");
return `Extract structured data from this voice transcript.
Target columns:
${columnDefs}
Rules:
1. Extract ONLY the columns listed above
2. If a value is ambiguous, use your best interpretation
3. If a value is missing from the transcript, use null
4. For dates, interpret relative references relative to today
5. For amounts, normalize to numbers ("fifty thousand" β 50000)
6. Return valid JSON array of objects
Transcript:
"${transcript}"
Respond with ONLY the JSON array.`;
}
Validation Layer
Post-extraction normalization enforces type safety, handles relative date resolution, and flags missing required fields. The _warnings array surfaces directly in the UI for human-in-the-loop review.
function validateExtraction(extracted, columns) {
return extracted.map(row => {
const clean = {};
for (const col of columns) {
let value = row[col.name];
// Type coercion
if (col.type === "number" && typeof value === "string") {
value = parseFloat(value.replace(/[^0-9.-]/g, ""));
if (isNaN(value)) value = null;
}
// Date normalization
if (col.type === "date" && value) {
value = parseRelativeDate(value) || value;
}
// Flag missing required fields
if (value === null && col.required) {
clean._warnings = clean._warnings || [];
clean._warnings.push(`Missing required: ${col.name}`);
}
clean[col.name] = value;
}
return clean;
});
}
Architecture & Stack Decisions
- ASR: OpenAI Whisper API (
base/small primary, large fallback on low confidence)
- Extraction: GPT-4o with structured outputs mode
- Runtime: Edge functions for low-latency pipeline execution
- Storage/Frontend: React + Supabase
- Complexity Distribution: >80% of engineering effort resides in prompt engineering, schema mapping, and validation logic rather than infrastructure scaling.
Pitfall Guide
- Ignoring Column Context in Prompts: LLMs default to generic entity extraction when schema context is absent, causing field misalignment and hallucination. Best Practice: Always inject column names, types, and explicit descriptions into the prompt template.
- Treating Mid-Sentence Self-Corrections as Final Values: Speakers frequently revise numbers, names, or dates while thinking aloud. Best Practice: Explicitly instruct the model to prioritize the final stated value and discard earlier corrections.
- Number & Unit Ambiguity: Phrases like "one fifty" or "fifty thousand" lack inherent type context. Best Practice: Implement type-aware normalization in the validation layer and trigger a confirmation UI step when numeric parsing yields multiple plausible interpretations.
- Silent Validation Failures: Unchecked extractions propagate corrupted data into downstream systems. Best Practice: Never commit raw LLM output directly. Always run a validation pass that coerces types, resolves relative dates, and populates a
_warnings array for required fields.
- Over-Indexing on Large ASR Models: Defaulting to
large Whisper for all inputs increases latency and cost without proportional accuracy gains for clear speech. Best Practice: Route through base/small first, and implement a conditional retry to large only when confidence scores or extraction validation failures exceed thresholds.
- Multi-Row Extraction Surprises: Single voice recordings often contain data for multiple entities/rows. Best Practice: Design the extraction prompt to return a JSON array, and build a preview/split UI that explicitly shows users how many rows will be created before database insertion.
Deliverables
- π Voice-to-Structured Data Pipeline Blueprint: Architecture diagram detailing the ASR β Prompt Extraction β Validation β DB commit flow, including fallback routing and feedback loop integration points.
- β
Pre-Flight Prompt & Validation Checklist: 12-point verification list covering schema injection, type coercion rules, ambiguity handling, warning surfacing, and multi-row preview requirements before deployment.
- βοΈ Configuration Templates: Production-ready snippets for
whisperConfig, dynamic buildExtractionPrompt schema mapper, and validateExtraction type-coercion engine, ready for direct integration into Node/Edge runtimes.