lock: Target Tables Only
-- Database: PostgreSQL 15
-- Relevant Tables for Revenue Analytics
CREATE TABLE merchant_ledger (
ledger_id UUID PRIMARY KEY,
merchant_fk UUID NOT NULL REFERENCES merchants(merchant_id),
transaction_ts TIMESTAMPTZ NOT NULL,
amount_minor INTEGER NOT NULL,
currency_code CHAR(3) NOT NULL,
ledger_type TEXT NOT NULL, -- 'credit', 'debit', 'refund', 'fee'
status TEXT NOT NULL -- 'pending', 'settled', 'reversed'
);
CREATE TABLE merchants (
merchant_id UUID PRIMARY KEY,
tier TEXT NOT NULL, -- 'basic', 'growth', 'enterprise'
onboarding_dt DATE NOT NULL,
region TEXT NOT NULL -- 'NA', 'EU', 'APAC'
);
**Why this works:** Inline comments on `ledger_type` and `tier` explicitly enumerate valid filter values. Without them, the model might generate `WHERE tier = 'premium'` instead of `WHERE tier = 'growth'`. The `amount_minor` naming convention signals integer-based currency storage, preventing accidental decimal division errors.
### Step 2: Business Lexicon Injection
Natural language queries map poorly to relational schemas. "Active merchants" might mean `onboarding_dt < CURRENT_DATE - INTERVAL '30 days'`. "Net revenue" might exclude fees and refunds. If you don't define these terms, the model will invent definitions that conflict with your finance team's spreadsheets.
```text
Business Glossary:
- "Net Revenue": SUM(amount_minor) WHERE ledger_type IN ('credit') AND status = 'settled', divided by 100.0
- "Active Merchant": merchant with at least one settled credit transaction in the last 90 days
- "Q3 2025": 2025-07-01 00:00:00 UTC to 2025-09-30 23:59:59 UTC
- "Churned": merchant with no settled transactions in the trailing 60-day window
Why this works: Explicit metric definitions eliminate semantic drift. The model no longer guesses aggregation logic or date boundaries. This is the single highest-ROI modification you can make to any text-to-SQL prompt.
Step 3: Style Anchoring via Few-Shot Examples
LLMs adapt to patterns. Providing one or two example question/query pairs teaches the model your naming conventions, date handling preferences, and JOIN strategies. This reduces post-generation cleanup.
Reference Examples:
-- Q: Calculate settled credit volume for EU merchants in August 2025
-- A:
SELECT
m.tier,
SUM(l.amount_minor) / 100.0 AS net_volume_usd
FROM merchant_ledger l
JOIN merchants m ON l.merchant_fk = m.merchant_id
WHERE l.currency_code = 'USD'
AND l.ledger_type = 'credit'
AND l.status = 'settled'
AND m.region = 'EU'
AND l.transaction_ts >= '2025-08-01'
AND l.transaction_ts < '2025-09-01'
GROUP BY m.tier
ORDER BY net_volume_usd DESC;
Why this works: The example establishes conventions: explicit range filtering (>= / < instead of BETWEEN), explicit division for minor units, descending sort order, and consistent aliasing. The model mirrors these patterns in subsequent generations.
Step 4: Chain-of-Thought Decomposition
Complex analytical queries (cohort retention, funnel conversion, rolling averages) fail when requested in a single shot. The model attempts to optimize multiple JOINs, window functions, and conditional aggregations simultaneously, increasing the probability of cardinality errors or incorrect grouping.
Task: Calculate 90-day merchant retention for Q3 2025 onboarding cohort.
Retention Definition: Merchant must have at least one settled credit transaction
in the 90-day window following their onboarding_dt.
Before generating SQL, outline:
1. Which tables and join keys are required?
2. How will you isolate the Q3 2025 cohort?
3. How will you define the 90-day retention window?
4. What aggregation logic prevents double-counting?
Then produce the final query.
Why this works: Forcing explicit reasoning steps mirrors how senior data engineers approach complex queries. The model catches its own logical flaws before emitting SQL. This dramatically reduces JOIN explosion errors and incorrect window function framing.
Step 5: Validation Loop & Edge Case Review
Never trust the first output. Append a validation request to force the model to audit its own logic.
Review the generated query for:
- Null propagation risks in SUM() or COUNT()
- Timezone assumptions in date filtering
- Division by zero scenarios
- Implicit type casting that could truncate decimals
- Missing COALESCE or NULLIF safeguards
List any identified risks and provide the corrected version.
Why this works: LLMs exhibit self-correction capabilities when prompted to evaluate their own output. This step surfaces silent failures like SUM() ignoring NULLs, CURRENT_DATE assuming UTC, or unhandled zero-denominator divisions.
Pitfall Guide
1. Schema Flooding
Explanation: Pasting an entire database dump into the prompt. This exhausts the context window, dilutes attention on relevant tables, and increases token costs.
Fix: Implement schema pruning. Query your information schema or ORM metadata to extract only tables involved in the analytical question. Include foreign key relationships and inline enum comments.
2. Implicit Metric Definitions
Explanation: Assuming the model understands internal business jargon. "Active users", "gross margin", and "churn" have organization-specific calculations.
Fix: Maintain a versioned business glossary block. Inject it into every prompt. Update it when finance or product changes calculation methodologies.
3. Dialect Agnosticism
Explanation: Failing to specify the SQL flavor. Date functions, string concatenation, and window function syntax vary drastically between PostgreSQL, MySQL, BigQuery, and Snowflake.
Fix: Always prefix prompts with Database: [Flavor] [Version]. Example: Database: Snowflake 8.12. This prevents DATE_TRUNC vs DATE_TRUNC(date, MONTH) mismatches.
4. The "It Ran" Fallacy
Explanation: Assuming execution equals correctness. A query that returns rows without syntax errors can still violate business logic, misapply filters, or double-count records.
Fix: Implement a validation protocol. Run queries against a sampled dataset (LIMIT 100). Cross-check totals against a known dashboard. Use semantic diffing tools to compare AI output against manually verified queries.
5. Static Prompting
Explanation: Treating the first AI response as final. LLMs are iterative reasoning engines, not deterministic compilers.
Fix: Adopt a refinement loop. Paste the initial output back with targeted corrections: Adjust the date range to fiscal quarters. Exclude test merchants (merchant_id IN (...)). Use LEFT JOIN for the events table. Iteration converges on accuracy faster than perfect first-prompt engineering.
6. Timezone & Calendar Blindness
Explanation: Date filtering without explicit timezone or calendar context. CURRENT_DATE behaves differently across regions. Fiscal calendars don't align with Gregorian months.
Fix: Explicitly define date boundaries in the glossary. Use AT TIME ZONE clauses. Specify whether queries should use UTC, local business hours, or fiscal periods.
7. Null Propagation Neglect
Explanation: Aggregating columns that contain NULLs without safeguards. SUM() silently ignores NULLs, skewing averages. COUNT(column) excludes NULLs, while COUNT(*) includes them.
Fix: Mandate COALESCE() for numeric aggregations and NULLIF() for division operations. Require the model to explicitly handle NULL semantics in the validation step.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Ad-hoc analyst query (single table, simple aggregation) | Direct prompt with schema + glossary | Low complexity; CoT adds unnecessary latency | Low token cost, fast turnaround |
| Multi-join analytical query (cohort, funnel, retention) | Chain-of-thought decomposition + few-shot examples | Prevents JOIN cardinality errors and incorrect window framing | Moderate token cost, higher accuracy |
| Enterprise-scale schema (100+ tables) | Schema pruning via metadata API + vector retrieval | Prevents context window exhaustion and attention dilution | Higher initial setup cost, lower per-query cost |
| Strict compliance/finance reporting | Glossary injection + validation loop + manual sign-off | Eliminates semantic drift and ensures auditability | Higher validation overhead, zero rollback risk |
| Real-time dashboard generation | Pre-validated query templates + parameterized AI substitution | Guarantees performance and prevents runtime syntax errors | Highest upfront engineering, lowest operational risk |
Configuration Template
SYSTEM: You are a senior data engineer specializing in [DATABASE_FLAVOR] [VERSION].
Generate production-ready SQL that strictly adheres to the provided schema, glossary, and style conventions.
SCHEMA CONTEXT:
[Insert pruned DDL with inline enum comments and FK relationships]
BUSINESS GLOSSARY:
- "[Term 1]": [Explicit calculation/logic]
- "[Term 2]": [Explicit calculation/logic]
- "[Date Boundary]": [Explicit UTC/fiscal definition]
STYLE ANCHORS:
-- Example Q: [Sample question]
-- Example A: [Reference query demonstrating conventions]
TASK:
[Insert analytical question]
REASONING REQUIREMENTS:
1. Identify required tables and join keys.
2. Define date/timezone boundaries explicitly.
3. Outline aggregation logic and NULL handling.
4. Generate the final query.
VALIDATION CHECK:
- Flag any division by zero risks.
- Verify timezone assumptions.
- Confirm COALESCE/NULLIF usage for numeric aggregations.
- List potential edge cases before outputting SQL.
Quick Start Guide
- Extract Relevant Schema: Query your database's information schema or ORM metadata to pull DDL for tables involved in your analytical question. Strip unrelated tables.
- Annotate & Prune: Add inline comments to enum columns, foreign keys, and currency/amount fields. Ensure date columns are explicitly marked as
TIMESTAMPTZ or DATE.
- Build Glossary Block: Define business terms, date boundaries, and metric calculations in plain text. Append it to your prompt template.
- Inject & Execute: Paste the structured prompt into your LLM interface. Review the reasoning steps, validate the output against a sampled dataset, and iterate with targeted corrections.