merchant_txn_pending_batch
ON merchant_transactions (shop_id, created_at)
WHERE status = 'pending';
**Rationale:** Partial indexes drastically reduce storage overhead and maintenance cost by excluding completed or archived rows from the index structure. This keeps the index smaller, improves cache locality, and accelerates webhook worker scans. Always validate index usage with `EXPLAIN (ANALYZE, BUFFERS)` on tables exceeding 100,000 rows. A `Seq Scan` or `Index Only Scan` with high buffer reads indicates misaligned index ordering or missing covering columns.
### 2. Connection Multiplexing with PgBouncer
PostgreSQL's process-per-connection model does not scale to bursty webhook traffic. PgBouncer acts as a lightweight connection pooler that multiplexes thousands of application connections over a fixed set of database processes.
**Implementation:**
```ini
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 30
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 60
Rationale: transaction mode is mandatory for multi-tenant workloads. It releases the backend connection back to the pool immediately after each transaction completes, allowing high concurrency without exhausting database processes. The reserve_pool prevents connection starvation during sudden traffic spikes. Application code remains unchanged; only the connection string points to PgBouncer instead of the raw PostgreSQL host.
3. Query Consolidation and N+1 Elimination
Webhook workers frequently exhibit the N+1 pattern: fetching a batch of merchant records, then iterating to fetch related line items, inventory adjustments, or fulfillment states. At scale, this generates thousands of round-trips per webhook batch.
Implementation:
// Consolidated fetch using explicit JOIN and parameterized bounds
async function fetchMerchantTransactionBatch(
pool: Pool,
shopId: string,
since: Date,
limit: number
): Promise<TransactionRow[]> {
const query = `
SELECT
mt.id, mt.shopify_order_id, mt.total_amount,
mtl.sku_title, mtl.quantity, mtl.unit_price
FROM merchant_transactions mt
INNER JOIN merchant_transaction_lines mtl
ON mtl.transaction_id = mt.id
WHERE mt.shop_id = $1
AND mt.created_at >= $2
AND mt.status = $3
ORDER BY mt.created_at DESC
LIMIT $4
`;
const res = await pool.query(query, [shopId, since, 'pending', limit]);
return res.rows;
}
Rationale: A single JOIN query reduces network round-trips from N+1 to 1. PostgreSQL's query planner optimizes hash joins and nested loops efficiently when proper indexes exist. This pattern also enables batch processing in webhook workers, where results are mapped in-memory rather than fetched iteratively.
4. Read/Write Traffic Separation
Analytical queries, merchant dashboards, and bulk exports must never execute against the primary instance. Long-running SELECT statements hold snapshot isolation locks that block autovacuum, increase table bloat, and compete with transactional writes for I/O bandwidth.
Implementation:
import { Pool } from 'pg';
const writePool = new Pool({ connectionString: process.env.PRIMARY_DB_URL });
const readPool = new Pool({ connectionString: process.env.REPLICA_DB_URL });
export const db = {
write: async <T>(text: string, params?: any[]): Promise<T> => {
const res = await writePool.query(text, params);
return res.rows as T;
},
read: async <T>(text: string, params?: any[]): Promise<T> => {
const res = await readPool.query(text, params);
return res.rows as T;
}
};
Rationale: Explicit pool separation enforces traffic routing at the application layer. Read replicas handle eventual consistency gracefully for dashboards and exports. For operations requiring immediate consistency (e.g., post-webhook state validation), route explicitly to writePool. This separation alone typically reduces primary CPU load by 40-60%.
5. Deterministic Query Result Caching
Merchant configuration, aggregated order summaries, and inventory thresholds change infrequently but are queried on nearly every request. Repeated database hits for static or slow-moving data waste I/O and connection capacity.
Implementation:
import Redis from 'ioredis';
const cache = new Redis(process.env.REDIS_URL);
async function getMerchantOrderSummary(shopId: string): Promise<SummaryRow[]> {
const cacheKey = `merchant:summary:${shopId}`;
const cached = await cache.get(cacheKey);
if (cached) {
return JSON.parse(cached) as SummaryRow[];
}
const rows = await db.read<SummaryRow[]>(`
SELECT status, COUNT(*) as total, SUM(total_amount) as revenue
FROM merchant_transactions
WHERE shop_id = $1
GROUP BY status
`, [shopId]);
await cache.set(cacheKey, JSON.stringify(rows), 'EX', 300);
return rows;
}
Rationale: A 300-second TTL balances freshness with load reduction. Cache invalidation should be event-driven: purge the key when a merchant updates their configuration or when a bulk import completes. This pattern shifts repetitive aggregation queries from the database to in-memory storage, reducing primary load and improving dashboard rendering times.
Pitfall Guide
1. Index Order Misalignment
Explanation: Developers frequently create indexes on (status, shop_id) or (created_at, shop_id). PostgreSQL cannot use these indexes to filter by tenant first, resulting in full index scans or sequential scans.
Fix: Always place the tenant identifier (shop_id) as the leftmost column. Use EXPLAIN to verify the planner uses an Index Scan with shop_id as the leading filter.
2. PgBouncer Session Mode Misconfiguration
Explanation: Running PgBouncer in session mode negates pooling benefits. Connections remain bound to backend processes for the entire client session, causing the same exhaustion issues as raw PostgreSQL.
Fix: Enforce pool_mode = transaction. Audit application code to ensure no session-level variables (SET, PREPARE, LISTEN) are used, as these break transaction pooling.
3. Replica Lag Blind Spots
Explanation: Routing all reads to replicas introduces eventual consistency. If a webhook updates merchant state and the next request immediately reads from a lagging replica, the application returns stale data.
Fix: Implement read-your-writes routing. After a write transaction, force the subsequent read for that tenant to the primary for a short window, or use a cache layer that invalidates on write.
4. Cache Stampede on Tenant Data
Explanation: When a cache key expires during high traffic, thousands of concurrent requests hit the database simultaneously to regenerate the result.
Fix: Use probabilistic expiration (randomize TTL by Β±10-20%) or implement a distributed lock (SETNX) so only one request rebuilds the cache while others wait or serve stale data with a background refresh.
5. Autovacuum Starvation from Analytical Queries
Explanation: Long-running SELECT queries on the primary hold transaction snapshots that prevent dead tuple removal. Table bloat increases, index efficiency degrades, and query planners choose sequential scans.
Fix: Route all analytical, reporting, and export queries to read replicas. Monitor pg_stat_user_tables for n_dead_tup and last_autovacuum timestamps. Tune autovacuum_vacuum_scale_factor if bloat persists.
6. Connection Pool Exhaustion During Webhook Bursts
Explanation: Even with PgBouncer, sudden webhook spikes can exhaust the default_pool_size, causing connection wait timeouts.
Fix: Implement backpressure at the queue layer. Use reserve_pool_size and monitor pgbouncer.stats for maxwait. If wait times exceed 500ms, scale the pool size or throttle webhook consumption.
7. Index Bloat from High-Volume Updates
Explanation: Frequent status updates (pending β fulfilled β archived) cause index bloat, degrading scan performance over time.
Fix: Schedule regular REINDEX CONCURRENTLY during low-traffic windows. Consider partitioning tables by created_at to isolate hot data and simplify maintenance.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| < 2,000 merchants | Single primary + PgBouncer | Simpler ops, sufficient I/O headroom | Low (baseline) |
| 2,000β8,000 merchants | Primary + Read Replica + Redis | Isolates analytics, reduces primary CPU | Medium (+replica + cache infra) |
| > 8,000 merchants or heavy analytics | Sharded primary by shop_id + replicas | Prevents single-node I/O saturation | High (shard router + multi-node) |
| Webhook-heavy workloads | PgBouncer transaction mode + queue backpressure | Prevents connection exhaustion during bursts | Low (config change) |
| Dashboard/Export heavy | Read replica + materialized views | Offloads aggregation, avoids primary blocking | Medium (replica + view maintenance) |
Configuration Template
# pgbouncer.ini
[databases]
* = host=primary-db.internal port=5432
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 30
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 60
server_login_retry = 5
query_timeout = 30
log_connections = 1
log_disconnections = 1
stats_period = 60
// data-access-layer.ts
import { Pool } from 'pg';
import Redis from 'ioredis';
const primary = new Pool({ connectionString: process.env.PRIMARY_DB_URL });
const replica = new Pool({ connectionString: process.env.REPLICA_DB_URL });
const cache = new Redis(process.env.REDIS_URL);
export const DataAccess = {
executeWrite: async <T>(sql: string, params?: any[]): Promise<T> => {
const res = await primary.query(sql, params);
return res.rows as T;
},
executeRead: async <T>(sql: string, params?: any[]): Promise<T> => {
const res = await replica.query(sql, params);
return res.rows as T;
},
getCached: async <T>(key: string, ttl: number, fetchFn: () => Promise<T>): Promise<T> => {
const cached = await cache.get(key);
if (cached) return JSON.parse(cached) as T;
const data = await fetchFn();
await cache.set(key, JSON.stringify(data), 'EX', ttl);
return data;
},
invalidateCache: async (key: string): Promise<void> => {
await cache.del(key);
}
};
Quick Start Guide
- Install PgBouncer: Deploy PgBouncer via Docker or package manager. Point
listen_addr to your internal network and configure pool_mode = transaction. Update your application's database connection string to target PgBouncer instead of PostgreSQL directly.
- Audit Indexes: Run
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'your_table'; for all multi-tenant tables. Recreate indexes ensuring shop_id is the first column. Use CREATE INDEX CONCURRENTLY to avoid table locks.
- Route Reads: Provision a read replica in your cloud provider. Create a second
pg.Pool instance pointing to the replica. Refactor dashboard and export queries to use the read pool. Test replica lag with SELECT pg_last_xact_replay_timestamp();.
- Add Caching: Deploy Redis. Wrap repetitive aggregation queries with the
getCached helper. Set TTLs between 120-600 seconds depending on data volatility. Implement cache invalidation hooks in your webhook processors.
- Validate: Run
EXPLAIN (ANALYZE, BUFFERS) on top 10 slow queries. Verify index usage, connection pool metrics in PgBouncer (SHOW STATS;), and cache hit ratios. Adjust pool sizes and TTLs based on observed load.