= JSON.parse(decoded);
return parsed.embedding as number[];
}
```

**Architecture rationale:**
- `dimensions: 1024` maximizes retrieval precision. While 256 or 512 dimensions reduce storage footprint, they sacrifice semantic granularity. For most enterprise knowledge bases, the storage cost difference is negligible compared to the accuracy loss.
- `normalize: true` makes Bedrock return unit-length vectors. For unit vectors, cosine similarity equals the dot product, so cosine and inner-product rankings agree and the cheaper inner-product operator becomes an option. Cosine distance itself ignores magnitude, but inner-product and Euclidean distances do not: without normalization, those metrics let vector magnitude skew the scores.
- Authentication relies on the Lambda or ECS task IAM role. The `bedrock:InvokeModel` permission is granted via policy attachment, removing the need for API keys or secret manager lookups.
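
To make the normalization point concrete, the sketch below checks unit length and shows that the dot product of two unit vectors matches the cosine similarity of the original vectors. The helper names (`l2Norm`, `dot`, `cosineSimilarity`, `normalize`, `isUnitLength`) and the tolerance are illustrative assumptions, not part of the Bedrock SDK.

```typescript
// Hypothetical helpers for illustration only.

// Euclidean (L2) length of a vector.
function l2Norm(v: number[]): number {
  let sumOfSquares = 0;
  for (const x of v) sumOfSquares += x * x;
  return Math.sqrt(sumOfSquares);
}

// Plain dot product; both vectors must have the same dimensionality.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Cosine similarity divides out both magnitudes.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (l2Norm(a) * l2Norm(b));
}

// Scale a vector to unit length.
function normalize(v: number[]): number[] {
  const n = l2Norm(v);
  if (n === 0) throw new Error("cannot normalize a zero vector");
  return v.map((x) => x / n);
}

// Ingestion-time sanity check; the tolerance is an assumed value.
function isUnitLength(v: number[], tolerance: number = 1e-3): boolean {
  return Math.abs(l2Norm(v) - 1) <= tolerance;
}
```

A check like `isUnitLength(embedding)` can run in the ingestion path before each insert, so a request that accidentally omits `normalize: true` is caught early.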
### 2. Schema Design and Vector Constraints
PostgreSQL enforces strict type checking on `pgvector` columns. Defining the dimensionality at the schema level acts as a runtime guardrail against model version mismatches.
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE kb_documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  source_name TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE kb_segments (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID NOT NULL REFERENCES kb_documents(id) ON DELETE CASCADE,
  tenant_id UUID NOT NULL,
  segment_order INT NOT NULL,
  raw_text TEXT NOT NULL,
  vector_embedding vector(1024),
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

**Architecture rationale:**

The `vector(1024)` type constraint rejects any insert that doesn't match the expected dimensionality. If you later switch to a model that outputs 512 dimensions, the database will throw a type mismatch error instead of silently corrupting your index. This fail-fast behavior is essential for maintaining pipeline integrity.
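
The same guardrail can be mirrored on the application side so that a bad batch is caught before it reaches the database. A minimal sketch, assuming a shared constant that tracks the `vector(1024)` column; `EMBEDDING_DIMENSIONS` and `partitionByDimension` are hypothetical names introduced here:

```typescript
// Assumption: this constant mirrors the vector(1024) column definition
// and lives in one shared configuration module.
const EMBEDDING_DIMENSIONS = 1024;

interface DimensionCheck {
  valid: number[][];
  rejected: Array<{ index: number; actual: number }>;
}

// Split a batch so that one malformed embedding does not abort the whole
// insert; rejected entries keep their batch index for logging or retry.
function partitionByDimension(
  batch: number[][],
  expected: number = EMBEDDING_DIMENSIONS
): DimensionCheck {
  const result: DimensionCheck = { valid: [], rejected: [] };
  batch.forEach((embedding, index) => {
    if (embedding.length === expected) {
      result.valid.push(embedding);
    } else {
      result.rejected.push({ index, actual: embedding.length });
    }
  });
  return result;
}
```

The database constraint remains the source of truth; this check only moves the failure earlier and makes it easier to report which inputs were wrong.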

### 3. Indexing Strategy: IVFFlat vs HNSW
Exact nearest-neighbor search scales linearly with table size. For datasets exceeding 100k segments, approximate nearest-neighbor (ANN) indexing becomes mandatory.
```sql
CREATE INDEX idx_kb_segments_ivf
  ON kb_segments
  USING ivfflat (vector_embedding vector_cosine_ops)
  WITH (lists = 100);
```

**Architecture rationale:**

IVFFlat partitions vectors into clusters (`lists`) during index creation. At query time, it searches only the nearest clusters rather than the entire table. The `lists = 100` parameter is a starting point; the pgvector documentation suggests `rows / 1000` for tables up to 1M rows and `sqrt(rows)` beyond that, with `ivfflat.probes` starting near `sqrt(lists)` at query time.

**Critical constraint:** IVFFlat requires existing data to build meaningful clusters. Creating it on an empty or nearly empty table yields poorly placed centroids, and recall stays low as data arrives until the index is rebuilt. For continuously growing datasets, HNSW (Hierarchical Navigable Small World) is the better fit because it needs no training step and maintains search quality during incremental inserts, at the cost of slower builds and higher memory use. Switch to HNSW when your segment count exceeds 500k or when write throughput is high.
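
The sizing heuristic can be expressed as a small helper. This is a sketch of the starting points suggested in the pgvector documentation (`rows / 1000` up to 1M rows, `sqrt(rows)` above, `probes` near `sqrt(lists)`); the function name `suggestIvfflatTuning` is hypothetical, and the output should be treated as a first guess to validate against measured recall.

```typescript
interface IvfflatTuning {
  lists: number;  // value for WITH (lists = ...)
  probes: number; // value for SET ivfflat.probes = ...
}

// Starting-point heuristic only; tune against recall and latency measurements.
function suggestIvfflatTuning(rowCount: number): IvfflatTuning {
  if (!isFinite(rowCount) || rowCount < 0) {
    throw new Error(`rowCount must be a non-negative number, got ${rowCount}`);
  }
  const rawLists = rowCount <= 1_000_000 ? rowCount / 1000 : Math.sqrt(rowCount);
  const lists = Math.max(1, Math.round(rawLists));
  const probes = Math.max(1, Math.round(Math.sqrt(lists)));
  return { lists, probes };
}
```

For a table of 100,000 segments this yields `lists = 100`, which matches the index definition above.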

### 4. Vector Insertion Without Native Drivers
Many PostgreSQL drivers lack native pgvector type support. Rather than compiling C extensions or managing language-specific packages, we use string literal casting.
```typescript
import { Pool } from "pg";

const dbPool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function persistSegment(
  docId: string,
  tenantId: string,
  order: number,
  text: string,
  embedding: number[]
): Promise<void> {
  // pgvector's text input format: "[v1,v2,...]"
  const vectorLiteral = `[${embedding.join(",")}]`;
  await dbPool.query(
    `INSERT INTO kb_segments
       (document_id, tenant_id, segment_order, raw_text, vector_embedding)
     VALUES ($1, $2, $3, $4, $5::vector)
     ON CONFLICT DO NOTHING`,
    [docId, tenantId, order, text, vectorLiteral]
  );
}
```

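**Architecture rationale:**

The `::vector` cast converts the bracketed, comma-separated string into the native pgvector type at execution time. This approach works with any PostgreSQL driver that can send a text parameter, eliminates native compilation dependencies, and maintains compatibility with both x86 and ARM Lambda runtimes. The `ON CONFLICT DO NOTHING` clause makes retries idempotent only when a unique constraint covers the segment's natural key, for example `UNIQUE (document_id, segment_order)`; against the randomly generated `id` primary key alone, no conflict ever occurs and retries insert duplicates.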
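
Because the literal is built by string concatenation, it is worth validating the numbers before they reach the query. A minimal sketch, with hypothetical helper names `toVectorLiteral` and `parseVectorLiteral`; the parser covers the case where a driver without pgvector support hands the column back as text:

```typescript
// Serialize an embedding into pgvector's "[v1,v2,...]" text form,
// rejecting values that would not form a valid vector.
function toVectorLiteral(embedding: number[]): string {
  if (embedding.length === 0) throw new Error("embedding must not be empty");
  for (const value of embedding) {
    if (!isFinite(value)) throw new Error(`non-finite component: ${value}`);
  }
  return `[${embedding.join(",")}]`;
}

// Parse the text form back into numbers.
function parseVectorLiteral(literal: string): number[] {
  const trimmed = literal.trim();
  if (trimmed.charAt(0) !== "[" || trimmed.charAt(trimmed.length - 1) !== "]") {
    throw new Error("vector literal must be wrapped in square brackets");
  }
  const body = trimmed.slice(1, -1).trim();
  if (body === "") return [];
  return body.split(",").map((part) => {
    const parsed = Number(part);
    if (part.trim() === "" || !isFinite(parsed)) {
      throw new Error(`invalid vector component: "${part}"`);
    }
    return parsed;
  });
}
```

Swapping the inline template string for `toVectorLiteral(embedding)` keeps the query unchanged while turning a malformed embedding into an application error rather than a database one.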

### 5. Similarity Search with RLS Enforcement
Tenant isolation is enforced at the database layer, not the application layer. Session variables trigger RLS policies automatically.
```typescript
export async function retrieveRelevantSegments(
  tenantId: string,
  queryVector: number[],
  limit: number = 8
): Promise<Array<{ id: string; text: string; distance: number; docName: string }>> {
  const vectorLiteral = `[${queryVector.join(",")}]`;
  const client = await dbPool.connect();
  try {
    // Transaction-local settings only exist inside a transaction block.
    await client.query("BEGIN");
    // SET does not accept bind parameters; set_config(name, value, true)
    // is the parameterizable equivalent of SET LOCAL.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await client.query(
      `SELECT s.id, s.raw_text, s.segment_order,
              d.source_name,
              s.vector_embedding <=> $1::vector AS similarity_score
       FROM kb_segments s
       JOIN kb_documents d ON d.id = s.document_id
       ORDER BY similarity_score ASC
       LIMIT $2`,
      [vectorLiteral, limit]
    );
    await client.query("COMMIT");
    return result.rows.map(row => ({
      id: row.id,
      text: row.raw_text,
      distance: parseFloat(row.similarity_score),
      docName: row.source_name,
    }));
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```

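**Architecture rationale:**

The `<=>` operator computes cosine distance. Values range from 0 (identical direction) to 2 (opposite). Ordering by ascending distance returns the most semantically relevant segments first. The transaction-local `app.tenant_id` setting drives the RLS policy without requiring explicit `WHERE tenant_id = ?` clauses, and it resets at commit or rollback, so a pooled connection never carries one tenant's ID into the next request. This prevents accidental data leakage when developers modify queries. Note that PostgreSQL skips RLS for superusers and, unless `FORCE ROW LEVEL SECURITY` is set, for the table owner, so the application must connect as a separate, non-owner role.

The `limit = 8` default balances recall and latency. Eight segments at ~500 tokens each yield ~4,000 context tokens, a small fraction of Claude's context window, which keeps prompts cheap and limits noise. Increasing this number improves recall but slows generation and gives the model more irrelevant text to draw on, which raises hallucination risk.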
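
The token arithmetic can also be enforced rather than assumed, since real segments vary in length. The sketch below trims an already ranked list to a token budget. It uses a rough 4-characters-per-token estimate, which is an assumption for illustration and not Claude's actual tokenizer; `estimateTokens` and `fitToTokenBudget` are hypothetical names.

```typescript
// Rough heuristic (assumed): about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep segments in ranked order until the next one would exceed the budget.
function fitToTokenBudget<T extends { text: string }>(
  rankedSegments: T[],
  maxTokens: number
): T[] {
  const kept: T[] = [];
  let used = 0;
  for (const segment of rankedSegments) {
    const cost = estimateTokens(segment.text);
    if (used + cost > maxTokens) break;
    kept.push(segment);
    used += cost;
  }
  return kept;
}
```

With ten segments of about 500 tokens each and a 4,000-token budget, the first eight are kept, which matches the default `limit`.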

### 6. Prompt Assembly and LLM Generation
The retrieved segments are formatted with explicit citation markers. The system prompt enforces strict grounding rules.
```typescript
import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";

const LLM_MODEL_ID = "anthropic.claude-haiku-4-5-20240307-v1:0";
const llmClient = new BedrockRuntimeClient({ region: process.env.AWS_REGION });

export async function generateGroundedResponse(
  userQuestion: string,
  segments: Array<{ id: string; text: string; docName: string }>
): Promise<{ answer: string; citations: Array<{ docId: string; excerpt: string }> }> {
  // Number each excerpt so the model can cite it as [1], [2], ...
  const contextBlocks = segments
    .map((seg, idx) => `[${idx + 1}] From "${seg.docName}":\n${seg.text}`)
    .join("\n\n");

  const systemPrompt = `You are a technical research assistant. Answer the user's question using ONLY the provided excerpts. Cite sources using [1], [2], etc. If the answer is not in the excerpts, state that clearly.`;
  const userPrompt = `Excerpts:\n${contextBlocks}\n\nQuestion: ${userQuestion}`;

  const payload = {
    anthropic_version: "bedrock-2023-05-31",
    max_tokens: 1024,
    system: systemPrompt,
    messages: [{ role: "user", content: userPrompt }],
  };

  const command = new InvokeModelCommand({
    modelId: LLM_MODEL_ID,
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify(payload),
  });

  const response = await llmClient.send(command);
  const decoded = new TextDecoder().decode(response.body);
  const parsed = JSON.parse(decoded);
  const answer = parsed.content[0].text;

  // Citations are built from the retrieved segments, not parsed from the answer.
  const citations = segments.map(seg => ({
    docId: seg.id,
    excerpt: seg.text.slice(0, 200),
  }));

  return { answer, citations };
}
```

**Architecture rationale:**

Claude Haiku 4.5 is selected for its speed and cost efficiency. The task is primarily summarization and formatting, not complex reasoning or knowledge retrieval. The `max_tokens: 1024` cap ensures predictable latency and cost. The citation extraction happens deterministically on the application side, guaranteeing that the frontend receives structured source data regardless of how the LLM formats its inline references. This separation of concerns improves UI rendering reliability and enables click-to-verify functionality.
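
Since the citation list contains every retrieved segment, a UI may also want to know which ones the answer actually references. A minimal sketch, assuming the model follows the `[1]`, `[2]` marker convention from the system prompt; `extractCitedIndices` and `citedSegments` are hypothetical helper names:

```typescript
// Collect the distinct 1-based markers like [3] that appear in the answer
// and point at a real segment; out-of-range markers are ignored.
function extractCitedIndices(answer: string, segmentCount: number): number[] {
  const pattern = /\[(\d+)\]/g;
  const seen: { [marker: number]: boolean } = {};
  const cited: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const marker = parseInt(match[1], 10);
    if (marker >= 1 && marker <= segmentCount && !seen[marker]) {
      seen[marker] = true;
      cited.push(marker);
    }
  }
  return cited.sort((a, b) => a - b);
}

// Map the markers back to the segments that were placed in the prompt.
function citedSegments<T>(answer: string, segments: T[]): T[] {
  return extractCitedIndices(answer, segments.length).map((n) => segments[n - 1]);
}
```

This stays a best-effort overlay: the deterministic citation list remains the fallback when the model omits or mangles its markers.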

## Pitfall Guide

| Pitfall | Explanation | Fix |
|---|---|---|
| Cross-Model Embedding Mismatch | Using different models for ingestion vs. query creates incompatible vector spaces. Cosine similarity between them is meaningless. | Enforce a single model ID in a shared configuration module. Add integration tests that verify embedding dimensionality matches schema constraints. |
| IVFFlat Initialization on Empty Tables | IVFFlat computes its cluster centroids at build time. An index built on an empty table has poorly placed centroids, so recall is low once data arrives. | Populate the table with representative data before creating the index. For dynamic workloads, switch to HNSW or schedule periodic `REINDEX` operations. |
| Unnormalized Vector Magnitudes | Skipping `normalize: true` leaves vector magnitude in play: inner-product and Euclidean scores then reflect magnitude as well as direction, and the dot-product shortcut for cosine similarity no longer holds. | Always set `normalize: true` in Bedrock requests. Verify unit length in ingestion pipelines with a simple magnitude check. |
| Explicit Tenant Filtering Over RLS | Relying on a hand-written `WHERE tenant_id = ?` is error-prone: developers may forget it in new queries, and nothing at the database level catches the omission. | Enforce isolation with RLS policies driven by session variables. Use database roles with restricted permissions to prevent direct table access. |
| Context Window Saturation | Retrieving 20+ chunks buries the relevant passages in noise, increasing latency and hallucination rates. | Cap retrieval at 8-12 chunks. Implement a relevance threshold (e.g., `similarity_score < 0.35`, a cosine distance calibrated on your own data) to filter out weak matches before prompt assembly. |
| Fixed-Window Document Splitting | Character-count chunking cuts mid-sentence or mid-table, destroying semantic coherence and retrieval accuracy. | Use recursive sentence splitters or semantic chunking that detects topic boundaries. Preserve paragraph structure and table integrity. |
| Misinterpreting Cosine Distance Scores | The `<=>` operator returns distance (0-2), not probability. Treating 0.8 as "80% confidence" is mathematically incorrect. | Document score ranges clearly. Rank by relative distance, and calibrate any absolute cutoff empirically per model and corpus. Rescale to 0-1 (for example `1 - distance / 2`) only if your UI requires it. |
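
The threshold and score-interpretation rows can be combined into one small filter. A minimal sketch: `filterByDistance` and `cosineDistanceToSimilarity` are hypothetical names, and any cutoff such as 0.35 is an assumed starting value to calibrate against your own corpus.

```typescript
interface ScoredSegment {
  id: string;
  text: string;
  distance: number; // cosine distance from <=>, in [0, 2]
}

// Cosine distance is 1 - cosine similarity, so similarity lies in [-1, 1].
// It is a geometric measure, not a probability.
function cosineDistanceToSimilarity(distance: number): number {
  return 1 - distance;
}

// Drop weak matches, then keep the closest maxChunks segments.
// filter() returns a new array, so the caller's input is left untouched.
function filterByDistance(
  segments: ScoredSegment[],
  maxDistance: number,
  maxChunks: number
): ScoredSegment[] {
  return segments
    .filter((segment) => segment.distance < maxDistance)
    .sort((a, b) => a.distance - b.distance)
    .slice(0, maxChunks);
}
```

An empty result is a useful signal in its own right: it lets the application answer "not found in the knowledge base" without calling the LLM.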

## Production Bundle

### Action Checklist

### Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| < 500k segments, batch ingestion | IVFFlat with `lists = rows / 1000` | Lower memory overhead, faster index builds, sufficient recall for static datasets | Low (shared Postgres compute) |
| > 500k segments, continuous writes | HNSW (`m = 16`, `ef_construction = 64`) | Maintains search quality during incremental inserts, no rebuild required | Medium (higher RAM usage, ~15-20% more storage) |
| Multi-tenant SaaS with strict compliance | PostgreSQL RLS + session variables | Eliminates application-level filtering bugs, auditable at DB level, removes the most common source of cross-tenant leaks | Low (no additional infrastructure) |
| High-throughput ingestion pipeline | String literal casting + `ON CONFLICT` | Avoids native driver compilation, idempotent retries, ARM/x86 compatible | Low (reduces Lambda cold start overhead) |

### Configuration Template

```sql
-- Enable extension and enforce dimensionality
CREATE EXTENSION IF NOT EXISTS vector;

-- Document metadata
CREATE TABLE kb_documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  source_name TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Segments with vector constraint
CREATE TABLE kb_segments (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID NOT NULL REFERENCES kb_documents(id) ON DELETE CASCADE,
  tenant_id UUID NOT NULL,
  segment_order INT NOT NULL,
  raw_text TEXT NOT NULL,
  vector_embedding vector(1024),
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- RLS Policy
ALTER TABLE kb_segments ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON kb_segments
  USING (tenant_id = current_setting('app.tenant_id')::UUID);

-- Index (choose based on write pattern)
-- IVFFlat for batch/static (run this after the initial data load):
CREATE INDEX idx_segments_ivf ON kb_segments USING ivfflat (vector_embedding vector_cosine_ops) WITH (lists = 100);
-- HNSW for dynamic/high-write:
-- CREATE INDEX idx_segments_hnsw ON kb_segments USING hnsw (vector_embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
```

```bash
# .env
DATABASE_URL=postgresql://user:pass@host:5432/kb_db
AWS_REGION=us-east-1
BEDROCK_EMBED_MODEL=amazon.titan-embed-text-v2:0
BEDROCK_LLM_MODEL=anthropic.claude-haiku-4-5-20240307-v1:0
MAX_RETRIEVAL_CHUNKS=8
SIMILARITY_THRESHOLD=0.35
```
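
One way to consume these variables is a single typed loader that fails at startup when something is missing, which also supports the single-model-ID rule from the pitfall guide. A minimal sketch; `RagConfig`, `loadRagConfig`, and the fallback values for the two numeric settings are assumptions taken from the template above.

```typescript
interface RagConfig {
  databaseUrl: string;
  awsRegion: string;
  embedModelId: string;
  llmModelId: string;
  maxRetrievalChunks: number;
  similarityThreshold: number;
}

type Env = { [key: string]: string | undefined };

function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`missing required environment variable ${name}`);
  }
  return value;
}

function numberVar(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const parsed = Number(raw);
  if (!isFinite(parsed)) throw new Error(`${name} must be numeric, got "${raw}"`);
  return parsed;
}

// Accepts any string map so it can be exercised without touching process.env.
function loadRagConfig(env: Env): RagConfig {
  const maxRetrievalChunks = numberVar(env, "MAX_RETRIEVAL_CHUNKS", 8);
  if (maxRetrievalChunks < 1 || Math.floor(maxRetrievalChunks) !== maxRetrievalChunks) {
    throw new Error("MAX_RETRIEVAL_CHUNKS must be a positive integer");
  }
  return {
    databaseUrl: requireVar(env, "DATABASE_URL"),
    awsRegion: requireVar(env, "AWS_REGION"),
    embedModelId: requireVar(env, "BEDROCK_EMBED_MODEL"),
    llmModelId: requireVar(env, "BEDROCK_LLM_MODEL"),
    maxRetrievalChunks,
    similarityThreshold: numberVar(env, "SIMILARITY_THRESHOLD", 0.35),
  };
}
```

In a Node.js runtime this would typically be called once as `loadRagConfig(process.env)` and the result shared by the ingestion and query paths.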

### Quick Start Guide

- **Initialize Database:** Run the configuration template SQL against your PostgreSQL instance. Verify the `vector` extension is active and dimension constraints are enforced.
- **Configure IAM:** Attach `bedrock:InvokeModel` permissions to your compute role. Remove any hardcoded API keys or secret manager references for Bedrock access.
- **Deploy Ingestion Pipeline:** Use the string-casting insertion method to load documents. Ensure `normalize: true` and `dimensions: 1024` are set in all embedding requests. Build the ANN index after initial data load.
- **Test Query Path:** Execute a similarity search with a known tenant ID. Verify RLS isolation by querying with a mismatched tenant session variable. Confirm the LLM returns structured citations matching the retrieved segments.
- **Monitor & Tune:** Track query latency and `similarity_score` distributions. Adjust `lists` (IVFFlat) or `m`/`ef_construction` (HNSW) based on recall metrics. Implement chunk relevance filtering if noise increases.