RAG and Vector Search with pgvector and Amazon Bedrock (Part 4)

By Codcompass Team·2026-05-21·10 min read

Building Grounded AI Responses with PostgreSQL and Amazon Bedrock

Current Situation Analysis

Retrieval-Augmented Generation (RAG) has become the standard architecture for grounding LLM outputs in proprietary data. Yet, most implementation guides immediately point toward external vector databases like Pinecone, Weaviate, or Milvus. While these platforms excel at pure vector workloads, they introduce three persistent operational burdens: recurring infrastructure costs, separate authentication boundaries, and data synchronization complexity. For organizations already running PostgreSQL, this architectural split is often unnecessary.

The misconception stems from treating vector search as a fundamentally different problem than relational querying. In reality, vector similarity is just a mathematical distance calculation. The pgvector extension brings this capability directly into the database engine, allowing you to store embeddings alongside transactional data, query them with standard SQL, and enforce security policies at the row level. This consolidation is particularly valuable for multi-tenant SaaS applications where data isolation cannot be an afterthought.
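To make the "just a distance calculation" point concrete, here is a minimal sketch of cosine distance in plain TypeScript. It computes the same quantity that pgvector's `<=>` operator returns (1 minus cosine similarity). The function name `cosineDistance` is illustrative and does not come from any library.

```typescript
// Cosine distance between two equal-length vectors: 1 - (a·b) / (|a| * |b|).
// Result ranges from 0 (same direction) to 2 (opposite direction).
export function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("cosine distance is undefined for a zero vector");
  }
  return 1 - dot / Math.sqrt(normA * normB);
}
```

In practice the database performs this calculation; the sketch only shows that nothing about it requires a specialized engine.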

The industry overlooks this approach because of two assumptions: that relational databases cannot scale to millions of vector rows, and that vector indexing requires specialized infrastructure. Neither holds in modern deployments. pgvector supports approximate nearest-neighbor (ANN) indexes that can deliver sub-50ms latency on tables exceeding 10 million rows, provided the index is tuned for the workload and fits in memory. When paired with Amazon Bedrock's amazon.titan-embed-text-v2:0 model, you gain a fully managed embedding pipeline that authenticates through IAM roles and their temporary credentials, so there are no long-lived API keys to store or rotate. The result is a RAG architecture that reduces deployment surface area, leverages existing backup and monitoring pipelines, and enforces tenant isolation through Row-Level Security (RLS) without application-level filtering.

WOW Moment: Key Findings

The architectural shift from external vector stores to PostgreSQL-native vector search yields measurable improvements in operational efficiency and security posture. The table below contrasts the two approaches across critical production metrics.

| Approach | Infrastructure Cost | Tenant Isolation Mechanism | Deployment Complexity | Index Maintenance Overhead |
| --- | --- | --- | --- | --- |
| External Vector DB | High (per-million-vector pricing + egress) | Application-level filtering or separate namespaces | High (sync pipelines, dual auth, network routing) | High (manual rebuilds, partition management) |
| PostgreSQL + pgvector | Low (shared compute/storage with primary DB) | Native RLS policies enforced at query time | Low (single deployment artifact, unified auth) | Medium (automated VACUUM, periodic REINDEX) |

Why this matters: Consolidating vector storage into your primary datastore eliminates the synchronization lag between document ingestion and search availability. RLS policies automatically scope similarity searches to the requesting tenant, preventing cross-tenant data leakage without requiring developers to remember to add WHERE tenant_id = ? clauses. The trade-off is index tuning, but pgvector's ANN algorithms handle dynamic workloads efficiently when configured correctly. This architecture is ideal for teams that prioritize data governance, want to minimize third-party dependencies, and need predictable cost scaling.

Core Solution

Building a production-ready RAG pipeline with PostgreSQL and Bedrock requires careful coordination across four layers: embedding generation, vector storage, similarity retrieval, and LLM prompt assembly. Each layer makes specific trade-offs that impact latency, accuracy, and cost.

### 1. Embedding Generation with Amazon Bedrock

Both ingestion and query-time embedding must use the same model. Mixing embedding models creates incompatible vector spaces, rendering similarity calculations meaningless. We use amazon.titan-embed-text-v2:0 via the AWS SDK for JavaScript.

```typescript
import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";

const EMBED_MODEL_ID = "amazon.titan-embed-text-v2:0";
const bedrockClient = new BedrockRuntimeClient({ region: process.env.AWS_REGION });

// Request body accepted by Titan Text Embeddings V2.
interface EmbeddingRequest {
  inputText: string;
  dimensions: 1024 | 512 | 256;
  normalize: boolean;
}

export async function generateEmbedding(text: string): Promise<number[]> {
  const payload: EmbeddingRequest = {
    inputText: text,
    dimensions: 1024,
    normalize: true,
  };

  const command = new InvokeModelCommand({
    modelId: EMBED_MODEL_ID,
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify(payload),
  });

  // The response body is a byte array containing a JSON document.
  const response = await bedrockClient.send(command);
  const decoded = new TextDecoder().decode(response.body);
  const parsed = JSON.parse(decoded);
  return parsed.embedding as number[];
}
```


**Architecture rationale:**
- `dimensions: 1024` maximizes retrieval precision. While 256 or 512 dimensions reduce storage footprint, they sacrifice semantic granularity. For most enterprise knowledge bases, the storage savings are small compared to the accuracy loss.
- `normalize: true` makes Bedrock return unit-length vectors. For unit vectors, cosine similarity reduces to a plain dot product, so cosine distance, inner product, and Euclidean distance all produce the same ranking. The cosine operator used later (`<=>`) already ignores magnitude, but normalizing at the source keeps scores comparable if you ever switch to the inner-product or L2 operators, where unnormalized magnitudes would skew results.
- Authentication relies on the Lambda or ECS task IAM role. The `bedrock:InvokeModel` permission is granted via policy attachment, removing the need for API keys or secret manager lookups.
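A simple magnitude check in the ingestion path can confirm that embeddings really arrive unit-length. The sketch below is an illustration only: the helper names `l2Norm` and `isUnitLength` and the `1e-3` tolerance are assumptions, not part of the Bedrock SDK.

```typescript
// Euclidean (L2) length of a vector.
export function l2Norm(v: number[]): number {
  return Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
}

// True when the vector's length is within `tolerance` of 1.
// The tolerance absorbs floating-point rounding in the returned embedding.
export function isUnitLength(v: number[], tolerance = 1e-3): boolean {
  return Math.abs(l2Norm(v) - 1) <= tolerance;
}
```

For example, `isUnitLength([0.6, 0.8])` is true, while `isUnitLength([3, 4])` is false because that vector has length 5. A pipeline could log or reject embeddings that fail the check.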

### 2. Schema Design and Vector Constraints

PostgreSQL enforces strict type checking on `pgvector` columns. Defining the dimensionality at the schema level acts as a runtime guardrail against model version mismatches.

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE kb_documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  source_name TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE kb_segments (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID NOT NULL REFERENCES kb_documents(id) ON DELETE CASCADE,
  tenant_id UUID NOT NULL,
  segment_order INT NOT NULL,
  raw_text TEXT NOT NULL,
  vector_embedding vector(1024),
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  -- Gives ON CONFLICT DO NOTHING a key to conflict on during ingestion retries.
  UNIQUE (document_id, segment_order)
);
```

Architecture rationale: The vector(1024) type constraint rejects any insert whose vector has a different number of dimensions. If you later switch to a model that outputs 512 dimensions, the insert fails with a dimension-mismatch error (for example, "expected 1024 dimensions, not 512") instead of silently mixing incompatible vectors in one column. This fail-fast behavior is essential for maintaining pipeline integrity.
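The same guard can be mirrored in application code so that a mismatch surfaces before the database round trip. This is a minimal sketch; `EXPECTED_DIMENSIONS` and `assertEmbeddingDimensions` are hypothetical names chosen for illustration.

```typescript
// Must match the vector(1024) column definition in the schema.
const EXPECTED_DIMENSIONS = 1024;

// Throws if the embedding length differs from the schema's dimensionality.
export function assertEmbeddingDimensions(
  embedding: number[],
  expected: number = EXPECTED_DIMENSIONS
): void {
  if (embedding.length !== expected) {
    throw new Error(
      `embedding has ${embedding.length} dimensions, expected ${expected}`
    );
  }
}
```

Calling this right after the embedding request gives a clearer error message than the database would, while the column constraint remains the final safeguard.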

### 3. Indexing Strategy: IVFFlat vs HNSW

Exact nearest-neighbor search scales linearly with table size. For datasets exceeding 100k segments, approximate nearest-neighbor (ANN) indexing becomes mandatory.

```sql
CREATE INDEX idx_kb_segments_ivf
ON kb_segments
USING ivfflat (vector_embedding vector_cosine_ops)
WITH (lists = 100);
```

Architecture rationale: IVFFlat partitions vectors into clusters (lists) during index creation. At query time, it searches only the clusters nearest the query vector (controlled by the ivfflat.probes setting) rather than the entire table. The lists = 100 value is a starting point; the pgvector documentation suggests rows / 1000 for tables up to 1 million rows and sqrt(rows) beyond that, with probes starting at roughly sqrt(lists).
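The sizing guidance from the pgvector documentation (rows / 1000 up to 1 million rows, sqrt(rows) above that, and probes near sqrt(lists)) can be expressed as a small helper. This is a sketch of starting values only; the function name `suggestIvfflatParams` is hypothetical, and real deployments should tune against measured recall.

```typescript
// Starting values for IVFFlat, following the pgvector README's rule of thumb.
export function suggestIvfflatParams(rowCount: number): { lists: number; probes: number } {
  if (!Number.isInteger(rowCount) || rowCount < 0) {
    throw new RangeError("rowCount must be a non-negative integer");
  }
  // rows / 1000 for up to 1M rows, sqrt(rows) for larger tables.
  const rawLists = rowCount <= 1e6 ? rowCount / 1000 : Math.sqrt(rowCount);
  const lists = Math.max(1, Math.round(rawLists));
  // Query-time probes: start at sqrt(lists).
  const probes = Math.max(1, Math.round(Math.sqrt(lists)));
  return { lists, probes };
}
```

For a table of 100,000 segments this yields `lists = 100` and `probes = 10`, which matches the index definition shown earlier. The probes value would be applied per session or transaction through the `ivfflat.probes` setting.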

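Critical constraint: IVFFlat computes its cluster centroids from the rows present when the index is built. Creating it on an empty or nearly empty table yields centroids that do not reflect the eventual data, so recall suffers as the table grows (pgvector emits a low-recall notice in this case). For continuously growing datasets, HNSW (Hierarchical Navigable Small World) is usually the better choice because it has no training step and maintains search quality during incremental inserts, at the cost of slower builds and higher memory use. Switch to HNSW when your segment count exceeds 500k or when write throughput is high.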
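The selection rule described above can be captured in a small helper that returns the matching index statement. This is only a sketch: the 500k threshold and HNSW parameters are the ones quoted in this article, and `chooseAnnIndex` is a hypothetical name.

```typescript
type AnnIndexChoice = { kind: "ivfflat" | "hnsw"; ddl: string };

// Picks an index type using the article's rule of thumb:
// HNSW above 500k segments or under high write throughput, IVFFlat otherwise.
export function chooseAnnIndex(
  segmentCount: number,
  highWriteThroughput: boolean,
  ivfLists: number = 100
): AnnIndexChoice {
  if (segmentCount > 500000 || highWriteThroughput) {
    return {
      kind: "hnsw",
      ddl:
        "CREATE INDEX idx_segments_hnsw ON kb_segments " +
        "USING hnsw (vector_embedding vector_cosine_ops) " +
        "WITH (m = 16, ef_construction = 64);",
    };
  }
  return {
    kind: "ivfflat",
    ddl:
      "CREATE INDEX idx_kb_segments_ivf ON kb_segments " +
      "USING ivfflat (vector_embedding vector_cosine_ops) " +
      `WITH (lists = ${ivfLists});`,
  };
}
```

A migration script could call this once after the initial load and run the returned statement, keeping the decision in one reviewable place.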

### 4. Vector Insertion Without Native Drivers

Many PostgreSQL drivers lack native pgvector type support. Rather than compiling C extensions or managing language-specific packages, we use string literal casting.

```typescript
import { Pool } from "pg";

const dbPool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function persistSegment(
  docId: string,
  tenantId: string,
  order: number,
  text: string,
  embedding: number[]
): Promise<void> {
  const vectorLiteral = `[${embedding.join(",")}]`;

  await dbPool.query(
    `INSERT INTO kb_segments
     (document_id, tenant_id, segment_order, raw_text, vector_embedding)
     VALUES ($1, $2, $3, $4, $5::vector)
     ON CONFLICT DO NOTHING`,
    [docId, tenantId, order, text, vectorLiteral]
  );
}
```

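Architecture rationale: The ::vector cast converts the bracketed, comma-separated string into the native pgvector type at execution time. This approach works with any PostgreSQL driver that can send text parameters, eliminates native compilation dependencies, and maintains compatibility with both x86 and ARM Lambda runtimes. The ON CONFLICT DO NOTHING clause makes retries idempotent, but only when the table has a unique constraint that a retried row can collide with, such as UNIQUE (document_id, segment_order). With nothing but the generated id as a key, every retry inserts a new row.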
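Because the literal is assembled by string concatenation, it can help to validate the numbers before they reach the database. The helper below is a minimal sketch; `toVectorLiteral` is a hypothetical name, and rejecting non-finite values is an assumption about what the pipeline should treat as invalid.

```typescript
// Serializes an embedding into the "[x1,x2,...]" text form that is cast with ::vector.
// Rejects empty arrays and NaN/Infinity values up front.
export function toVectorLiteral(embedding: number[]): string {
  if (embedding.length === 0) {
    throw new Error("embedding must not be empty");
  }
  for (const value of embedding) {
    if (!Number.isFinite(value)) {
      throw new Error(`embedding contains a non-finite value: ${value}`);
    }
  }
  return `[${embedding.join(",")}]`;
}
```

Both the insert path and the query path could share this function so that the two literals are always built the same way.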

### 5. Similarity Search with RLS Enforcement

Tenant isolation is enforced at the database layer, not the application layer. Session variables trigger RLS policies automatically.

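```typescript
export async function retrieveRelevantSegments(
  tenantId: string,
  queryVector: number[],
  limit: number = 8
): Promise<Array<{ id: string; text: string; distance: number; docName: string }>> {
  const vectorLiteral = `[${queryVector.join(",")}]`;

  const client = await dbPool.connect();
  try {
    // A transaction-scoped setting only takes effect inside a transaction.
    await client.query("BEGIN");
    // SET does not accept bind parameters; set_config(..., true) is the
    // parameterized equivalent of SET LOCAL.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);

    const result = await client.query(
      `SELECT s.id, s.raw_text, s.segment_order,
              d.source_name,
              s.vector_embedding <=> $1::vector AS similarity_score
       FROM kb_segments s
       JOIN kb_documents d ON d.id = s.document_id
       ORDER BY similarity_score ASC
       LIMIT $2`,
      [vectorLiteral, limit]
    );

    await client.query("COMMIT");

    return result.rows.map(row => ({
      id: row.id,
      text: row.raw_text,
      distance: parseFloat(row.similarity_score),
      docName: row.source_name,
    }));
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```

Architecture rationale: The <=> operator computes cosine distance. Values range from 0 (identical direction) to 2 (opposite), so the similarity_score alias actually holds a distance, and ordering by ascending value returns the most semantically relevant segments first. The tenant is set with set_config('app.tenant_id', ..., true), the parameterized form of SET LOCAL, because a plain SET statement cannot take bind parameters. The setting is scoped to the current transaction, which is why the query runs between BEGIN and COMMIT and why the value cannot leak to the next request that reuses the pooled connection. It activates the RLS policy without requiring explicit WHERE tenant_id = ? clauses, which prevents accidental data leakage when developers modify queries. Note that PostgreSQL does not apply RLS to superusers or, by default, to the table owner, so the application should connect as a separate non-owner role.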
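One way to avoid repeating the tenant-scoping steps in every query function is a small transaction wrapper. The sketch below is an assumption-laden illustration: `withTenant` and the `SqlClient` interface are hypothetical, the interface is assumed to be structurally compatible with a `pg` client, and it uses `set_config(..., true)` (the function form of `SET LOCAL`) because `SET` itself does not accept bind parameters.

```typescript
// Minimal shape of a database client; assumed compatible with a pg PoolClient.
interface SqlClient {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

// Runs `work` inside a transaction with app.tenant_id set for that transaction only.
export async function withTenant<T>(
  client: SqlClient,
  tenantId: string,
  work: (client: SqlClient) => Promise<T>
): Promise<T> {
  await client.query("BEGIN");
  try {
    // Third argument `true` makes the setting transaction-local.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    // Roll back so the pooled connection is returned in a clean state.
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Because the setting is discarded at COMMIT or ROLLBACK, a connection returned to the pool carries no tenant context into the next request.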

The limit: 8 default balances recall and latency. Eight segments at ~500 tokens each yield ~4,000 context tokens, which fits comfortably within Claude's optimal processing window while minimizing noise. Increasing this number improves recall but degrades generation speed and increases hallucination risk.
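The chunk cap can be combined with a distance cutoff so that weak matches never reach the prompt. The following is a minimal sketch: `selectContext` and `RetrievedSegment` are hypothetical names, and the 0.35 default is the example cutoff used elsewhere in this article, not a universally correct value.

```typescript
interface RetrievedSegment {
  id: string;
  text: string;
  distance: number; // cosine distance, lower is more relevant
  docName: string;
}

// Keeps segments at or below maxDistance, ordered by relevance, capped at maxChunks.
export function selectContext(
  segments: RetrievedSegment[],
  maxDistance: number = 0.35,
  maxChunks: number = 8
): RetrievedSegment[] {
  return segments
    .filter(seg => seg.distance <= maxDistance) // filter() copies, so the input is not mutated
    .sort((a, b) => a.distance - b.distance)
    .slice(0, maxChunks);
}
```

If the result is empty, the application can skip the LLM call and tell the user that no relevant material was found, rather than prompting with unrelated excerpts.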

### 6. Prompt Assembly and LLM Generation

The retrieved segments are formatted with explicit citation markers. The system prompt enforces strict grounding rules.

```typescript
import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";

// Claude Haiku 4.5. Confirm the exact model ID (or inference profile ID)
// available in your Region before deploying.
const LLM_MODEL_ID = "anthropic.claude-haiku-4-5-20251001-v1:0";
const llmClient = new BedrockRuntimeClient({ region: process.env.AWS_REGION });

export async function generateGroundedResponse(
  userQuestion: string,
  segments: Array<{ id: string; text: string; docName: string }>
): Promise<{ answer: string; citations: Array<{ docId: string; excerpt: string }> }> {
  // Number each excerpt so the model can cite it as [1], [2], ...
  const contextBlocks = segments
    .map((seg, idx) => `[${idx + 1}] From "${seg.docName}":\n${seg.text}`)
    .join("\n\n");

  const systemPrompt = `You are a technical research assistant. Answer the user's question using ONLY the provided excerpts. Cite sources using [1], [2], etc. If the answer is not in the excerpts, state that clearly.`;

  const userPrompt = `Excerpts:\n${contextBlocks}\n\nQuestion: ${userQuestion}`;

  const payload = {
    anthropic_version: "bedrock-2023-05-31",
    max_tokens: 1024,
    system: systemPrompt,
    messages: [{ role: "user", content: userPrompt }],
  };

  const command = new InvokeModelCommand({
    modelId: LLM_MODEL_ID,
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify(payload),
  });

  const response = await llmClient.send(command);
  const decoded = new TextDecoder().decode(response.body);
  const parsed = JSON.parse(decoded);
  const answer = parsed.content[0].text;

  // Citations are built from the retrieved segments, not parsed from the answer.
  const citations = segments.map(seg => ({
    docId: seg.id,
    excerpt: seg.text.slice(0, 200),
  }));

  return { answer, citations };
}
```

Architecture rationale: Claude Haiku 4.5 is selected for its speed and cost efficiency. The task is primarily summarization and formatting, not complex reasoning or knowledge retrieval. The max_tokens: 1024 cap ensures predictable latency and cost. The citation extraction happens deterministically on the application side, guaranteeing that the frontend receives structured source data regardless of how the LLM formats its inline references. This separation of concerns improves UI rendering reliability and enables click-to-verify functionality.
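The function above returns every retrieved segment as a citation. If the frontend should highlight only the sources the model actually referenced, the inline markers can be parsed on the application side. This is an optional refinement sketched under assumptions: `extractCitedIndices` is a hypothetical helper, and it assumes the model follows the `[1]`, `[2]` format requested in the system prompt.

```typescript
// Returns the distinct 1-based excerpt numbers cited in the answer, in ascending order.
// Markers outside the range 1..segmentCount are ignored.
export function extractCitedIndices(answer: string, segmentCount: number): number[] {
  const pattern = /\[(\d+)\]/g;
  const seen = new Set<number>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const n = Number(match[1]);
    if (n >= 1 && n <= segmentCount) {
      seen.add(n);
    }
  }
  return Array.from(seen).sort((a, b) => a - b);
}
```

Each returned number `n` maps to `segments[n - 1]`, so the citation list can be narrowed without trusting the model to produce structured output.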

Pitfall Guide

| Pitfall | Explanation | Fix |
| --- | --- | --- |
| Cross-Model Embedding Mismatch | Using different models for ingestion vs. query creates incompatible vector spaces, so distances between their vectors are meaningless. | Enforce a single model ID in a shared configuration module. Add integration tests that verify embedding dimensionality matches schema constraints. |
| IVFFlat Initialization on Empty Tables | IVFFlat computes cluster centroids at build time. An index built on an empty or tiny table has unrepresentative centroids, which leads to poor recall as data arrives. | Populate the table with representative data before creating the index. For dynamic workloads, switch to HNSW or schedule periodic `REINDEX` operations. |
| Unnormalized Vector Magnitudes | Skipping `normalize: true` returns vectors of varying magnitude. Cosine distance (`<=>`) is unaffected, but inner-product (`<#>`) and L2 (`<->`) scores are then skewed by magnitude rather than semantics. | Always set `normalize: true` in Bedrock requests. Verify unit length in ingestion pipelines with a simple magnitude check. |
| Explicit Tenant Filtering Over RLS | Relying on a hand-written `WHERE tenant_id = ?` is error-prone: developers may forget it in new queries, and nothing at the database level catches the omission. | Rely on RLS policies triggered by session variables. Use database roles with restricted permissions to prevent direct table access. |
| Context Window Saturation | Retrieving 20+ chunks overwhelms the LLM's attention mechanism, increasing latency and hallucination rates. | Cap retrieval at 8-12 chunks. Implement a relevance threshold (e.g., `similarity_score < 0.35`) to filter out weak matches before prompt assembly. |
| Fixed-Window Document Splitting | Character-count chunking cuts mid-sentence or mid-table, destroying semantic coherence and retrieval accuracy. | Use recursive sentence splitters or semantic chunking that detects topic boundaries. Preserve paragraph structure and table integrity. |
| Misinterpreting Cosine Distance Scores | The `<=>` operator returns distance (0-2), not probability. Treating 0.8 as "80% confidence" is mathematically incorrect. | Document score ranges clearly. Rank results relative to one another, and calibrate any absolute cutoff empirically on your own corpus. Normalize scores to 0-1 only if your UI requires it. |
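As an illustration of the sentence-aware splitting recommended in the table, the sketch below packs whole sentences into chunks up to a character budget. It is deliberately naive: `chunkBySentence` is a hypothetical helper, the sentence detection is a simple punctuation rule that will mis-split abbreviations and does not handle tables, and the 2000-character default is an assumption.

```typescript
// Splits text into chunks of at most maxChars, never cutting inside a sentence.
// A single sentence longer than maxChars is emitted as its own oversized chunk.
export function chunkBySentence(text: string, maxChars: number = 2000): string[] {
  // Naive sentence detection: a run of text followed by ., !, or ? characters.
  const sentences = (text.match(/[^.!?]+[.!?]*/g) ?? [])
    .map(s => s.trim())
    .filter(s => s.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length > maxChars && current) {
      // Adding this sentence would exceed the budget: close the current chunk.
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) {
    chunks.push(current);
  }
  return chunks;
}
```

A production pipeline would likely replace the punctuation rule with a proper sentence tokenizer and add overlap between adjacent chunks, but the packing logic stays the same.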

Production Bundle

Action Checklist

- Verify embedding model consistency across ingestion and query pipelines
- Set normalize: true and dimensions: 1024 in all Bedrock embedding requests
- Create pgvector extension and enforce vector(1024) type constraints at schema level
- Build IVFFlat index only after seeding representative data, or switch to HNSW for high-write workloads
- Implement RLS policies triggered by session variables instead of application-level WHERE clauses
- Cap retrieval at 8 chunks and apply a maximum cosine-distance threshold before prompt assembly
- Use string literal casting (::vector) for driver-agnostic insertion without native dependencies
- Structure LLM responses with deterministic citation arrays for frontend rendering reliability

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
| --- | --- | --- | --- |
| < 500k segments, batch ingestion | IVFFlat with `lists = rows / 1000` | Lower memory overhead, faster index builds, sufficient recall for static datasets | Low (shared Postgres compute) |
| > 500k segments, continuous writes | HNSW (`m = 16, ef_construction = 64`) | Maintains search quality during incremental inserts, no rebuild required | Medium (higher RAM usage, ~15-20% more storage) |
| Multi-tenant SaaS with strict compliance | PostgreSQL RLS + session variables | Eliminates application-level filtering bugs, auditable at DB level, isolation enforced on every query | Low (no additional infrastructure) |
| High-throughput ingestion pipeline | String literal casting + `ON CONFLICT` | Avoids native driver compilation, idempotent retries, ARM/x86 compatible | Low (reduces Lambda cold start overhead) |

Configuration Template

```sql
-- Enable extension and enforce dimensionality
CREATE EXTENSION IF NOT EXISTS vector;

-- Document metadata
CREATE TABLE kb_documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  source_name TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Segments with vector constraint
CREATE TABLE kb_segments (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID NOT NULL REFERENCES kb_documents(id) ON DELETE CASCADE,
  tenant_id UUID NOT NULL,
  segment_order INT NOT NULL,
  raw_text TEXT NOT NULL,
  vector_embedding vector(1024),
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  -- Lets ON CONFLICT DO NOTHING deduplicate retried inserts
  UNIQUE (document_id, segment_order)
);

-- RLS Policy
-- Table owners and superusers are not subject to RLS by default;
-- connect from the application as a separate, non-owner role.
ALTER TABLE kb_segments ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON kb_segments
  USING (tenant_id = current_setting('app.tenant_id')::UUID);

-- Index (choose based on write pattern)
-- IVFFlat for batch/static (run after the initial data load, not on an empty table):
CREATE INDEX idx_segments_ivf ON kb_segments USING ivfflat (vector_embedding vector_cosine_ops) WITH (lists = 100);
-- HNSW for dynamic/high-write:
-- CREATE INDEX idx_segments_hnsw ON kb_segments USING hnsw (vector_embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
```

```bash
# .env
DATABASE_URL=postgresql://user:pass@host:5432/kb_db
AWS_REGION=us-east-1
BEDROCK_EMBED_MODEL=amazon.titan-embed-text-v2:0
BEDROCK_LLM_MODEL=anthropic.claude-haiku-4-5-20251001-v1:0
MAX_RETRIEVAL_CHUNKS=8
SIMILARITY_THRESHOLD=0.35
```

Quick Start Guide

1. Initialize Database: Run the configuration template SQL against your PostgreSQL instance. Verify the vector extension is active and dimension constraints are enforced.
2. Configure IAM: Attach bedrock:InvokeModel permissions to your compute role. Remove any hardcoded API keys or secret manager references for Bedrock access.
3. Deploy Ingestion Pipeline: Use the string-casting insertion method to load documents. Ensure normalize: true and dimensions: 1024 are set in all embedding requests. Build the ANN index after initial data load.
4. Test Query Path: Execute a similarity search with a known tenant ID. Verify RLS isolation by querying with a mismatched tenant session variable. Confirm the LLM returns structured citations matching the retrieved segments.
5. Monitor & Tune: Track query latency and similarity_score distributions. Adjust lists (IVFFlat) or m/ef_construction (HNSW) based on recall metrics. Implement chunk relevance filtering if noise increases.
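To tune against recall, you need a number to track. One common approach is to run the same query twice, once through the ANN index and once as an exact scan (for example, with index scans disabled for that transaction), and compare the returned IDs. The sketch below computes that ratio; `recallAtK` is a hypothetical helper name.

```typescript
// Fraction of the exact top-k neighbours that the approximate search also returned.
// 1.0 means the ANN index found every true neighbour for this query.
export function recallAtK(approxIds: string[], exactIds: string[]): number {
  if (exactIds.length === 0) {
    throw new Error("exactIds must not be empty");
  }
  const approx = new Set(approxIds);
  const exact = new Set(exactIds);
  let hits = 0;
  exact.forEach(id => {
    if (approx.has(id)) {
      hits++;
    }
  });
  return hits / exact.size;
}
```

Averaging this value over a sample of representative queries gives a baseline; if it drops after data growth, that is a signal to raise probes, rebuild the IVFFlat index, or move to HNSW.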
