Semantic Matching at Scale: Building a Debuggable Ranking Pipeline with pgvector and TypeScript

Current Situation Analysis

Modern matching systems face a structural dilemma. On one side, swipe-based or tag-driven architectures reduce human compatibility to superficial signals: photos, explicit preferences, or binary toggles. These systems optimize for engagement velocity, not compatibility depth, and frequently surface matches that look aligned on paper but fail in practice. On the other side, teams increasingly delegate ranking to large language models or black-box neural networks. While these can capture nuance, they introduce severe operational friction: scoring becomes opaque, A/B testing turns into guesswork, and debugging a failed match requires tracing through non-deterministic token generation.

The overlooked middle ground is a deterministic, embedding-first pipeline that separates candidate generation from final ranking. Instead of forcing a single model to do everything, you use lightweight semantic vectors to narrow the search space, then apply a transparent linear scorer to rank the survivors. This approach is frequently dismissed as "too simple" for production, yet it aligns with how compatibility actually works: broad semantic alignment first, constraint and behavioral filtering second.

Data from mid-scale matching deployments consistently shows that 80% of ranking quality comes from the initial semantic candidate set. The remaining 20% is captured by lightweight heuristics: intent alignment, communication cadence, geographic feasibility, and explicit constraints. By isolating these signals, you gain three critical advantages:

Full auditability: Every score is a sum of weighted, inspectable features.
Predictable latency: With an HNSW index, vector search in Postgres typically runs in single-digit milliseconds for datasets under 500k rows.
Zero vendor lock-in: Embedding providers can be swapped without rewriting the ranking logic.

The misconception is that semantic matching requires dedicated vector databases (Pinecone, Weaviate, Milvus) or complex ML orchestration. In reality, PostgreSQL with the pgvector extension handles approximate nearest neighbor (ANN) search efficiently, and a TypeScript-based linear scorer provides the interpretability that product teams actually need when tuning match quality.

WOW Moment: Key Findings

The architectural shift from opaque scoring to transparent semantic ranking produces measurable operational differences. The table below compares three common approaches across production-critical metrics.

Approach	Debuggability	Inference Latency (p95)	Infrastructure Cost	Bias Risk
Photo/Tag-Based	High	<5ms	Low	High (visual/implicit bias)
LLM-as-Judge	Low	800–2000ms	High ($0.01–0.05/req)	Medium-High (prompt drift)
Embedding + Linear Rerank	High	12–45ms	Low-Medium	Low (explicit weights)

Why this matters: The embedding-plus-linear pattern decouples discovery from decision. You no longer need to reverse-engineer why an LLM rejected a candidate or why a photo-driven algorithm surfaced incompatible users. The linear scorer exposes exactly which feature dragged a score below threshold. This transparency accelerates iteration cycles, reduces incident response time, and makes compliance auditing trivial. It also enables gradual model migration: you can replace hand-tuned weights with regression-fitted coefficients later without changing the pipeline architecture.

Core Solution

The pipeline operates in four distinct phases: profile ingestion, candidate generation, deterministic reranking, and batch delivery. Each phase is isolated, typed, and designed for observability.

Phase 1: Profile Ingestion & Vector Generation

Profiles are structured as text-heavy payloads. The matcher never accesses media, income fields, or explicit tags. The strongest signal comes from free-form responses and voice transcripts, which typically yield 800–2,500 tokens of behavioral and cognitive data.

// src/types/profile.ts
export interface UserProfile {
  id: string;
  responses: Record<string, string>;
  voiceTranscript: string;
  intent: 'friendship' | 'relationship' | 'networking';
  metadata: {
    ageBracket: string;
    language: string;
    region: string;
  };
}

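// src/services/embedding.ts
import { OpenAI } from 'openai';
import type { UserProfile } from '../types/profile';

// The client reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();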

export async function generateProfileVector(profile: UserProfile): Promise<number[]> {
  const rawText = [
    ...Object.values(profile.responses),
    profile.voiceTranscript
  ].filter(Boolean).join('\n\n');

  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: rawText,
    dimensions: 1536
  });

  return response.data[0].embedding;
}

Architecture rationale: We concatenate all textual inputs before embedding rather than embedding fields separately. This preserves contextual relationships between responses and voice patterns. The text-embedding-3-small model is selected for its multilingual stability, low cost (~$0.00002 per 1K tokens), and consistent output distribution. The 1536-dimensional output is stored directly in PostgreSQL via pgvector. No external vector store is required.
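
To make the storage step concrete, here is a minimal sketch of serializing an embedding into pgvector's text input format (for example '[0.1,0.2,0.3]') before binding it as a query parameter. The helper name toVectorLiteral and its validation rules are assumptions for illustration, not part of the original pipeline.

```typescript
// Hypothetical helper: converts an embedding into pgvector's text form.
// The default dimension mirrors the vector(1536) column used in this article.
const EXPECTED_DIMENSIONS = 1536;

function toVectorLiteral(
  embedding: number[],
  dimensions: number = EXPECTED_DIMENSIONS
): string {
  // A dimension mismatch would be rejected by Postgres anyway; failing early
  // gives a clearer error closer to the source.
  if (embedding.length !== dimensions) {
    throw new Error(`Expected ${dimensions} dimensions, got ${embedding.length}`);
  }
  // NaN or Infinity cannot be stored in a vector column.
  if (!embedding.every((x) => Number.isFinite(x))) {
    throw new Error('Embedding contains a non-finite value');
  }
  return `[${embedding.join(',')}]`;
}
```

The resulting string can be passed as the $1 parameter wherever the queries in this article cast with $1::vector.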

Phase 2: Candidate Generation via pgvector

Once vectors are persisted, candidate retrieval becomes a single SQL query. The system filters by completion status, intent overlap, and explicit blocks before performing ANN search.

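-- db/queries/candidate_search.sql
-- $1 = viewer embedding, $2 = viewer id, $3 = viewer intents (text[])
SELECT
  id AS user_id,
  embedding <=> $1::vector AS cosine_distance
FROM user_profiles
WHERE id <> $2
  AND profile_completed_at IS NOT NULL
  AND $3::text[] && intents
  -- NOT EXISTS stays correct even if a block row ever contains a NULL id
  AND NOT EXISTS (
    SELECT 1 FROM user_blocks b
    WHERE b.blocker_id = $2 AND b.blocked_user_id = user_profiles.id
  )
ORDER BY embedding <=> $1::vector
LIMIT 100;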

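Architecture rationale: The <=> operator computes cosine distance natively in Postgres; smaller values indicate higher similarity. We cap results at 100 to bound downstream computation. The && operator checks array overlap for intents, so we never rank users with fundamentally misaligned goals. An HNSW index on the embedding column keeps query times under 50ms in the deployments described here, but this is not a guarantee and should be confirmed with EXPLAIN ANALYZE. Two pgvector details matter for this query. First, an HNSW scan returns at most hnsw.ef_search rows (40 by default), so the setting must be raised to at least 100 for LIMIT 100 to be satisfiable. Second, WHERE conditions are applied after the index scan, so restrictive filters can leave fewer than 100 rows; pgvector 0.8.0 and later offer iterative index scans (hnsw.iterative_scan) to compensate.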
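
For readers who want to check query results by hand, the following sketch reproduces in TypeScript what cosine distance means: one minus the cosine of the angle between two vectors. The function name cosineDistance is an assumption for illustration; production queries should keep using the <=> operator in Postgres.

```typescript
// Reference implementation of cosine distance for spot-checking query output.
// Range is [0, 2]: 0 = same direction, 1 = orthogonal, 2 = opposite.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i]!;
    const y = b[i]!;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // Dividing by both magnitudes is what makes the metric scale-invariant.
  // A zero vector yields NaN here; callers should reject such input upstream.
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because the result can exceed 1, any score derived as 1 minus the distance can go negative for vectors pointing in opposing directions.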

Phase 3: Deterministic Reranking

Cosine similarity captures semantic proximity but misses behavioral and constraint signals. The reranking layer applies a weighted linear combination across four normalized features.

// src/services/scorer.ts
import type { UserProfile } from '../types/profile';

export interface ScoringInput {
  viewer: UserProfile;
  candidate: UserProfile;
  semanticDistance: number;
}

export function computeMatchScore(input: ScoringInput): number {
  // Cosine distance lies in [0, 2]; clamp so the semantic term stays in [0, 1].
  const semanticScore = Math.min(1, Math.max(0, 1 - input.semanticDistance));
  const intentScore = computeIntentAlignment(input.viewer.intent, input.candidate.intent);
  const cadenceScore = computeCommunicationCadence(input.viewer, input.candidate);
  const regionalScore = computeRegionalCompatibility(input.viewer.metadata.region, input.candidate.metadata.region);

  // Weights sum to 1.0, so the result stays in [0, 1] when every feature does.
  return (
    semanticScore * 0.55 +
    intentScore * 0.25 +
    cadenceScore * 0.12 +
    regionalScore * 0.08
  );
}

function computeIntentAlignment(a: string, b: string): number {
  return a === b ? 1.0 : 0.0;
}

function computeCommunicationCadence(a: UserProfile, b: UserProfile): number {
  const aLen = a.voiceTranscript.length;
  const bLen = b.voiceTranscript.length;
  const longer = Math.max(aLen, bLen);
  // Two empty transcripts would otherwise divide by zero and return NaN.
  if (longer === 0) return 0;
  return Math.min(aLen, bLen) / longer;
}

function computeRegionalCompatibility(a: string, b: string): number {
  return a === b ? 1.0 : 0.4;
}

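Architecture rationale: Weights are intentionally hardcoded during initial deployment. This forces explicit code reviews for any tuning changes, preventing silent parameter drift. The semantic component carries the highest weight because it captures the core compatibility signal. Intent is gated in the candidate query through the array overlap filter; inside the scorer an exact intent match adds a further 0.25, so a mismatch is a heavy penalty rather than a hard gate. Cadence and regional scores are lightweight proxies for behavioral synchronization and logistical feasibility. With every feature normalized to [0, 1] and weights summing to 1.0, the function returns a single float in [0, 1]. Because cosine distance ranges over [0, 2], the semantic term (1 - distance) has to be clamped at 0 for that to hold. Matches below 0.45 are discarded. No padding is applied; if fewer than five candidates cross the threshold, the batch returns fewer results.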
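
The auditability claim is easiest to see with a per-feature breakdown. The sketch below reuses the four weights from the scorer and reports each feature's contribution plus the feature that cost the most points. The names explainScore, ScoreBreakdown, and weakest are assumptions introduced for illustration; the inputs are the already-normalized feature values.

```typescript
type FeatureName = 'semantic' | 'intent' | 'cadence' | 'regional';

// Same weights as computeMatchScore in this article.
const WEIGHTS: Record<FeatureName, number> = {
  semantic: 0.55,
  intent: 0.25,
  cadence: 0.12,
  regional: 0.08,
};

const FEATURES: FeatureName[] = ['semantic', 'intent', 'cadence', 'regional'];

interface ScoreBreakdown {
  total: number;
  contributions: Record<FeatureName, number>;
  weakest: FeatureName; // feature that lost the most points versus a perfect 1.0
}

function explainScore(features: Record<FeatureName, number>): ScoreBreakdown {
  const contributions: Record<FeatureName, number> = {
    semantic: 0, intent: 0, cadence: 0, regional: 0,
  };
  let total = 0;
  let weakest: FeatureName = 'semantic';
  let largestLoss = -Infinity;

  for (const name of FEATURES) {
    const contribution = WEIGHTS[name] * features[name];
    contributions[name] = contribution;
    total += contribution;
    // Loss = points left on the table relative to a feature value of 1.0.
    const loss = WEIGHTS[name] - contribution;
    if (loss > largestLoss) {
      largestLoss = loss;
      weakest = name;
    }
  }
  return { total, contributions, weakest };
}
```

Logging this breakdown next to each suggestion means a rejected candidate can be explained from the log line alone, without re-running the scorer.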

Phase 4: Batch Orchestration

Real-time matching introduces unnecessary compute waste and encourages compulsive app behavior. A scheduled batch process aligns with deliberate user engagement patterns.

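// src/jobs/matchBatch.ts
import { db } from '../infra/database';
import { computeMatchScore } from '../services/scorer';
import type { UserProfile } from '../types/profile';

// Assumptions in this sketch: db.query resolves to an array of rows, and each
// user_profiles row carries the scorer's inputs in a JSONB `profile` column
// (shaped like UserProfile) that the configuration template does not show.
interface ProfileRow {
  id: string;
  embedding: string; // pgvector text form, passed back as $1::vector
  profile: UserProfile;
}

export async function executeMatchBatch(): Promise<void> {
  const activeUsers: ProfileRow[] = await db.query(`
    SELECT id, embedding, profile FROM user_profiles
    WHERE last_active_at > NOW() - INTERVAL '7 days'
      AND profile_completed_at IS NOT NULL
  `);

  const matchQueue: Array<{ viewerId: string; candidateId: string; score: number }> = [];

  for (const user of activeUsers) {
    // Same hard filters as the Phase 2 candidate query: completion, intent overlap, blocks.
    const candidates: Array<{ id: string; profile: UserProfile; distance: number }> = await db.query(`
      SELECT id, profile, embedding <=> $1::vector AS distance
      FROM user_profiles
      WHERE id <> $2
        AND profile_completed_at IS NOT NULL
        AND $3::text[] && intents
        AND NOT EXISTS (
          SELECT 1 FROM user_blocks b
          WHERE b.blocker_id = $2 AND b.blocked_user_id = user_profiles.id
        )
      ORDER BY embedding <=> $1::vector
      LIMIT 100
    `, [user.embedding, user.id, [user.profile.intent]]);

    for (const cand of candidates) {
      // The scorer needs full profiles, not bare id/distance rows.
      const score = computeMatchScore({
        viewer: user.profile,
        candidate: cand.profile,
        semanticDistance: cand.distance
      });

      if (score >= 0.45) {
        matchQueue.push({ viewerId: user.id, candidateId: cand.id, score });
      }
    }
  }

  await db.batchInsert('match_suggestions', matchQueue);
}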

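Architecture rationale: The batch runs on a fixed schedule (hourly or daily depending on scale). It queries only recently active users to reduce compute. Each user's top 100 candidates are scored deterministically. Results are bulk-inserted into a suggestions table. This pattern eliminates real-time API pressure, simplifies caching, and makes failure recovery straightforward: re-run the batch for the affected window. Re-runs are only safe if the insert is idempotent, because match_suggestions enforces UNIQUE(viewer_id, candidate_id); an upsert such as ON CONFLICT (viewer_id, candidate_id) DO UPDATE avoids duplicate-key failures.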
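
Before the bulk insert, each viewer's scored candidates can be reduced to a small, deterministic batch. The sketch below applies the 0.45 threshold, sorts by score, and returns at most five results without padding. The function name selectTopMatches, the tie-break on candidate id, and the default batch size of five (inferred from the "fewer than five" remark in Phase 3) are assumptions.

```typescript
interface ScoredCandidate {
  candidateId: string;
  score: number;
}

function selectTopMatches(
  scored: ScoredCandidate[],
  threshold: number = 0.45,
  maxResults: number = 5
): ScoredCandidate[] {
  return scored
    // filter() returns a new array, so the caller's input is not mutated by sort().
    .filter((c) => c.score >= threshold)
    .sort((a, b) => {
      if (b.score !== a.score) return b.score - a.score; // highest score first
      // Tie-break on id so repeated runs produce the same order.
      return a.candidateId < b.candidateId ? -1 : a.candidateId > b.candidateId ? 1 : 0;
    })
    // No padding: fewer qualifying candidates simply means a shorter batch.
    .slice(0, maxResults);
}
```

The stable tie-break matters for debugging: two runs over the same data yield identical suggestion lists, so any diff between runs points to a data change rather than sort order.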

Pitfall Guide

1. Ignoring Vector Normalization

Explanation: Embedding models can output vectors of varying magnitude. Cosine distance is unaffected by magnitude, but inner product (<#>) and Euclidean distance (<->) are not, so rankings skew if you switch to one of those operators while storing unnormalized vectors. Fix: Keep pgvector's cosine distance operator (<=>), which handles normalization internally, or explicitly normalize vectors before storage if you use inner product. Verify with a unit test that ||v|| ≈ 1.0.
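
The unit-length check mentioned in the fix can be sketched as follows. The helper names l2Norm, normalize, and isUnitLength, and the default tolerance, are assumptions for illustration.

```typescript
// Euclidean (L2) length of a vector.
function l2Norm(v: number[]): number {
  return Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
}

// Scales a vector to unit length; a zero vector has no direction to preserve.
function normalize(v: number[]): number[] {
  const norm = l2Norm(v);
  if (norm === 0) {
    throw new Error('Cannot normalize a zero vector');
  }
  return v.map((x) => x / norm);
}

// Suitable as the assertion inside a unit test over stored embeddings.
function isUnitLength(v: number[], tolerance: number = 1e-6): boolean {
  return Math.abs(l2Norm(v) - 1) <= tolerance;
}
```

Running isUnitLength over a sample of stored rows is a cheap way to confirm whether an inner-product index would be safe to adopt later.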

2. The Cold Start Embedding Problem

Explanation: New users with incomplete profiles generate weak or noisy vectors, causing them to be ranked poorly or excluded entirely. Fix: Implement a fallback embedding strategy. If token count falls below a threshold (e.g., 200), inject a synthetic baseline vector derived from demographic aggregates, or delay matching until the user completes a minimum prompt set.
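
A minimal sketch of the "delay matching" branch of this fix: estimate the token count of a profile's text and hold the user out of the batch until it clears the threshold. The four-characters-per-token ratio is a rough heuristic for English text, not a real tokenizer, and the names estimateTokens and hasEnoughSignal are assumptions.

```typescript
interface ProfileText {
  responses: Record<string, string>;
  voiceTranscript: string;
}

// Rough heuristic only: about 4 characters per token for English text.
// Swap in a real tokenizer if the threshold needs to be exact.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function hasEnoughSignal(profile: ProfileText, minTokens: number = 200): boolean {
  // Mirrors how the embedding service assembles its input text.
  const text = [...Object.values(profile.responses), profile.voiceTranscript]
    .filter(Boolean)
    .join('\n\n');
  return estimateTokens(text) >= minTokens;
}
```

Users who fail the check can be skipped by the batch job and prompted to answer more questions, which avoids storing a noisy vector in the first place.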

3. Over-Indexing on Geographic Proximity

Explanation: Weighting regional compatibility too heavily produces matches that are logistically convenient but semantically misaligned. Users often overstate distance preferences during onboarding. Fix: Cap regional weight at ≤10%. Use it as a soft filter rather than a primary ranking signal. Consider time-zone alignment instead of strict geographic distance for remote-first products.
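
For the time-zone alternative, one possible soft signal is the circular difference between two UTC offsets, mapped to [0, 1]. This is a hypothetical replacement for computeRegionalCompatibility, not something the original scorer contains; the function name and the linear mapping are assumptions.

```typescript
// Inputs are UTC offsets in hours (for example -5 or +5.5).
// Returns 1.0 for identical wall-clock time and 0.0 for a 12-hour gap.
function timezoneAlignment(offsetHoursA: number, offsetHoursB: number): number {
  // Offsets span -12 to +14, so raw differences can exceed 24 hours.
  const diff = Math.abs(offsetHoursA - offsetHoursB) % 24;
  // Wall clocks wrap around: a 23-hour offset gap is only 1 hour apart on the clock.
  const circular = Math.min(diff, 24 - diff);
  return 1 - circular / 12;
}
```

Kept at the same weight cap of 10% or less, this term rewards overlapping waking hours without pretending to measure physical distance.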

4. Threshold Hardcoding Without Calibration

Explanation: Setting a fixed 0.45 threshold without historical data leads to either empty batches or low-quality matches. Thresholds must adapt to dataset density. Fix: Implement dynamic thresholding based on percentile ranking. If the top 100 candidates cluster tightly, lower the threshold. If scores are sparse, raise it. Log threshold adjustments for auditability.
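
One way to sketch percentile-based calibration: take a chosen percentile of historical scores (nearest-rank method) and clamp it between a floor and a ceiling so a single unusual batch cannot move the threshold too far. The function name calibrateThreshold and the default percentile, floor, and ceiling values are illustrative assumptions, not calibrated recommendations.

```typescript
const DEFAULT_THRESHOLD = 0.45; // the fixed value used elsewhere in this article

function calibrateThreshold(
  historicalScores: number[],
  percentile: number = 0.75,
  floor: number = 0.3,
  ceiling: number = 0.7
): number {
  // With no history, fall back to the fixed threshold.
  if (historicalScores.length === 0) return DEFAULT_THRESHOLD;

  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...historicalScores].sort((a, b) => a - b);

  // Nearest-rank percentile: the smallest value with at least
  // `percentile` of the data at or below it.
  const rank = Math.ceil(percentile * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  const value = sorted[index] ?? DEFAULT_THRESHOLD;

  // Clamp so the threshold stays within reviewed bounds.
  return Math.min(ceiling, Math.max(floor, value));
}
```

Logging the inputs and the returned value on every batch keeps the adjustment auditable, in line with the fix above.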

5. HNSW Index Degradation at Scale

Explanation: HNSW indexes perform well up to ~500k rows but can degrade in recall and latency beyond that without tuning. Default parameters assume uniform data distribution. Fix: Adjust m (max connections per layer) and ef_construction based on your dataset, and raise hnsw.ef_search at query time if recall drops. For 1M+ rows, increase m to 32–48 and ef_construction to 200+. After bulk inserts, rebuild the index during low-traffic windows, preferably with REINDEX INDEX CONCURRENTLY so writes are not blocked.

6. LLM Leakage in Scoring Functions

Explanation: Teams occasionally inject LLM calls into the reranking step to "add nuance." This breaks determinism, inflates costs, and makes score attribution impossible. Fix: Enforce a strict boundary: LLMs generate embeddings or draft explanations, but never compute scores. If you need richer features, extract them via deterministic NLP (keyword density, sentiment polarity, response length variance) before the scorer.
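
Two of the deterministic features named in the fix can be sketched in a few lines. The tokenizer below is a simple ASCII, English-oriented assumption, and the function names keywordDensity and responseLengthVariance are introduced here for illustration only.

```typescript
// Lowercases and splits on anything that is not a letter, digit, or apostrophe.
// ASCII-only by design; multilingual profiles would need a different tokenizer.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

// Share of tokens that appear in the keyword list, in [0, 1].
function keywordDensity(text: string, keywords: string[]): number {
  const tokens = tokenize(text);
  if (tokens.length === 0) return 0;
  const wanted = new Set(keywords.map((k) => k.toLowerCase()));
  const hits = tokens.filter((t) => wanted.has(t)).length;
  return hits / tokens.length;
}

// Population variance of response lengths, measured in tokens.
function responseLengthVariance(responses: Record<string, string>): number {
  const lengths = Object.values(responses).map((r) => tokenize(r).length);
  if (lengths.length === 0) return 0;
  const mean = lengths.reduce((sum, n) => sum + n, 0) / lengths.length;
  return lengths.reduce((sum, n) => sum + (n - mean) ** 2, 0) / lengths.length;
}
```

Both functions return the same value for the same input on every run, so they can feed the linear scorer (after normalization to [0, 1]) without weakening score attribution.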

7. Skipping Intent/Constraint Filtering Early

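Explanation: Leaving intent and block checks to application code after the ANN search wastes compute and pollutes the candidate set with incompatible profiles. Fix: Apply hard filters in the SQL query itself, using array operators, boolean flags, or join tables to exclude mismatched intents and blocked users. Be aware that with an approximate index pgvector applies WHERE conditions after the index scan, so the filters trim the result set rather than the search itself; for highly selective filters, use a partial index or partitioning, or enable iterative index scans (pgvector 0.8.0+). Filtering reduces the candidate pool by 60–80% in typical deployments.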

Production Bundle

Action Checklist

Define strict TypeScript interfaces that exclude media, income, and explicit tags from the matching scope
Configure pgvector extension and create an HNSW index with tuned m and ef_construction parameters
Implement embedding generation with explicit token limits and fallback strategies for incomplete profiles
Build candidate retrieval query with early filtering (intent, blocks, completion status)
Develop linear scorer with hardcoded weights and comprehensive unit tests for edge cases
Set up batch orchestration with idempotent execution and failure retry logic
Implement dynamic threshold calibration based on historical score distributions
Add observability: log score breakdowns, threshold adjustments, and candidate set sizes per batch

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
<100k active users	pgvector + linear scorer	Sufficient performance, zero external dependencies	Low (Postgres compute only)
100k–500k users	pgvector + HNSW tuning + batch processing	Maintains sub-50ms latency with proper index config	Low-Medium (index maintenance overhead)
>500k users or real-time requirement	External vector DB (Weaviate/Pinecone) + streaming scorer	Better recall at scale, horizontal scaling	High (vendor fees, infra complexity)
High compliance/audit needs	Deterministic linear scoring	Full traceability, no black-box behavior	Low (engineering time for feature engineering)
Rapid experimentation phase	LLM-assisted feature extraction + linear scoring	Faster iteration without sacrificing score transparency	Medium (LLM token costs for extraction only)

Configuration Template

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- User profiles table
CREATE TABLE user_profiles (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  embedding vector(1536) NOT NULL,
  intents text[] NOT NULL DEFAULT '{}',
  profile_completed_at TIMESTAMPTZ,
  last_active_at TIMESTAMPTZ DEFAULT NOW(),
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index for cosine distance
CREATE INDEX ON user_profiles USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 100);

-- Match suggestions table
CREATE TABLE match_suggestions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  viewer_id UUID REFERENCES user_profiles(id),
  candidate_id UUID REFERENCES user_profiles(id),
  score NUMERIC(4,3) NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(viewer_id, candidate_id)
);

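-- Index for batch queries
CREATE INDEX idx_profiles_active_completed ON user_profiles(last_active_at, profile_completed_at);

-- Block list referenced by the candidate search query
CREATE TABLE user_blocks (
  blocker_id UUID NOT NULL REFERENCES user_profiles(id),
  blocked_user_id UUID NOT NULL REFERENCES user_profiles(id),
  created_at TIMESTAMPTZ DEFAULT NOW(),
  PRIMARY KEY (blocker_id, blocked_user_id)
);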

Quick Start Guide

1. Initialize the database: Run the configuration template SQL against a PostgreSQL instance with pgvector installed. Verify the extension loads with SELECT extversion FROM pg_extension WHERE extname = 'vector';.
2. Seed test profiles: Insert 50–100 synthetic profiles with varied intents, response lengths, and regional tags. Generate embeddings using the provided TypeScript service and populate the embedding column.
3. Validate candidate retrieval: Execute the candidate search query with a test vector. Confirm results respect intent overlap, block lists, and completion gates. Check query execution time with EXPLAIN ANALYZE.
4. Run the scorer: Feed the top 100 candidates into the linear scoring function. Verify weight distribution, threshold filtering, and score normalization. Log the breakdown for each candidate.
5. Schedule the batch: Deploy the batch job to a cron scheduler or serverless function. Set execution frequency based on user activity patterns. Monitor logs for threshold drift, empty batches, or index degradation.

This architecture delivers production-grade matching without the operational overhead of dedicated ML infrastructure. By separating semantic discovery from transparent ranking, you retain full control over match quality, simplify debugging, and maintain a clear migration path toward fitted weights or hybrid scoring models as your dataset matures.
