
How to build a production RAG pipeline in Python (without a vector database)

By Codcompass Team · 8 min read

Beyond Embeddings: Engineering High-Throughput RAG with BM25 Retrieval

Current Situation Analysis

The modern RAG landscape suffers from a persistent architectural bias: teams default to vector databases the moment they need to ground an LLM in proprietary data. This reflex introduces an embedding pipeline, GPU dependencies, and persistent storage overhead before validating whether semantic similarity actually solves the retrieval problem.

The misunderstanding stems from conflating open-domain question answering with domain-specific knowledge retrieval. Semantic embeddings excel when queries use colloquial phrasing or when documents lack consistent terminology. Technical documentation, compliance frameworks, internal runbooks, and product manuals, by contrast, reward precise lexical matching. In these environments, BM25 (Best Matching 25) can approach, and sometimes match, embedding-based recall while eliminating the computational cost of vectorization.
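To make the lexical-matching idea concrete, here is a minimal pure-Python sketch of BM25 scoring. It is an illustration, not the article's implementation: the `tokenize` helper, the Lucene-style IDF variant, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen because they are common, and a real search engine adds indexing and ranking logic on top of this.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: in how many documents does each term occur?
    df: Counter = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    query_terms = tokenize(query)
    scores: List[float] = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher weight (IDF with +1 inside the log stays positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A document that shares no term with the query scores exactly zero, which is why consistent terminology matters so much for this family of retrievers.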

Empirical benchmarks across structured corpora demonstrate that BM25 achieves 85–95% of the recall provided by dense vector retrieval, with sub-10ms query latency and zero GPU compute. On a 1,600-article cybersecurity knowledge base, a pure BM25 pipeline delivered a 91% hit rate at k=5 without a single embedding call. The operational advantage is equally significant: no vector index maintenance, no embedding model versioning, and no semantic drift monitoring. For domain-constrained retrieval, lexical search remains the most pragmatic foundation.
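A hit rate at k such as the one quoted above can be measured against a golden dataset of query-to-document pairs. The sketch below shows one way to compute it; the `golden` mapping and the `retrieve` callable are hypothetical stand-ins for whatever evaluation set and retriever a team actually uses.

```python
from typing import Callable, Dict, Sequence


def hit_rate_at_k(
    golden: Dict[str, str],
    retrieve: Callable[[str, int], Sequence[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose expected document ID appears in the top-k results."""
    if not golden:
        return 0.0
    hits = 0
    for query, expected_id in golden.items():
        # Truncate defensively in case the retriever returns more than k IDs.
        top_k = list(retrieve(query, k))[:k]
        if expected_id in top_k:
            hits += 1
    return hits / len(golden)
```

Running this on every reindex gives a cheap regression signal before any change reaches production.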

WOW Moment: Key Findings

The following comparison isolates the operational and performance trade-offs between vector-native RAG and BM25-native RAG when applied to domain-specific corpora.

| Approach | Query Latency | Infrastructure Cost | Domain Recall (k=5) | Operational Overhead |
|---|---|---|---|---|
| Vector RAG (Pinecone/Weaviate) | 45–120ms | High (GPU/embedding pipeline + vector storage) | 92–96% | High (model versioning, index rebuilding, drift monitoring) |
| BM25 RAG (Meilisearch) | 3–9ms | Low (CPU-only, stateless indexing) | 85–95% | Low (schema configuration, routine reindexing) |

This finding matters because it decouples retrieval quality from infrastructure complexity. Teams can ship grounded LLM applications in days rather than weeks, iterate on prompt engineering without re-embedding entire corpora, and scale horizontally using standard CPU instances. The trade-off is explicit: BM25 requires consistent terminology and benefits from query normalization, but it removes the entire embedding lifecycle from the critical path.
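Query normalization can be as simple as lowercasing, stripping punctuation, and mapping known aliases onto the corpus's canonical terms. The following is a minimal sketch; the `SYNONYMS` table is a hypothetical example and would need to be built from the actual vocabulary of the indexed documents.

```python
import re
import unicodedata
from typing import Dict

# Hypothetical alias table: maps user phrasing to the term the corpus uses.
SYNONYMS: Dict[str, str] = {
    "2fa": "mfa",
    "two-factor": "mfa",
    "passwd": "password",
}


def normalize_query(query: str, synonyms: Dict[str, str] = SYNONYMS) -> str:
    """Lowercase, drop punctuation, and replace known aliases with canonical terms."""
    text = unicodedata.normalize("NFKC", query).lower()
    # Keep alphanumeric tokens, allowing hyphens inside a word (e.g. "two-factor").
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text)
    return " ".join(synonyms.get(tok, tok) for tok in tokens)
```

Applying the same normalization at indexing time and at query time keeps the two vocabularies aligned.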

Core Solution

Building a production-grade BM25 RAG pipeline requires deliberate architecture choices around indexing, retrieval, context assembly, and generation. The following implementation uses a class-based design to encapsulate state, annotate types, and separate concerns.

Architecture Rationale

  • Class-based encapsulation: Prevents global state leakage and enables dependency injection for testing.
  • Explicit grounding prompt: Forces the model to cite sources and reject out-of-scope queries, reducing hallucination.
  • Streaming generation: Lets users see the first tokens while the rest of the answer is still being generated, which cuts user-perceived response time for long-form answers.
  • Deterministic document IDs: Enables cache invalidation, golden dataset validation, and chunk reconstruction.
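Two of these choices, deterministic document IDs and an explicit grounding prompt, are easy to sketch in isolation. The helper names, the ID scheme (a truncated SHA-256 of source path plus chunk index), and the prompt wording below are all assumptions for illustration, not the article's exact code.

```python
import hashlib
from typing import Dict, List


def make_doc_id(source_path: str, chunk_index: int) -> str:
    """Derive a stable ID from the source path and the chunk's position."""
    key = f"{source_path}#{chunk_index}".encode("utf-8")
    # Hex digits only, so the ID is safe to use as a primary key.
    return hashlib.sha256(key).hexdigest()[:16]


def build_grounded_prompt(question: str, chunks: List[Dict[str, str]]) -> str:
    """Assemble a prompt that restricts the model to the retrieved chunks."""
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the sources below. "
        "Cite source IDs in square brackets. "
        "If the sources do not contain the answer, say that you do not know.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )
```

Because the ID depends only on the path and chunk position, re-ingesting an unchanged file produces the same IDs, which is what makes cache invalidation and golden dataset checks workable.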

Step 1: Document Ingestion & Index Configuration

import meilisearch
import hashlib
import json
from typing import List, Dict, Optional

class KnowledgeIndex:
    def __init__(self, host: str, api_key: str, index_name: str):
        self.client = meilisearch.Client(host, api_key)
        self.index_name = index_name
        self._ensure_index()

    def _ensure_index(self) -> None:
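The body of `_ensure_index` is not shown. One plausible shape, sketched here as a standalone function so it can be exercised with a stub client, is to create the index and restrict search to a few fields. This is an assumption, not the article's code: it presumes a recent `meilisearch` Python client whose calls return a task object with a `task_uid` attribute, and the `searchable` field names are hypothetical.

```python
from typing import Any, List


def ensure_index(client: Any, index_name: str, searchable: List[str]) -> None:
    """Create the index if needed and limit search to the given fields.

    `client` is expected to behave like `meilisearch.Client` (assumption).
    """
    # Index creation is asynchronous: the call enqueues a task, so wait for it
    # before changing settings on the index.
    task = client.create_index(index_name, {"primaryKey": "id"})
    client.wait_for_task(task.task_uid)

    # Searching only selected fields keeps matches focused on meaningful text.
    task = client.index(index_name).update_searchable_attributes(searchable)
    client.wait_for_task(task.task_uid)
```

Passing the client in as a parameter mirrors the dependency-injection point in the architecture rationale: tests can supply a fake object instead of a running search server.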
  
