Engineering RAG for Constrained Environments: A Multi-Stage Retrieval Blueprint

Current Situation Analysis

The gap between a functional RAG prototype and a production-grade retrieval system is rarely about model selection. It is almost always about memory management, context window discipline, and deterministic citation tracing. In industrial and compliance-heavy sectors, standard academic RAG pipelines collapse under real-world constraints. A prototype that runs smoothly on a developer workstation with 32GB of RAM will frequently trigger Out-Of-Memory (OOM) exceptions when deployed to constrained environments like a 512MB free-tier container.

This problem is systematically overlooked because most RAG tutorials optimize for retrieval accuracy in isolation. They assume infinite context windows, clean text extraction, and synchronous request handling. In reality, technical documentation contains dense terminology, fragmented tables, and repeated safety notices. When fed into a naive vector pipeline, these documents cause context truncation, hallucination, and redundant token consumption. Furthermore, regulatory frameworks like the EU AI Act impose transparency and traceability obligations on high-risk systems, which in practice pushes safety-critical deployments toward page-level citation tracing. A RAG assistant that cannot attribute its answers to exact sources is hard to defend in European manufacturing environments.

The data confirms that brute-forcing context size is not a viable optimization strategy. Baseline testing on a constrained deployment reveals a sharp trade-off between window size, retrieval accuracy, and latency. Without empirical tuning, teams waste compute on oversized context windows that degrade p95 latency while yielding marginal accuracy gains. Production RAG requires a multi-stage retrieval architecture that balances semantic alignment, keyword precision, memory footprint, and deterministic output generation.

WOW Moment: Key Findings

The most critical bottleneck in constrained RAG deployments is not the embedding model or the vector database. It is the context window configuration (num_ctx) fed to the generation model. Over-provisioning the window consumes RAM and increases latency, while under-provisioning triggers truncation and destroys faithfulness.

Empirical testing across three window configurations reveals a clear inflection point:

Context Window (`num_ctx`)	Faithfulness	Context Recall	p95 Latency	Operational Status
512 (Baseline)	0.583	0.554	~1.9s	⚠️ High context truncation
2048 (Optimal)	0.724	0.712	~3.2s	✅ Low truncation, high accuracy
4096 (Wasteful)	0.731	0.718	~5.9s	❌ Too slow for production

Moving from 512 to 2048 tokens raises faithfulness by 0.141 (from 0.583 to 0.724) and context recall by 0.158 (from 0.554 to 0.712), while p95 latency rises from ~1.9s to ~3.2s. Pushing to 4096 adds only 0.007 faithfulness while p95 latency climbs to ~5.9s, about 1.8x the 2048 figure and roughly triple the 512 baseline. Note that a strict p95 <2.0s target for interactive UIs is met only by the 512 configuration in this table, so 2048 is the optimum only where a ~3s p95 is acceptable for real-time assistance. The takeaway is that precision tuning beats brute-force context expansion. A right-sized window also supports citation generation, as the model receives enough context to ground its response without drowning in noise.
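The selection rule implied by the table can be written down directly. The sketch below hard-codes the three measured rows and picks the smallest window that clears a faithfulness floor. The `SweepResult` type and the `pick_window` helper are illustrative names, not part of the original benchmark harness, and the 0.70 floor is the threshold used later in the checklist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepResult:
    num_ctx: int
    faithfulness: float
    context_recall: float
    p95_latency_s: float


# Measurements copied from the table above.
SWEEP = [
    SweepResult(512, 0.583, 0.554, 1.9),
    SweepResult(2048, 0.724, 0.712, 3.2),
    SweepResult(4096, 0.731, 0.718, 5.9),
]


def pick_window(results, min_faithfulness=0.70):
    """Return the smallest num_ctx whose faithfulness meets the floor, or None."""
    eligible = [r for r in results if r.faithfulness >= min_faithfulness]
    return min(eligible, key=lambda r: r.num_ctx) if eligible else None


best = pick_window(SWEEP)
print("selected num_ctx:", best.num_ctx)

# Marginal gain of each step (absolute points) against its latency cost.
for prev, cur in zip(SWEEP, SWEEP[1:]):
    gain = cur.faithfulness - prev.faithfulness
    slowdown = cur.p95_latency_s / prev.p95_latency_s
    print(f"{prev.num_ctx} -> {cur.num_ctx}: +{gain:.3f} faithfulness, {slowdown:.2f}x p95 latency")
```

Adding a latency ceiling to `pick_window` would be a one-line change; it is left out here because the acceptable p95 depends on the deployment.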

Core Solution

Building a production-ready RAG system for constrained environments requires decoupling retrieval, reranking, deduplication, and generation into distinct, async-compatible stages. The architecture leverages LlamaIndex for orchestration, Qdrant for vector storage, and Mistral-7B for generation. Below is the step-by-step implementation.

1. Query Transformation via HyDE

Technical queries from field technicians rarely match the passive, specification-heavy language of engineering manuals. Hypothetical Document Embeddings (HyDE) solves this by generating a synthetic "ideal" answer before retrieval. This hypothetical text is embedded and used for dense vector search, dramatically improving recall for domain-specific terminology.

# hyde_transformer.py
from llama_index.llms.mistralai import MistralAI
from llama_index.embeddings.fastembed import FastEmbedEmbedding

class HyDETransformer:
    # "open-mistral-7b" is the Mistral API identifier for Mistral 7B; the client reads
    # MISTRAL_API_KEY from the environment. Adjust the name if the model is served elsewhere.
    def __init__(self, llm_model: str = "open-mistral-7b", embed_model: str = "BAAI/bge-small-en-v1.5"):
        self.llm = MistralAI(model=llm_model)
        self.embedder = FastEmbedEmbedding(model_name=embed_model)

    async def generate_hypothetical(self, user_query: str) -> str:
        """Ask the LLM for a manual-style answer; it is used only as a retrieval probe."""
        prompt = f"Given the technical query: '{user_query}', generate a concise, specification-style answer that would appear in an engineering manual."
        response = await self.llm.acomplete(prompt)
        return response.text.strip()

    async def embed_query(self, text: str) -> list[float]:
        """Embed the hypothetical answer (not the raw query) for dense search."""
        return await self.embedder.aget_text_embedding(text)

Rationale: HyDE shifts the embedding space from user phrasing to technical syntax. Using FastEmbed instead of PyTorch-based embedding libraries reduces memory overhead by ~60%, critical for 512MB environments.
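HyDE adds an LLM call in front of every retrieval, so a slow or empty generation needs a defined outcome. The sketch below is one possible way to wire the two steps together: it embeds the hypothetical answer when one arrives in time and falls back to embedding the raw query otherwise. The `hyde_embedding` helper, the 2-second timeout, and the stub functions are assumptions for illustration; the stubs stand in for `HyDETransformer.generate_hypothetical` and `HyDETransformer.embed_query`, and the sample answer text is made up.

```python
import asyncio
from typing import Awaitable, Callable


async def hyde_embedding(
    query: str,
    generate: Callable[[str], Awaitable[str]],
    embed: Callable[[str], Awaitable[list[float]]],
    timeout_s: float = 2.0,
) -> tuple[list[float], bool]:
    """Embed a hypothetical answer for `query`, or the raw query as a fallback.

    Returns (embedding, used_hyde).
    """
    try:
        hypothetical = await asyncio.wait_for(generate(query), timeout=timeout_s)
    except asyncio.TimeoutError:
        hypothetical = ""
    if hypothetical.strip():
        return await embed(hypothetical), True
    return await embed(query), False


async def _demo() -> None:
    # Stub generator: returns a made-up, manual-style sentence.
    async def fake_generate(q: str) -> str:
        return "Flange bolts are tightened in a star pattern to the specified torque."

    # Stub embedder: a one-dimensional "vector" that only encodes text length.
    async def fake_embed(text: str) -> list[float]:
        return [float(len(text))]

    vector, used_hyde = await hyde_embedding("how tight flange bolts?", fake_generate, fake_embed)
    print(used_hyde, vector)


asyncio.run(_demo())
```

With the real class, the call would look like `await hyde_embedding(q, transformer.generate_hypothetical, transformer.embed_query)`.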

2. Hybrid Retrieval with Reciprocal Rank Fusion (RRF)

Dense vectors excel at conceptual matching but fail on exact part numbers, tolerances, or model codes. BM25 captures lexical precision but misses semantic intent. RRF fuses both without requiring retraining.

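# hybrid_retriever.py
from qdrant_client import AsyncQdrantClient
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.schema import BaseNode, NodeWithScore, TextNode

class HybridRRFRetriever:
    def __init__(self, qdrant_host: str, collection_name: str, bm25_nodes: list[TextNode]):
        # Async client so the dense search does not block the event loop
        self.qdrant = AsyncQdrantClient(host=qdrant_host)
        self.collection = collection_name
        # BM25Retriever indexes nodes, not raw strings; each node needs metadata["doc_id"]
        self.bm25 = BM25Retriever.from_defaults(nodes=bm25_nodes, similarity_top_k=10)

    async def retrieve(self, query_embedding: list[float], query_text: str) -> list[NodeWithScore]:
        # Note: newer qdrant-client releases deprecate search() in favor of query_points()
        dense_results = await self.qdrant.search(
            collection_name=self.collection,
            query_vector=query_embedding,
            limit=10
        )
        sparse_results = await self.bm25.aretrieve(query_text)

        return self._apply_rrf(dense_results, sparse_results)

    def _apply_rrf(self, dense: list, sparse: list, k: int = 60) -> list[NodeWithScore]:
        scores: dict[str, float] = {}
        nodes: dict[str, BaseNode] = {}
        for rank, item in enumerate(dense, start=1):
            payload = item.payload or {}
            doc_id = payload.get("doc_id")
            if doc_id is None:
                continue
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
            # Assumption: the chunk text is stored under the "text" payload key
            nodes.setdefault(doc_id, TextNode(text=payload.get("text", ""), metadata={"doc_id": doc_id}))
        for rank, item in enumerate(sparse, start=1):
            doc_id = item.node.metadata.get("doc_id")
            if doc_id is None:
                continue
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
            nodes.setdefault(doc_id, item.node)

        fused = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:10]
        # Return real nodes carrying the fused score, not bare doc_id strings
        return [NodeWithScore(node=nodes[doc_id], score=score) for doc_id, score in fused]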

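Rationale: RRF scores each document as the sum of 1/(k + rank) over the retrievers that returned it, so it needs only rank positions and no score calibration between BM25 and cosine similarity. The constant k=60 flattens the gap between adjacent ranks, which keeps a single top-ranked hit from either retriever from dominating; documents that both retrievers rank well rise to the top.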
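A small worked example makes the fusion arithmetic concrete. The sketch below applies the same formula to two ranked lists of chunk IDs; `rrf_fuse` and the IDs themselves are hypothetical and exist only for this illustration.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical chunk IDs: one list from dense search, one from BM25.
dense = ["torque-overview", "flange-spec", "safety-notice"]
sparse = ["part-MX500", "flange-spec", "torque-overview"]

for doc_id, score in rrf_fuse([dense, sparse]):
    print(f"{doc_id:16s} {score:.5f}")
```

The two chunks that appear in both lists score roughly 0.032 each (for example 1/61 + 1/63), about double the 1/61 earned by the exact-match hit that only BM25 found, which is the behavior the rationale describes.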

3. Cross-Encoder Reranking & Deduplication

Retrieving 10 chunks introduces noise. A cross-encoder evaluates query-chunk pairs jointly, scoring precise relevance. Post-reranking, SHA-256 hashing removes duplicate safety notices or repeated tables.

# reranker_deduplicator.py
import asyncio
import hashlib

from llama_index.core.schema import NodeWithScore
from sentence_transformers import CrossEncoder

class ContextRefiner:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.reranker = CrossEncoder(model_name)

    async def refine(self, query: str, chunks: list[NodeWithScore]) -> list[NodeWithScore]:
        if not chunks:
            return []
        pairs = [(query, chunk.node.get_content()) for chunk in chunks]
        # predict() is synchronous and CPU-bound; run it in a thread to keep the event loop free
        scores = await asyncio.to_thread(self.reranker.predict, pairs)

        for chunk, score in zip(chunks, scores):
            chunk.score = float(score)

        ranked = sorted(chunks, key=lambda x: x.score, reverse=True)
        # Deduplicate before truncating so repeated notices cannot occupy the top-3 slots
        return self._deduplicate(ranked)[:3]

    def _deduplicate(self, chunks: list[NodeWithScore]) -> list[NodeWithScore]:
        # SHA-256 catches exact duplicates after normalization; it is not a similarity measure
        seen_hashes = set()
        unique = []
        for chunk in chunks:
            normalized = chunk.node.get_content().lower().strip()
            content_hash = hashlib.sha256(normalized.encode()).hexdigest()
            if content_hash not in seen_hashes:
                seen_hashes.add(content_hash)
                unique.append(chunk)
        return unique

Rationale: Cross-encoders outperform bi-encoders for reranking because they compute full self-attention over the query-document pair. Limiting output to 3 chunks preserves context window headroom for the LLM's reasoning tokens.
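Hash-based deduplication only removes chunks whose normalized text is identical, so the normalization step decides what counts as a duplicate. The sketch below is a library-free illustration of that point. It collapses whitespace in addition to lowercasing, which is a slightly stronger normalization than the class above uses; `content_key`, `dedupe_keep_order`, and the sample notices are illustrative.

```python
import hashlib


def content_key(text: str) -> str:
    """SHA-256 of lowercased, whitespace-collapsed text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_keep_order(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized text, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in texts:
        key = content_key(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


chunks = [
    "WARNING: Disconnect power before servicing.",
    "warning:  disconnect power\nbefore servicing.",          # same notice, different layout
    "WARNING: Disconnect power before servicing the unit.",   # near-duplicate wording
]
print(len(dedupe_keep_order(chunks)))  # prints 2
```

The second chunk is dropped because it normalizes to the same string as the first. The third survives because exact hashing cannot detect paraphrases; catching those would need a similarity measure rather than a hash.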

4. Production API Layer (TypeScript)

The Python RAG core is exposed via a high-concurrency TypeScript server. This separation allows independent scaling and leverages the Node.js event loop for I/O-bound operations.

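// api-server.ts
import express, { Request, Response } from 'express';
import { LRUCache } from 'lru-cache';

const app = express();
app.use(express.json());
// Behind the Nginx proxy, trust the first hop so req.ip is the client address
// from X-Forwarded-For rather than the proxy's own IP. Remove this if the
// gateway is ever exposed directly, since the header is then client-controlled.
app.set('trust proxy', 1);

const queryCache = new LRUCache<string, any>({ 
  max: 500, 
  ttl: 3600000, // 1 hour
  allowStale: false 
});

const rateLimitWindow = new Map<string, number[]>();
const RATE_LIMIT = 10;
const WINDOW_MS = 60000;

// Sliding window: keep only timestamps from the last WINDOW_MS and count them.
function checkRateLimit(ip: string): boolean {
  const now = Date.now();
  const timestamps = rateLimitWindow.get(ip) || [];
  const recent = timestamps.filter(t => now - t < WINDOW_MS);
  if (recent.length >= RATE_LIMIT) return false;
  recent.push(now);
  rateLimitWindow.set(ip, recent);
  return true;
}

app.post('/api/query/stream', async (req: Request, res: Response) => {
  const ip = req.ip || req.socket.remoteAddress || '';
  if (!checkRateLimit(ip)) {
    res.status(429).set('Retry-After', '60').json({ error: 'Rate limit exceeded' });
    return;
  }

  const { query, apiKey } = req.body ?? {};
  const expectedKey = process.env.INTERNAL_API_KEY;
  // Fail closed: if the server key is unset, undefined !== undefined would otherwise pass.
  if (!expectedKey || apiKey !== expectedKey) {
    res.status(401).json({ error: 'Invalid credentials' });
    return;
  }
  if (typeof query !== 'string' || query.trim() === '') {
    res.status(400).json({ error: 'Missing query' });
    return;
  }

  // Key on the query alone; the API key is a shared secret and should not live in cache keys.
  const cacheKey = query.trim();
  const cached = queryCache.get(cacheKey);
  if (cached) {
    res.json(cached);
    return;
  }

  try {
    // Proxy to the Python RAG service over HTTP
    const ragResponse = await fetch(`${process.env.RAG_CORE_URL}/v1/retrieve`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ query })
    });
    if (!ragResponse.ok) {
      // Do not cache upstream failures
      res.status(502).json({ error: 'RAG core returned an error' });
      return;
    }

    const data = await ragResponse.json();
    queryCache.set(cacheKey, data);
    res.json(data);
  } catch (err) {
    res.status(502).json({ error: 'RAG core unreachable' });
  }
});

app.listen(3000, () => console.log('API gateway running on port 3000'));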

Rationale: The TypeScript gateway handles rate limiting and caching at the edge, shielding the Python ML runtime from I/O spikes. The LRU cache with a 1-hour TTL intercepts duplicate technician queries, returning results in <10ms without touching the RAG core. Sliding-window rate limiting prevents resource exhaustion without requiring an external Redis dependency. The trade-off is that both the cache and the limiter are per-process state: they are not shared across gateway replicas and are lost on restart.
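Sliding-window logic is easy to get subtly wrong at the window boundary, so it helps to test it against a fake clock. The sketch below restates the gateway's check in Python purely for illustration. `SlidingWindowLimiter`, its `clock` parameter, and the `prune` method are assumptions introduced here and are not part of the gateway code. `prune` covers one thing an in-memory limiter needs in a memory-constrained deployment: entries for idle clients should eventually be dropped.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    def __init__(self, limit: int = 10, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._hits: dict[str, deque] = {}

    def allow(self, client_id: str) -> bool:
        """Record a hit and return True if the client is under the limit."""
        now = self.clock()
        hits = self._hits.setdefault(client_id, deque())
        # Expire timestamps that have left the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True

    def prune(self) -> None:
        """Drop clients with no hits inside the window so the map stays bounded."""
        now = self.clock()
        stale = [cid for cid, hits in self._hits.items()
                 if not hits or now - hits[-1] >= self.window_s]
        for cid in stale:
            del self._hits[cid]


# Drive the limiter with a controllable clock instead of real time.
fake_now = [0.0]
limiter = SlidingWindowLimiter(limit=10, window_s=60.0, clock=lambda: fake_now[0])

results = [limiter.allow("10.0.0.7") for _ in range(11)]
print(results.count(True), results[-1])  # prints: 10 False

fake_now[0] = 60.0                        # the first ten hits have now expired
print(limiter.allow("10.0.0.7"))          # prints: True
```

Calling `prune()` on a timer (for example once per window) keeps the map proportional to the number of recently active clients.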

Pitfall Guide

Pitfall	Explanation	Fix
Fixed-Size Chunking on Structured Docs	Splitting manuals by character count slices tables, bullet lists, and safety warnings in half, destroying semantic integrity.	Use document-aware chunking (e.g., LlamaIndex `MarkdownNodeParser` or `SentenceSplitter` with paragraph boundaries). Preserve table structures as atomic units.
Relying Solely on Dense Vectors	Embeddings struggle with exact part numbers, tolerances, and model codes. Technicians searching for "Model-X-500" get irrelevant conceptual matches.	Implement hybrid retrieval (BM25 + Dense) fused via RRF. Sparse retrieval guarantees lexical precision where dense vectors fail.
Context Window Guesswork	Setting `num_ctx` arbitrarily causes either truncation (hallucination) or memory bloat (OOM). Teams assume larger is always better.	Run empirical `num_ctx` sweeps. Measure faithfulness, recall, and p95 latency. Lock the smallest window that meets accuracy thresholds (typically 2048 for 7B models).
Blocking Async I/O in Pipelines	Synchronous database calls or embedding computations block the event loop, causing request queuing and timeout cascades under load.	Use async clients (`AsyncQdrantClient`, `aiohttp`). Pool connections globally. Offload CPU-heavy reranking to separate worker processes.
Skipping Content Deduplication	Safety notices and reference tables repeat across pages. Duplicate chunks waste context tokens and cause repetitive LLM outputs.	Apply SHA-256 hashing on normalized text post-retrieval. Filter duplicates before the final context is assembled so repeated chunks do not consume context headroom.
Rate Limiting at App Layer Only	Application-level limiters fail under DDoS or misconfigured clients, exhausting database connections before requests reach business logic.	Implement sliding-window limiters at the API gateway. Pair with Nginx `limit_req` for network-layer protection.
Assuming Text Extraction Always Works	Scanned manuals, PDFs with embedded images, or legacy OCR returns empty strings. Pipeline fails silently or returns hallucinated content.	Add a fallback pipeline: detect empty extraction → render page to PNG → run Tesseract OCR → inject extracted text back into chunking stage.
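The first pitfall is the easiest to illustrate without any framework. The sketch below is a library-free stand-in for a document-aware parser, not the LlamaIndex API: it assumes blocks (paragraphs, tables, notices) are separated by blank lines and packs whole blocks into chunks, so a table is never cut in half. `chunk_blocks`, the 60-character budget, and the sample manual text are illustrative.

```python
def chunk_blocks(text: str, max_chars: int = 800) -> list[str]:
    """Pack blank-line-separated blocks into chunks without splitting any block.

    A block longer than max_chars becomes its own oversized chunk rather than
    being cut, which keeps tables and warnings atomic.
    """
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for block in blocks:
        # +2 accounts for the blank line that joins blocks inside one chunk.
        added = len(block) + (2 if current else 0)
        if current and size + added > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(block)
        current.append(block)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Made-up manual excerpt: a paragraph, a small table, and a safety notice.
manual = (
    "Tighten the flange bolts in a star pattern.\n\n"
    "| Bolt | Torque |\n| M8 | 25 Nm |\n| M10 | 45 Nm |\n\n"
    "WARNING: Disconnect power before servicing."
)

for chunk in chunk_blocks(manual, max_chars=60):
    print(repr(chunk))
```

With a 60-character budget, a fixed-size splitter would cut the table mid-row; here the table stays in a single chunk with all three rows.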

Production Bundle

Action Checklist

Audit chunking strategy: Replace fixed-size splitting with document-aware parsers that preserve tables and lists.
Implement hybrid retrieval: Fuse BM25 and dense vectors using RRF with k=60 to balance semantic and lexical matching.
Tune context window empirically: Run num_ctx sweeps (512, 2048, 4096) and lock the smallest window meeting >0.70 faithfulness.
Add cross-encoder reranking: Use ms-marco-MiniLM-L-6-v2 to compress 10 retrieved chunks down to 3 high-signal candidates.
Enforce deduplication: Hash normalized chunk text with SHA-256 and filter duplicates before LLM context assembly.
Decouple API and ML runtime: Expose RAG core via Python, wrap with TypeScript gateway for caching, rate limiting, and streaming.
Configure OCR fallback: Detect empty text extraction and route those pages through Tesseract so scanned content is not silently dropped.
Instrument with RAGAS + MLflow: Track faithfulness, context recall, and latency across every deployment cycle.
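The OCR fallback item reduces to a small decision function once the OCR call itself is injected. The sketch below assumes the caller passes a zero-argument `run_ocr` callable (for example a wrapper that renders the page and calls Tesseract); `extract_with_fallback`, the 20-character threshold, and the returned source labels are assumptions for illustration. Returning an explicit `"empty"` label gives the ingestion job something to log instead of failing silently.

```python
from typing import Callable


def extract_with_fallback(
    page_text: str,
    run_ocr: Callable[[], str],
    min_chars: int = 20,
) -> tuple[str, str]:
    """Return (text, source) where source is 'native', 'ocr', or 'empty'."""
    native = page_text.strip()
    if len(native) >= min_chars:
        return native, "native"
    # Native extraction produced too little text; OCR runs only in this case.
    ocr_text = run_ocr().strip()
    if len(ocr_text) >= min_chars:
        return ocr_text, "ocr"
    return "", "empty"


# Stubbed OCR so the example runs without Tesseract installed.
text, source = extract_with_fallback("", lambda: "Recovered text from the scanned page image.")
print(source, len(text))
```

Because `run_ocr` is only invoked when native extraction falls short, the CPU cost of OCR is paid per problematic page rather than per document.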

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Low-traffic internal tool (<50 req/min)	Dense-only retrieval + 512 `num_ctx`	Simpler stack, lower RAM footprint, acceptable accuracy for non-critical queries	Minimal compute cost, higher hallucination risk
Compliance-heavy industrial assistant	Hybrid RRF + Cross-Encoder + 2048 `num_ctx`	Guarantees lexical precision, deterministic citations, meets EU AI Act transparency	Moderate RAM (2-4GB), higher inference cost, legally compliant
High-concurrency public API (>500 req/min)	Hybrid RRF + LRU-TTL cache + TypeScript gateway	Caching intercepts 60-80% duplicate queries, gateway handles rate limiting, ML core stays stable	Infrastructure cost shifts to CDN/cache layer, ML compute drops significantly
Legacy scanned documentation	OCR fallback pipeline + BM25-heavy weighting	Scanned docs lack native text; BM25 tolerates OCR noise better than dense vectors	Higher CPU usage during ingestion, improved retrieval accuracy on legacy assets

Configuration Template

# docker-compose.prod.yml
version: '3.8'
services:
  rag-core:
    build: ./python-rag
    environment:
      - QDRANT_HOST=qdrant
      - NUM_CTX=2048
      - LOG_LEVEL=INFO
    deploy:
      resources:
        limits:
          memory: 1.5G
    networks:
      - internal

  api-gateway:
    build: ./ts-gateway
    environment:
      - RAG_CORE_URL=http://rag-core:8000
      - INTERNAL_API_KEY=${API_KEY}
    ports:
      - "127.0.0.1:3000:3000"  # loopback only, so external traffic must pass through nginx-proxy
    depends_on:
      - rag-core
    networks:
      - internal
      - external

  qdrant:
    image: qdrant/qdrant:v1.7.3
    volumes:
      - qdrant_data:/qdrant/storage
    networks:
      - internal

  nginx-proxy:
    image: nginx:alpine
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    ports:
      - "80:80"
    depends_on:
      - api-gateway
    networks:
      - external

volumes:
  qdrant_data:

networks:
  internal:
    driver: bridge
  external:
    driver: bridge

# nginx.conf
events { worker_connections 1024; }
http {
  limit_req_zone $binary_remote_addr zone=api:10m rate=10r/m;
  
  server {
    listen 80;
    server_name _;
    
    location /api/ {
      limit_req zone=api burst=5 nodelay;
      limit_req_status 429;  # the default rejection status is 503
      proxy_pass http://api-gateway:3000;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_buffering off;
      proxy_cache off;
      
      add_header X-Frame-Options "DENY";
      add_header X-Content-Type-Options "nosniff";
      add_header Content-Security-Policy "default-src 'self'";
    }
  }
}
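The interaction between `rate=10r/m` and `burst=5 nodelay` is worth checking with numbers, because the bucket refills at one request every 6 seconds and not ten at once. The sketch below is a simplified model of that behavior for a single client, written from the documented semantics of `limit_req`; it is not Nginx's implementation (which tracks state in milliseconds per key), and `simulate_limit_req` is a hypothetical helper.

```python
def simulate_limit_req(arrivals_s: list[float], rate_per_min: float = 10, burst: int = 5) -> list[bool]:
    """Simplified limit_req model with nodelay: True = forwarded, False = rejected."""
    rate_per_s = rate_per_min / 60.0
    excess = 0.0          # requests currently held against the burst allowance
    last = None           # arrival time of the last forwarded request
    decisions: list[bool] = []
    for t in arrivals_s:
        if last is None:
            # The first request always passes and starts the bucket.
            decisions.append(True)
            last = t
            continue
        candidate = max(0.0, excess - rate_per_s * (t - last) + 1.0)
        if candidate > burst:
            # Rejected requests leave the bucket state unchanged.
            decisions.append(False)
        else:
            excess, last = candidate, t
            decisions.append(True)
    return decisions


burst_test = simulate_limit_req([0.0] * 11)
print(burst_test.count(True), "forwarded,", burst_test.count(False), "rejected")

# Twelve seconds later the bucket has drained enough to admit another request.
print(simulate_limit_req([0.0] * 11 + [12.0])[-1])
```

Under this model an instantaneous burst of 11 requests sees 6 forwarded (the first plus the burst of 5) and 5 rejected, so Nginx starts refusing requests before the gateway's own 10-per-minute limiter is reached.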

Quick Start Guide

Initialize the vector store: Run docker compose up qdrant and ingest your technical manuals using the Python chunking script. Verify collection creation via Qdrant dashboard.
Deploy the ML core: Build the Python RAG service (docker compose build rag-core) and start it. Confirm health endpoint returns 200 OK and num_ctx is set to 2048.
Launch the API gateway: Build and start the TypeScript gateway (docker compose up api-gateway). Test /api/query/stream with a sample technical query. Verify LRU cache hits on second request.
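Attach Nginx proxy: Start nginx-proxy and route external traffic through port 80. Validate rate limiting by sending 11 rapid requests. With rate=10r/m and burst=5 nodelay, Nginx forwards roughly the first 6 and rejects the rest, so the 11th should be refused (503 by default, or 429 Too Many Requests when limit_req_status 429 is set).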
Instrument evaluation: Run the RAGAS evaluation suite against an evaluation set of 50+ Q&A pairs. Log metrics to MLflow. Iterate on chunking or reranking thresholds until faithfulness exceeds 0.70.
