Beyond Vector Similarity: Engineering Relationship-Aware Retrieval for Production RAG

Current Situation Analysis

The retrieval-augmented generation landscape has largely settled on vector similarity as the default architecture. Developers chunk documents, embed them, and query a vector database. This approach works adequately for straightforward factual lookup, but it fundamentally misrepresents how knowledge is structured. Documents are not isolated semantic blobs; they are interconnected networks of entities, methods, datasets, and conclusions. When a query requires synthesizing information across multiple sources or tracing a logical chain, vector search breaks down.

This limitation is frequently overlooked because vector databases are operationally simple. They require minimal schema design, scale predictably, and integrate seamlessly with existing embedding pipelines. However, the operational convenience masks a structural deficiency: vector retrieval lacks explicit relationship modeling. It retrieves what looks similar, not what is logically connected. In domains like scientific research, legal analysis, or technical documentation, this gap manifests as fragmented context, excessive token consumption, and degraded reasoning accuracy.

Closing this gap starts with measuring it under controlled conditions. In an evaluation over a 2,000,000-token corpus of scientific papers, naive vector retrieval pulls approximately 640 tokens per query and achieves only an 89% factual pass rate, because the model has to sift through chunks that are semantically adjacent but logically disjointed. Relationship-aware retrieval instead traverses explicit entity connections and supplies multi-hop evidence in roughly half the context (about 316 tokens per query) with a 94% pass rate. The industry's reliance on vector-only pipelines is not a technical necessity; it is an architectural shortcut that trades reasoning depth for deployment speed.

WOW Moment: Key Findings

The benchmark compared three distinct retrieval paradigms across identical query sets: direct LLM generation, vector-based RAG, and graph-structured retrieval. The results reveal a consistent pattern: reducing context size through structural filtering directly improves both cost efficiency and factual accuracy.

Approach	Avg Tokens / Query	Avg Cost / Query	Avg Latency	LLM Judge Pass Rate	BERTScore F1
LLM-only	3,200	$0.0096	6.23 s	82%	0.57
Basic RAG (Vector)	640	$0.0019	2.85 s	89%	0.64
GraphRAG (NetworkX)	316	$0.00095	1.22 s	94%	0.71

Graph-structured retrieval cuts token consumption by 50.6% compared to vector RAG and 90.1% compared to raw LLM generation. Latency drops by 57.2% and 80.4% respectively. Most critically, accuracy improves: the LLM-as-a-Judge pass rate climbs to 94%, and BERTScore F1 reaches 0.71.
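
The percentages above follow directly from the table averages. As a quick check, the arithmetic can be sketched as below; `reductionPct` is an illustrative helper, not part of the benchmark code, and the inputs are copied from the table.

```typescript
// Relative reduction of `value` against `baseline`, in percent, rounded to one decimal place.
function reductionPct(baseline: number, value: number): number {
  return Math.round((1 - value / baseline) * 1000) / 10;
}

// Averages copied from the benchmark table.
const avgTokens = { llmOnly: 3200, vectorRag: 640, graphRag: 316 };
const avgLatencySec = { llmOnly: 6.23, vectorRag: 2.85, graphRag: 1.22 };

const tokenCutVsVector = reductionPct(avgTokens.vectorRag, avgTokens.graphRag);           // 50.6
const tokenCutVsLlmOnly = reductionPct(avgTokens.llmOnly, avgTokens.graphRag);            // 90.1
const latencyCutVsVector = reductionPct(avgLatencySec.vectorRag, avgLatencySec.graphRag); // 57.2
const latencyCutVsLlmOnly = reductionPct(avgLatencySec.llmOnly, avgLatencySec.graphRag);  // 80.4
```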

This finding matters because it challenges the assumption that more context automatically yields better answers. Noise degrades reasoning. By filtering context through explicit relationships, the model receives only logically connected evidence. In this benchmark, the reduction in token volume is not a side effect; it appears to be the mechanism that enables higher precision. For multi-hop workloads like these, production systems can prioritize relationship-aware retrieval without sacrificing speed or budget.

Core Solution

Building a relationship-aware retrieval pipeline requires shifting from chunk-centric to entity-centric architecture. The following implementation demonstrates a production-ready orchestration layer that extracts entities, constructs a traversal graph, and assembles context dynamically.

Step 1: Entity & Relationship Extraction

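Instead of treating documents as flat text, parse them into structured nodes and edges. An LLM extracts entities and their relationships, constrained by a strict schema so that malformed or invented structure is easier to detect and reject. The schema limits hallucination; it does not eliminate it.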

interface EntityNode {
  id: string;
  label: string;
  type: 'method' | 'dataset' | 'concept' | 'result';
  sourceDoc: string;
}

interface GraphEdge {
  source: string;
  target: string;
  relation: string;
  confidence: number;
}

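// Minimal client contract the extractor depends on; adapt it to your LLM SDK.
interface LlmClient {
  generate(prompt: string, options: { temperature: number }): Promise<string>;
}

class KnowledgeExtractor {
  constructor(private readonly llmClient: LlmClient) {}

  async extractFromDocument(docContent: string): Promise<{ nodes: EntityNode[]; edges: GraphEdge[] }> {
    const prompt = `Extract entities and relationships from the following text. Return JSON only.
    Schema: { nodes: [{ id, label, type, sourceDoc }], edges: [{ source, target, relation, confidence }] }
    Text: ${docContent}`;

    // A low temperature keeps the structured output stable across runs.
    const response = await this.llmClient.generate(prompt, { temperature: 0.1 });
    // JSON.parse throws on malformed output; callers should catch and retry or skip the document.
    return JSON.parse(response);
  }
}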
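
`JSON.parse` accepts any syntactically valid JSON, so the parsed object may still violate the schema. One way to enforce it is a hand-written type guard, sketched below under the assumption that invalid nodes and edges should be dropped rather than repaired. `parseExtraction`, `isEntityNode`, and `isGraphEdge` are hypothetical helpers, not part of the original pipeline.

```typescript
type EntityType = 'method' | 'dataset' | 'concept' | 'result';
interface EntityNode { id: string; label: string; type: EntityType; sourceDoc: string; }
interface GraphEdge { source: string; target: string; relation: string; confidence: number; }

const ENTITY_TYPES: readonly string[] = ['method', 'dataset', 'concept', 'result'];

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === 'object' && v !== null;
}

function isEntityNode(v: unknown): v is EntityNode {
  if (!isRecord(v)) return false;
  const type = v.type;
  return typeof v.id === 'string' && typeof v.label === 'string' &&
    typeof v.sourceDoc === 'string' && typeof type === 'string' &&
    ENTITY_TYPES.indexOf(type) !== -1;
}

function isGraphEdge(v: unknown): v is GraphEdge {
  if (!isRecord(v)) return false;
  const confidence = v.confidence;
  return typeof v.source === 'string' && typeof v.target === 'string' &&
    typeof v.relation === 'string' && typeof confidence === 'number' &&
    confidence >= 0 && confidence <= 1;
}

// Hypothetical helper: parse raw LLM output, keep only well-formed nodes and
// edges, and drop edges that reference nodes the model never declared.
function parseExtraction(raw: string): { nodes: EntityNode[]; edges: GraphEdge[] } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { nodes: [], edges: [] };
  }
  if (!isRecord(parsed)) return { nodes: [], edges: [] };

  const rawNodes: unknown[] = Array.isArray(parsed.nodes) ? parsed.nodes : [];
  const rawEdges: unknown[] = Array.isArray(parsed.edges) ? parsed.edges : [];

  const nodes = rawNodes.filter(isEntityNode);
  const knownIds = new Set(nodes.map(n => n.id));
  const edges = rawEdges
    .filter(isGraphEdge)
    .filter(e => knownIds.has(e.source) && knownIds.has(e.target));
  return { nodes, edges };
}
```

Returning an empty result on malformed JSON is a design choice for the sketch; a production pipeline might retry the extraction or log the document for review instead.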

Step 2: Graph Construction & Indexing

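Store the extracted structure in an adjacency list format. For benchmarking and lightweight production, an in-memory graph library suffices. Enterprise deployments can migrate to TigerGraph or Neo4j while keeping the same traversal semantics (depth limit and confidence filter), although the traversal itself would be rewritten in the database's query language.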

class GraphIndex {
  private adjacencyMap: Map<string, GraphEdge[]> = new Map();
  private nodeRegistry: Map<string, EntityNode> = new Map();

  upsertNodes(nodes: EntityNode[]): void {
    nodes.forEach(n => this.nodeRegistry.set(n.id, n));
  }

  upsertEdges(edges: GraphEdge[]): void {
    edges.forEach(e => {
      if (!this.adjacencyMap.has(e.source)) this.adjacencyMap.set(e.source, []);
      this.adjacencyMap.get(e.source)!.push(e);
    });
  }

  // Breadth-first traversal from nodeId. The result includes the seed node itself,
  // followed by every node reachable within maxDepth hops over edges that meet the
  // confidence threshold.
  getNeighbors(nodeId: string, maxDepth: number = 2): EntityNode[] {
    const visited = new Set<string>();
    const queue: Array<{ id: string; depth: number }> = [{ id: nodeId, depth: 0 }];
    const results: EntityNode[] = [];

    while (queue.length > 0) {
      const current = queue.shift()!;
      if (visited.has(current.id) || current.depth > maxDepth) continue;
      visited.add(current.id);

      const node = this.nodeRegistry.get(current.id);
      if (node) results.push(node);

      const neighbors = this.adjacencyMap.get(current.id) || [];
      neighbors
        // Prune edges below 0.75; an edge at exactly 0.75 is kept.
        .filter(n => n.confidence >= 0.75)
        .forEach(n => queue.push({ id: n.target, depth: current.depth + 1 }));
    }
    return results;
  }
}

Step 3: Context Assembly & Generation

Resolve the user's query to a seed entity in the graph, retrieve the nodes connected to it, and format them into a focused context window. This replaces arbitrary top-k vector chunks with evidence selected by explicit relationships.

class ContextAssembler {
  async buildPrompt(query: string, graph: GraphIndex, seedEntityId: string): Promise<string> {
    const connectedEvidence = graph.getNeighbors(seedEntityId, 2);
    const contextBlocks = connectedEvidence
      .map(e => `[${e.type.toUpperCase()}] ${e.label} (${e.sourceDoc})`)
      .join('\n');

    return `You are a technical analyst. Answer the query using ONLY the provided connected evidence.
    Query: ${query}
    Connected Evidence:
    ${contextBlocks}
    Constraints: Do not hallucinate. Cite source documents. If evidence is insufficient, state it explicitly.`;
  }
}
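
`buildPrompt` takes a `seedEntityId`, but the article does not show how that ID is derived from the query. A minimal sketch, assuming simple lexical overlap between the query and node labels is good enough to pick a seed; `resolveSeedEntity` is a hypothetical helper, and a production system might use embedding similarity or an LLM call instead.

```typescript
interface LabeledNode { id: string; label: string; }

// Lowercase and split on non-alphanumerics so "BERT-base" and "bert base" compare equal.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(t => t.length > 0);
}

// Hypothetical helper: return the ID of the node whose label shares the largest
// fraction of its tokens with the query, or null when no label overlaps at all.
// Ties go to the earliest node in the list.
function resolveSeedEntity(query: string, nodes: LabeledNode[]): string | null {
  const queryTokens = new Set(tokenize(query));
  let bestId: string | null = null;
  let bestScore = 0;

  for (const node of nodes) {
    const labelTokens = tokenize(node.label);
    if (labelTokens.length === 0) continue;
    const hits = labelTokens.filter(t => queryTokens.has(t)).length;
    const score = hits / labelTokens.length;
    if (score > bestScore) {
      bestScore = score;
      bestId = node.id;
    }
  }
  return bestId;
}
```

When the helper returns `null`, the caller needs a fallback, for example answering without graph context or routing the query to vector retrieval.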

Architecture Rationale

In-Memory Graph vs. Enterprise DB: NetworkX or TypeScript adjacency maps eliminate Docker complexity, authentication overhead, and resource contention during benchmarking. The traversal semantics (depth limit, confidence filter) stay the same when scaling to TigerGraph; what changes is the persistence layer and the query language the traversal is written in.
Confidence Thresholding: Edges below 0.75 confidence are pruned during traversal. This prevents low-quality LLM extractions from polluting the context window.
Dual Evaluation Pipeline: Accuracy is measured using both an independent LLM judge (meta-llama/Llama-3.1-8B-Instruct) and BERTScore F1. Semantic similarity alone cannot verify factual correctness, so the similarity score is paired with explicit grading.
Token-Aware Corpus Sampling: The 2M token dataset is constructed using tiktoken to ensure consistent context boundaries. Arbitrary document counts create unpredictable latency; token budgets do not.
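
The token-aware sampling idea can be sketched as follows. This is an illustration under assumptions, not the benchmark's actual sampler: `sampleByTokenBudget` is a hypothetical name, the token counter is injected so that a real tokenizer (the article uses tiktoken) can be plugged in, and the whitespace counter is only a stand-in for the example.

```typescript
interface Doc { id: string; text: string; }

// Hypothetical sampler: walk the documents in the given order and keep each one
// that still fits in the remaining token budget. Documents that do not fit are
// skipped, so a smaller document later in the list can still be selected.
function sampleByTokenBudget(
  docs: Doc[],
  targetTokens: number,
  countTokens: (text: string) => number,
): { selected: Doc[]; totalTokens: number } {
  const selected: Doc[] = [];
  let totalTokens = 0;

  for (const doc of docs) {
    const n = countTokens(doc.text);
    if (totalTokens + n > targetTokens) continue;
    selected.push(doc);
    totalTokens += n;
  }
  return { selected, totalTokens };
}

// Stand-in counter for the example only; replace with a real tokenizer.
const whitespaceCount = (text: string): number =>
  text.split(/\s+/).filter(t => t.length > 0).length;

const sample = sampleByTokenBudget(
  [
    { id: 'd1', text: 'graph retrieval works' },
    { id: 'd2', text: 'vector search returns similar chunks' },
    { id: 'd3', text: 'short note' },
  ],
  6,
  whitespaceCount,
);
```

For a random sample, shuffle `docs` before calling the function; the sketch itself is deterministic so that runs stay reproducible for a fixed input order.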

Pitfall Guide

1. Unconstrained Graph Traversal

Explanation: Running breadth-first search without depth limits or edge pruning causes context explosion. The model receives dozens of loosely connected nodes, negating the efficiency gains of graph retrieval. Fix: Implement strict maxDepth parameters (typically 2-3 hops) and filter edges by confidence scores. Add a token budget cap that truncates traversal when the context window approaches 80% capacity.
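
A minimal sketch of a traversal that applies all three limits from the fix: hop depth, edge confidence, and a token budget. The function name, the `TraversalLimits` shape, and the injected `countTokens` callback are assumptions for illustration; the budget would typically be set to about 80% of the model's context window.

```typescript
interface EvidenceNode { id: string; label: string; }
interface WeightedEdge { target: string; confidence: number; }

interface TraversalLimits {
  maxDepth: number;      // maximum number of hops from the seed
  minConfidence: number; // edges below this value are pruned
  tokenBudget: number;   // traversal stops once this many tokens are collected
}

function traverseWithBudget(
  seedId: string,
  nodes: Map<string, EvidenceNode>,
  adjacency: Map<string, WeightedEdge[]>,
  limits: TraversalLimits,
  countTokens: (text: string) => number,
): EvidenceNode[] {
  const visited = new Set<string>([seedId]);
  const queue: Array<{ id: string; depth: number }> = [{ id: seedId, depth: 0 }];
  const results: EvidenceNode[] = [];
  let usedTokens = 0;

  // Index-based loop: the queue grows while we iterate, and nothing is shifted off.
  for (let head = 0; head < queue.length; head++) {
    const current = queue[head]!;
    const node = nodes.get(current.id);
    if (node) {
      const cost = countTokens(node.label);
      if (usedTokens + cost > limits.tokenBudget) break; // budget reached: stop the traversal
      usedTokens += cost;
      results.push(node);
    }
    if (current.depth >= limits.maxDepth) continue; // do not expand beyond the hop limit
    for (const edge of adjacency.get(current.id) ?? []) {
      if (edge.confidence < limits.minConfidence || visited.has(edge.target)) continue;
      visited.add(edge.target);
      queue.push({ id: edge.target, depth: current.depth + 1 });
    }
  }
  return results;
}
```

Because the search is breadth-first, nodes closer to the seed are collected first, so the budget cuts off the most distant evidence.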

2. Siloed Entity Resolution

Explanation: Extracting entities per document without cross-document deduplication creates duplicate nodes for the same concept (e.g., "Transformer Architecture" vs. "Transformer Model"). The graph fragments, breaking multi-hop paths. Fix: Run a post-extraction normalization pass. Map variants to canonical IDs using embedding similarity or LLM-based alias resolution before graph construction.
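
The normalization pass can be sketched as below. Note the substitution: instead of embedding similarity or LLM-based alias resolution, this example uses a much simpler stand-in, a normalized label key plus an explicit alias table, so it only merges variants that differ in case or punctuation or that are listed in the table. `canonicalize` and the alias map are hypothetical.

```typescript
interface RawEntity { id: string; label: string; }
interface RawEdge { source: string; target: string; relation: string; confidence: number; }

// Lowercase and collapse punctuation so "Transformer-Architecture" and
// "transformer architecture" produce the same key.
const normalizeLabel = (label: string): string =>
  label.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();

// Hypothetical pass: the first entity seen for each key becomes the canonical
// node, later duplicates are merged into it, and edges are rewired accordingly.
function canonicalize(
  entities: RawEntity[],
  edges: RawEdge[],
  aliases: Map<string, string>, // normalized variant -> normalized canonical label
): { entities: RawEntity[]; edges: RawEdge[] } {
  const canonicalByKey = new Map<string, RawEntity>();
  const idRemap = new Map<string, string>();

  for (const entity of entities) {
    const normalized = normalizeLabel(entity.label);
    const key = aliases.get(normalized) ?? normalized;
    const existing = canonicalByKey.get(key);
    if (existing) {
      idRemap.set(entity.id, existing.id);
    } else {
      canonicalByKey.set(key, entity);
      idRemap.set(entity.id, entity.id);
    }
  }

  const remappedEdges = edges.map(e => ({
    ...e,
    source: idRemap.get(e.source) ?? e.source,
    target: idRemap.get(e.target) ?? e.target,
  }));
  return { entities: Array.from(canonicalByKey.values()), edges: remappedEdges };
}
```

Run a pass like this after extraction and before edges are inserted into the graph, so that multi-hop paths pass through one node per concept.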

3. Single-Metric Validation

Explanation: Relying solely on BERTScore or LLM-as-a-Judge creates blind spots. BERTScore rewards paraphrasing but misses factual errors. LLM judges can be biased toward verbose answers. Fix: Always pair semantic similarity (BERTScore F1) with explicit factual grading. Track divergence between the two metrics to detect model drift or prompt degradation.
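
Tracking divergence between the two metrics can be as simple as flagging the answers where they disagree. The sketch below assumes per-query records with both scores already computed; `findDivergent` and the `f1Threshold` parameter are hypothetical, and the threshold has to be chosen per corpus.

```typescript
interface EvalRecord { queryId: string; bertScoreF1: number; judgePassed: boolean; }

interface DivergenceReport {
  similarButFailed: string[];    // high F1, judge failed: fluent paraphrase with a factual error
  passedButDissimilar: string[]; // judge passed, low F1: correct answer worded differently
  divergenceRate: number;        // share of records where the metrics disagree
}

function findDivergent(records: EvalRecord[], f1Threshold: number): DivergenceReport {
  const similarButFailed: string[] = [];
  const passedButDissimilar: string[] = [];

  for (const r of records) {
    const similar = r.bertScoreF1 >= f1Threshold;
    if (similar && !r.judgePassed) similarButFailed.push(r.queryId);
    if (!similar && r.judgePassed) passedButDissimilar.push(r.queryId);
  }

  const disagreements = similarButFailed.length + passedButDissimilar.length;
  const divergenceRate = records.length === 0 ? 0 : disagreements / records.length;
  return { similarButFailed, passedButDissimilar, divergenceRate };
}
```

A rising `divergenceRate` between benchmark runs is a cheap signal that the judge prompt or the generator has changed behavior and deserves a manual look.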

4. Static Chunking for Dynamic Queries

Explanation: Pre-chunking documents into fixed token sizes severs logical relationships. A method description might be split across two chunks, breaking the graph edge during extraction. Fix: Use semantic or structural chunking aligned with document headers, paragraphs, or entity boundaries. Re-chunk dynamically if the extraction pipeline detects split relationships.
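
A minimal sketch of structural chunking, assuming the input is Markdown-like text with `#` headings and blank lines between paragraphs. Chunks never cross a heading and never split a paragraph; a paragraph longer than the limit becomes its own oversized chunk. `chunkByStructure` is a hypothetical name, and word counts stand in for token counts to keep the example dependency-free.

```typescript
interface Chunk { heading: string; text: string; }

function chunkByStructure(markdown: string, maxWords: number): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = '';
  let paragraphs: string[] = [];

  const wordCount = (s: string): number => s.split(/\s+/).filter(w => w.length > 0).length;

  // Emit the paragraphs collected for the current section, packing as many
  // whole paragraphs into each chunk as the word limit allows.
  const flush = (): void => {
    let current: string[] = [];
    let size = 0;
    for (const p of paragraphs) {
      const n = wordCount(p);
      if (current.length > 0 && size + n > maxWords) {
        chunks.push({ heading, text: current.join('\n\n') });
        current = [];
        size = 0;
      }
      current.push(p);
      size += n;
    }
    if (current.length > 0) chunks.push({ heading, text: current.join('\n\n') });
    paragraphs = [];
  };

  for (const block of markdown.split(/\n\s*\n/)) {
    const trimmed = block.trim();
    if (trimmed.length === 0) continue;
    const lines = trimmed.split('\n');
    if (/^#{1,6}\s/.test(lines[0])) {
      flush(); // a heading closes the previous section
      heading = lines[0].replace(/^#{1,6}\s+/, '').trim();
      const rest = lines.slice(1).join('\n').trim();
      if (rest.length > 0) paragraphs.push(rest);
    } else {
      paragraphs.push(trimmed);
    }
  }
  flush();
  return chunks;
}
```

Carrying the section heading on every chunk also gives the extraction prompt a hint about what the paragraph describes.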

5. Over-Engineering Infrastructure Early

Explanation: Deploying TigerGraph or Neo4j before validating the retrieval methodology introduces unnecessary operational friction. Authentication, Docker orchestration, and resource scaling distract from core benchmarking. Fix: Prototype with lightweight graph libraries (NetworkX, igraph, or TypeScript adjacency maps). Migrate to enterprise graph databases only after the traversal logic and evaluation metrics are proven stable.

6. Ignoring Context Window Limits

Explanation: GraphRAG reduces context, but poorly assembled prompts can still exceed model limits. Truncation at the API layer silently drops critical evidence. Fix: Implement a context budget calculator that counts tokens before API submission. Prioritize high-confidence edges and truncate low-relevance nodes first. Log truncation events for audit trails.
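
The budget calculator described in the fix might look like the sketch below. It assumes each evidence item carries a confidence score and that the caller supplies a token counter; `fitToBudget` and the result shape are hypothetical, and the log line is only an example format.

```typescript
interface Evidence { text: string; confidence: number; }

interface BudgetResult {
  kept: Evidence[];
  dropped: Evidence[];
  usedTokens: number;
  truncationRatio: number; // share of evidence items that did not fit
}

function fitToBudget(
  evidence: Evidence[],
  maxTokens: number,
  countTokens: (text: string) => number,
): BudgetResult {
  // Highest-confidence evidence first, so low-confidence items are the first to be dropped.
  const ranked = evidence.slice().sort((a, b) => b.confidence - a.confidence);
  const kept: Evidence[] = [];
  const dropped: Evidence[] = [];
  let usedTokens = 0;

  for (const item of ranked) {
    const cost = countTokens(item.text);
    if (usedTokens + cost <= maxTokens) {
      kept.push(item);
      usedTokens += cost;
    } else {
      dropped.push(item);
    }
  }

  const truncationRatio = evidence.length === 0 ? 0 : dropped.length / evidence.length;
  if (dropped.length > 0) {
    // Record every truncation so dropped evidence can be audited later.
    console.warn(`context truncated: dropped ${dropped.length} of ${evidence.length} evidence items`);
  }
  return { kept, dropped, usedTokens, truncationRatio };
}
```

Doing the truncation in application code, before the API call, keeps the decision about what to drop explicit and logged rather than leaving it to a silent cut at the API layer.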

7. Hardcoded Evaluation Prompts

Explanation: LLM-as-a-Judge prompts that lack explicit grading rubrics produce inconsistent pass/fail rates. Vague instructions like "grade the answer" yield subjective outputs. Fix: Structure judge prompts with explicit criteria: factual accuracy, hallucination detection, completeness, and citation verification. Use structured output (JSON) for deterministic parsing.
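
A structured judge prompt and its parser can be sketched as follows. The four criteria come from the fix above; the key names, the prompt wording, and the rule that every criterion must be true to pass are assumptions for illustration, not the benchmark's actual judge prompt.

```typescript
const CRITERIA = ['factual_accuracy', 'no_hallucination', 'completeness', 'citations_verified'] as const;
type Criterion = typeof CRITERIA[number];

function buildJudgePrompt(question: string, reference: string, answer: string): string {
  return [
    'Grade the candidate answer against the reference. Decide true or false for each criterion:',
    '- factual_accuracy: every claim in the answer agrees with the reference.',
    '- no_hallucination: the answer adds no claims that the reference does not support.',
    '- completeness: the answer covers all parts of the question.',
    '- citations_verified: every cited source appears in the reference.',
    `Return JSON only, with exactly these boolean keys: ${CRITERIA.join(', ')}.`,
    `Question: ${question}`,
    `Reference: ${reference}`,
    `Candidate answer: ${answer}`,
  ].join('\n');
}

// Returns null for malformed output so that an unparseable verdict is counted
// as ungraded instead of silently becoming a pass or a fail.
function parseVerdict(raw: string): { passed: boolean; failed: Criterion[] } | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== 'object' || parsed === null) return null;

  const record = parsed as Record<string, unknown>;
  const failed: Criterion[] = [];
  for (const criterion of CRITERIA) {
    const value = record[criterion];
    if (typeof value !== 'boolean') return null;
    if (!value) failed.push(criterion);
  }
  return { passed: failed.length === 0, failed };
}
```

Keeping the per-criterion results, not just the final pass/fail, makes it possible to see which criterion drives a drop in pass rate.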

Production Bundle

Action Checklist

Token-budget the corpus: Use tiktoken or equivalent to cap ingestion at a fixed token count (e.g., 2M) rather than arbitrary document counts.
Implement schema-constrained extraction: Force LLM entity/relationship extraction into strict JSON schemas with confidence scoring.
Enforce traversal limits: Set maxDepth = 2 and prune edges below 0.75 confidence to prevent context pollution.
Deploy dual evaluation: Pair BERTScore F1 with an independent LLM judge (meta-llama/Llama-3.1-8B-Instruct) for factual validation.
Normalize cross-document entities: Run alias resolution and deduplication before graph construction to preserve multi-hop paths.
Monitor context truncation: Log token counts pre-API submission and alert when truncation exceeds 5% of retrieved evidence.
Version graph artifacts: Serialize graph state alongside benchmark runs to enable rollback and reproducibility.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Prototype / Benchmarking	In-memory graph (NetworkX/TS Adjacency)	Zero infrastructure overhead, fast iteration, reproducible	Near-zero compute cost
Multi-hop scientific/legal queries	GraphRAG with depth-limited traversal	Explicit relationships outperform vector similarity for synthesis	~50% lower than vector RAG
High-throughput customer support	Basic RAG (ChromaDB/Pinecone)	Low latency, simple scaling, sufficient for factual lookup	Baseline cost
Enterprise knowledge graph	TigerGraph / Neo4j + GraphRAG	ACID compliance, distributed traversal, role-based access	Higher infra cost, lower per-query token cost
Strict compliance/audit requirements	GraphRAG + BERTScore + LLM Judge	Dual validation provides defensible accuracy metrics	+15% evaluation overhead

Configuration Template

corpus:
  target_tokens: 2000000
  tokenizer: tiktoken_cl100k_base
  sampling_strategy: token_aware_random

extraction:
  model: meta-llama/Llama-3.1-8B-Instruct
  temperature: 0.1
  schema_version: v2
  confidence_threshold: 0.75

graph:
  engine: networkx
  max_traversal_depth: 2
  edge_pruning: true
  serialization_format: json_gz

evaluation:
  judge_model: meta-llama/Llama-3.1-8B-Instruct
  metrics:
    - bertscore_f1
    - llm_pass_rate
  query_count: 40
  categories: [factual, multi_hop, synthesis, entity_based]

deployment:
  frontend: nextjs_vercel
  backend: fastapi_huggingface_spaces
  artifact_storage: local_json
  reproducibility: docker_compose

Quick Start Guide

Initialize the corpus pipeline: Run the token-aware sampler to download and chunk documents until the 2,000,000 token threshold is reached. Verify boundaries using tiktoken.
Extract and build the graph: Execute the schema-constrained extraction against the corpus. Feed the output into the adjacency builder, applying the 0.75 confidence filter.
Configure evaluation: Load the 40 benchmark questions. Set up the LLM judge prompt with explicit factual criteria and initialize BERTScore F1 calculation.
Run the orchestrator: Execute the traversal engine against each query. Capture tokens, latency, cost, and accuracy metrics per run.
Validate and iterate: Compare GraphRAG outputs against vector RAG and LLM-only baselines. Adjust depth limits and confidence thresholds if context pollution or under-retrieval occurs.

GraphRAG Benchmark: A 2 Million Token Comparison of LLM-only, Basic RAG, and GraphRAG