The RAG tool that auto-generates Q&A pairs from your documents

By Codcompass Team·2026-05-22·8 min read

Beyond Naive Chunking: Architecting High-Precision RAG with LLM-Driven QA Extraction

Current Situation Analysis

Traditional Retrieval-Augmented Generation pipelines rely heavily on fixed-size text chunking. Engineers split documents into 500–1000 token blocks, embed them, and store the vectors. At query time, the system retrieves the most semantically similar chunks and feeds them to the LLM. This approach is simple to implement but fundamentally flawed for structured knowledge retrieval. Documents rarely align with user intent boundaries. A single paragraph might contain pricing, return policies, and technical specifications. When a user asks a specific question, naive chunking forces the retrieval engine to guess which fragment holds the answer, often returning partial context or irrelevant noise.
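To make the boundary problem concrete, here is a minimal sketch of fixed-size chunking. It assumes whitespace-separated words as a stand-in for tokens (a real pipeline would use the embedding model's tokenizer), and the function name `chunkByWords` and the sample text are hypothetical.

```typescript
// Fixed-size chunking with optional overlap. Words approximate tokens here;
// this is an illustration of the technique, not FastGPT's splitter.
export function chunkByWords(text: string, chunkSize: number, overlap = 0): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error('chunkSize must be positive and overlap must be in [0, chunkSize)');
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(' '));
    if (start + chunkSize >= words.length) break; // the last window reached the end
  }
  return chunks;
}

// Hypothetical sample: two policies in one short passage.
const samplePolicy =
  'Returns are accepted within 30 days. Refunds are issued to the original payment method.';
const sampleChunks = chunkByWords(samplePolicy, 8, 2);
// sampleChunks[0] ends with "Refunds are": the refund sentence is cut mid-way,
// so a question about refunds may retrieve a chunk that is mostly about returns.
```

The splitter has no notion of where one topic ends and the next begins, which is the alignment gap the rest of this article addresses.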

A point that is easy to overlook: retrieval accuracy often depends as much on query-document alignment as on embedding model quality. If the stored representation matches the expected query format, semantic distance shrinks considerably. FastGPT addresses this by introducing LLM-driven question-answer pair extraction. Instead of embedding raw text, the system parses documents, generates structured Q&A pairs, and embeds only the questions. At runtime, user queries are matched against these pre-formulated questions, which sidesteps much of the semantic fragmentation problem.
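The question-matching idea can be sketched in a few lines. This is an illustration under stated assumptions, not FastGPT's internal code: the `StoredPair` shape, the function names, and the tiny 3-dimensional vectors are all made up for the example, and real embeddings come from an embedding model.

```typescript
interface StoredPair {
  question: string;
  answer: string;
  embedding: number[]; // embedding of the question only
}

// Cosine similarity between two vectors of equal length.
export function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the stored pair whose question embedding is closest to the query.
export function bestMatch(
  queryEmbedding: number[],
  pairs: StoredPair[]
): { pair: StoredPair; score: number } | null {
  let best: { pair: StoredPair; score: number } | null = null;
  for (const pair of pairs) {
    const score = cosineSimilarity(queryEmbedding, pair.embedding);
    if (best === null || score > best.score) best = { pair, score };
  }
  return best;
}

// Toy vectors, hand-written for illustration only.
const toyPairs: StoredPair[] = [
  { question: 'What is the return window?', answer: '30 days', embedding: [1, 0, 0] },
  { question: 'How is shipping charged?', answer: 'Flat rate', embedding: [0, 1, 0] },
];
const toyResult = bestMatch([0.9, 0.1, 0], toyPairs);
```

Because the user's query and the stored item are both questions, the comparison is question-to-question rather than question-to-paragraph, and the answer travels along as a payload.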

This architecture has gained significant traction, evidenced by 27K GitHub stars and widespread adoption in internal knowledge bases. Yet, implementation details remain fragmented. Most English-language documentation focuses on basic setup, ignoring the architectural trade-offs, license constraints, and production hardening required for enterprise deployment. Teams frequently default to naive chunking because it requires zero preprocessing, sacrificing long-term retrieval precision for short-term development speed. Others deploy QA extraction without understanding when it fails, leading to brittle pipelines that break on narrative or highly technical documentation.

WOW Moment: Key Findings

The performance delta between naive chunking and LLM-driven QA extraction becomes stark when measuring retrieval precision against maintenance overhead. The following comparison isolates the core trade-offs across three common RAG preprocessing strategies.

Approach	Retrieval Precision	Pre-processing Latency	Maintenance Overhead	License Flexibility
Naive Chunking	Low-Medium	Near-zero	High (manual threshold tuning)	High (MIT/Apache)
LLM QA Extraction	High	Medium (LLM call per document)	Low (automated structuring)	Restricted (Custom)
Hybrid (Keyword+Vector)	Medium	Low	Medium (dual-index sync)	High (MIT/Apache)

Why this matters: QA extraction shifts compute cost from runtime to ingestion. You pay upfront with LLM inference to structure knowledge, but gain deterministic retrieval at query time. This is critical for customer support bots, compliance documentation, and internal HR/IT knowledge bases where accuracy outweighs raw speed. The license restriction, however, demands careful architectural planning if the platform will ever be exposed to external clients or resold.

Core Solution

Building a production-grade QA extraction pipeline requires three coordinated layers: document ingestion, LLM-driven structuring, and vector-backed retrieval. Below is a step-by-step implementation using TypeScript for the extraction workflow and Docker Compose for infrastructure.

Step 1: Infrastructure Provisioning

FastGPT relies on PostgreSQL with pgvector for semantic search and MongoDB for conversation state. The following configuration isolates services, enforces health checks, and prepares the environment for Ollama integration.

version: '3.8'

services:
  fastgpt-core:
    image: ghcr.io/labring/fastgpt:latest
    ports:
      - "3000:3000"
    environment:
      - MONGO_URI=mongodb://mongo:27017/fastgpt
      - PGVECTOR_URI=postgresql://postgres:secure_pass@pgvector:5432/fastgpt
      - OPENAI_BASE_URL=http://ollama:11434/v1
      - OPENAI_API_KEY=ollama
    depends_on:
      mongo:
        condition: service_healthy
      pgvector:
        condition: service_healthy
    restart: unless-stopped

  pgvector:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: secure_pass
      POSTGRES_DB: fastgpt
    volumes:
      - pg_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5

  mongo:
    image: mongo:7
    environment:
      MONGO_INITDB_DATABASE: fastgpt
    volumes:
      - mongo_data:/data/db
    healthcheck:
      test: ["CMD", "mongosh", "--eval", "db.adminCommand('ping')"]
      interval: 5s
      timeout: 3s
      retries: 5

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama

volumes:
  pg_data:
  mongo_data:
  ollama_models:

Step 2: QA Extraction Pipeline (TypeScript)

Naive chunking embeds raw text. QA extraction requires an LLM to parse documents and output structured pairs. The following module demonstrates how to call an OpenAI-compatible endpoint, enforce JSON schema validation, and prepare embeddings for storage.

import { OpenAI } from 'openai';
import { z } from 'zod';

const qaPairSchema = z.object({
  question: z.string().min(10).max(200),
  answer: z.string().min(5).max(1000),
  category: z.enum(['policy', 'technical', 'billing', 'general']).optional()
});

const qaBatchSchema = z.array(qaPairSchema);

export class KnowledgeExtractor {
  private client: OpenAI;
  private embeddingModel: string;

  constructor(baseURL: string, apiKey: string, embeddingModel: string) {
    this.client = new OpenAI({ baseURL, apiKey });
    this.embeddingModel = embeddingModel;
  }

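  async extractQAFromDocument(rawText: string): Promise<z.infer<typeof qaBatchSchema>> {
    // JSON mode returns a single JSON object, so the pairs are requested under a
    // "pairs" key instead of as a bare top-level array.
    const prompt = `
      Analyze the following document excerpt. Extract 3 to 5 distinct question-answer pairs.
      Respond with a JSON object of the form:
      { "pairs": [{ "question": string, "answer": string, "category": "policy" | "technical" | "billing" | "general" }] }

      Document:
      ${rawText}
    `;

    const response = await this.client.chat.completions.create({
      model: 'llama3',
      messages: [{ role: 'user', content: prompt }],
      response_format: { type: 'json_object' },
      temperature: 0.2
    });

    const rawOutput = response.choices[0]?.message?.content ?? '{"pairs": []}';
    // JSON.parse throws on malformed output and zod throws on schema violations;
    // callers should catch both and retry or skip the document.
    const parsed = JSON.parse(rawOutput);
    // Accept a bare array as well, in case the model ignores the wrapper object.
    return qaBatchSchema.parse(Array.isArray(parsed) ? parsed : parsed?.pairs);
  }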

  async generateEmbeddings(questions: string[]): Promise<number[][]> {
    const embeddings: number[][] = [];
    for (const q of questions) {
      const res = await this.client.embeddings.create({
        model: this.embeddingModel,
        input: q
      });
      embeddings.push(res.data[0].embedding);
    }
    return embeddings;
  }
}
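Once pairs and question embeddings exist, they need to be written to PostgreSQL. The sketch below builds a parameterized INSERT for a pgvector column. The table name `qa_pairs` and its columns are hypothetical; the only pgvector-specific detail relied on is that vectors can be passed as text literals such as `[0.1,0.2,0.3]` and cast with `::vector`.

```typescript
interface QaPair {
  question: string;
  answer: string;
}

interface InsertStatement {
  text: string;
  values: string[];
}

// Serializes an embedding into pgvector's text literal form, e.g. "[0.5,-1,2]".
export function toVectorLiteral(embedding: number[]): string {
  if (embedding.length === 0) throw new Error('embedding must not be empty');
  if (embedding.some((x) => !Number.isFinite(x))) {
    throw new Error('embedding contains a non-finite value');
  }
  return `[${embedding.join(',')}]`;
}

// Builds a parameterized statement; pass text and values to your PostgreSQL client.
export function buildInsert(pair: QaPair, questionEmbedding: number[]): InsertStatement {
  return {
    // qa_pairs(question, answer, embedding) is an assumed schema for illustration.
    text: 'INSERT INTO qa_pairs (question, answer, embedding) VALUES ($1, $2, $3::vector)',
    values: [pair.question, pair.answer, toVectorLiteral(questionEmbedding)],
  };
}
```

Keeping values out of the SQL text avoids injection through LLM-generated questions or answers, which are untrusted input from the database's point of view.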

Step 3: Architecture Rationale

Why PostgreSQL with pgvector? Keeping vectors in PostgreSQL gives you ACID transactions, mature operational tooling, and the option to store embeddings next to relational metadata. pgvector supports HNSW and IVFFlat indexes for approximate nearest-neighbor search, which keeps similarity queries fast as the table grows. The self-hosted MongoDB image used in this stack has no built-in vector index, so it is not used as the retrieval store.

Why MongoDB for conversation state? Chat history, audit logs, and user sessions require flexible schemas and high write throughput. MongoDB's document model aligns naturally with session tracking, while keeping vector operations isolated in PostgreSQL.

Why node-based routing? FastGPT's visual workflow builder decouples intent classification from retrieval strategy. Instead of a monolithic pipeline, you route queries through conditional nodes: intent classification → FAQ lookup → document search → fallback response. This modularity reduces hallucination rates and simplifies debugging.

Pitfall Guide

1. License Misinterpretation

Explanation: FastGPT uses a custom license that explicitly prohibits reselling the platform as a managed SaaS to third parties. Teams often assume "open source" equals commercial freedom. Fix: Verify deployment scope. Internal team usage and backend integration into proprietary products are permitted. If external commercialization is planned, evaluate alternatives such as WeKnora (MIT) or MaxKB (GPLv3, which allows commercial use but carries copyleft obligations), and confirm the current license text before committing.

2. Ollama Endpoint Misconfiguration

Explanation: Ollama exposes an OpenAI-compatible API, but the base path must include /v1. Omitting it or using the root path causes 404 errors during embedding or chat requests. Fix: Always configure OPENAI_BASE_URL=http://ollama:11434/v1. The API key field accepts any non-empty string when running locally, but production deployments should enforce token authentication.

3. Over-Extraction on Unstructured Data

Explanation: QA extraction excels on policy documents, FAQs, and procedural guides. It fails on narrative text, research papers, or highly technical specifications where context spans multiple paragraphs. Fix: Implement a document classifier node before ingestion. Route structured content to QA extraction and unstructured content to hybrid chunking with overlap. Never force QA generation on documents lacking clear Q&A boundaries.
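A classifier node can start as a simple heuristic before you reach for an LLM. The sketch below is one possible approach, not FastGPT's built-in behavior: it counts lines that look like questions, list items, or labeled Q/A/Step entries, and routes the document based on their share. The function name, the route labels, and the 0.3 default ratio are assumptions to tune on your own corpus.

```typescript
type IngestionRoute = 'qa_extraction' | 'hybrid_chunking';

// Routes a document by the share of lines that look structured.
export function classifyDocument(text: string, structuredRatio = 0.3): IngestionRoute {
  const lines = text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (lines.length === 0) return 'hybrid_chunking';

  const structured = lines.filter(
    (line) =>
      line.endsWith('?') || // explicit questions
      /^(\d+[.)]|[-*•])\s/.test(line) || // numbered or bulleted items
      /^(Q|A|Step \d+)\s*[:.]/i.test(line) // "Q:", "A:", "Step 2:" labels
  ).length;

  return structured / lines.length >= structuredRatio ? 'qa_extraction' : 'hybrid_chunking';
}

const faqRoute = classifyDocument(
  'Q: Can I return an item?\nA: Yes, within 30 days.\n1. Open the order page\n2. Click Return'
);
const narrativeRoute = classifyDocument(
  'The team spent the spring rewriting the scheduler.\nIt took longer than anyone had planned.'
);
```

A heuristic like this will misroute some documents, so it is worth logging the computed ratio and spot-checking borderline cases.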

4. Ignoring Vector Index Tuning

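Explanation: Without an index, pgvector performs an exact sequential scan, so query latency grows linearly with the number of stored vectors. The cost becomes noticeable as the knowledge base grows beyond roughly 10K vectors. Fix: Create an HNSW index after initial ingestion: CREATE INDEX ON vectors USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);. These values are pgvector's defaults; raising m or ef_construction generally improves recall at the cost of build time and memory, so adjust them based on memory constraints and query latency requirements.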
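For the index to be used, the query has to order by the same distance operator the index was built for. The sketch below builds such a query against the `vectors` table from the index statement above; the `question` and `answer` columns and the function name are assumptions. `<=>` is pgvector's cosine distance operator, so similarity is computed as one minus the distance.

```typescript
interface SearchStatement {
  text: string;
  values: (string | number)[];
}

// Builds a top-k cosine similarity query for a pgvector column.
export function buildSearchQuery(queryEmbedding: number[], limit = 5): SearchStatement {
  if (queryEmbedding.length === 0) throw new Error('queryEmbedding must not be empty');
  if (!Number.isInteger(limit) || limit <= 0) throw new Error('limit must be a positive integer');
  return {
    // Column names other than "embedding" are hypothetical.
    text:
      'SELECT question, answer, 1 - (embedding <=> $1::vector) AS similarity ' +
      'FROM vectors ORDER BY embedding <=> $1::vector LIMIT $2',
    values: [`[${queryEmbedding.join(',')}]`, limit],
  };
}

const searchStatement = buildSearchQuery([0.1, 0.2], 3);
```

If recall looks low after adding the index, pgvector's session setting `hnsw.ef_search` can be raised (for example `SET hnsw.ef_search = 100;`) to trade latency for accuracy.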

5. Skipping Fallback Routing in Workflows

Explanation: Visual pipelines often chain nodes without error handling. If the QA retrieval node returns zero matches, the pipeline fails silently or returns generic errors. Fix: Always attach a fallback branch. Route low-confidence scores (<0.75 cosine similarity) to a secondary search node, then to a generic LLM response with a disclaimer. Log all fallback triggers for pipeline optimization.
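Outside the visual editor, the same fallback logic can be expressed as a small function. This is a sketch under assumptions: the `Searcher` type, the synchronous signatures, and the static fallback message (standing in for a generic LLM response with a disclaimer) are all hypothetical, and applying the same threshold to the secondary search is a design choice rather than something FastGPT prescribes.

```typescript
interface Hit {
  answer: string;
  score: number; // cosine similarity in [0, 1]
}

type Searcher = (query: string) => Hit | null;

interface RoutedAnswer {
  answer: string;
  source: 'qa' | 'secondary' | 'fallback';
}

// Tries QA retrieval first, then a secondary search, then a generic fallback.
export function routeQuery(
  query: string,
  qaSearch: Searcher,
  secondarySearch: Searcher,
  threshold = 0.75,
  log: (message: string) => void = console.warn
): RoutedAnswer {
  const primary = qaSearch(query);
  if (primary && primary.score >= threshold) {
    return { answer: primary.answer, source: 'qa' };
  }
  // Log every fallback trigger so low-confidence queries can be reviewed later.
  log(`fallback: QA score ${primary?.score ?? 'none'} below ${threshold} for "${query}"`);

  const secondary = secondarySearch(query);
  if (secondary && secondary.score >= threshold) {
    return { answer: secondary.answer, source: 'secondary' };
  }
  log(`fallback: no confident match for "${query}"`);
  return {
    answer: 'I could not find a confident answer in the knowledge base.',
    source: 'fallback',
  };
}
```

The logged messages are the raw material for tuning: recurring queries in the log are candidates for new Q&A pairs.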

6. Prompt Drift in QA Generation

Explanation: LLM-generated Q&A pairs vary in tone, length, and terminology across documents. Inconsistent phrasing reduces retrieval accuracy because semantically similar questions use different vocabulary. Fix: Enforce strict prompt templates with few-shot examples. Add a post-processing step that normalizes terminology using a synonym dictionary or a secondary LLM call focused on standardization.
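The dictionary-based variant of that post-processing step might look like the following. It is a minimal sketch: the function name and the sample dictionary are hypothetical, matching is case-insensitive on whole words, and longer variants are replaced first so multi-word phrases are not partially rewritten.

```typescript
// Rewrites known variants to a canonical term before the question is embedded.
export function normalizeTerminology(text: string, synonyms: Record<string, string>): string {
  // Longest variants first, so "money back guarantee" wins over "money back".
  const entries = Object.entries(synonyms).sort(([a], [b]) => b.length - a.length);
  let result = text;
  for (const [variant, canonical] of entries) {
    // Escape regex metacharacters in the variant before building the pattern.
    const escaped = variant.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    // A replacer function avoids special handling of "$" in the canonical term.
    result = result.replace(new RegExp(`\\b${escaped}\\b`, 'gi'), () => canonical);
  }
  return result;
}

// Hypothetical dictionary for a support knowledge base.
const supportSynonyms = { reimbursement: 'refund', 'money back': 'refund' };
const normalizedQuestion = normalizeTerminology('How do I get a Reimbursement?', supportSynonyms);
```

Applying the same normalization to incoming user queries keeps both sides of the similarity comparison in the same vocabulary.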

7. Production Security Gaps

Explanation: Default credentials (root/1234) and unencrypted HTTP endpoints are common in development. Exposing these in production invites unauthorized access and data exfiltration. Fix: Rotate default credentials immediately. Terminate SSL at a reverse proxy (Nginx/Traefik). Restrict MongoDB and PostgreSQL to internal Docker networks. Implement rate limiting on the /v1 Ollama endpoint.

Production Bundle

Action Checklist

Verify license compliance: confirm internal-only usage or switch to Apache/MIT alternatives
Configure Ollama base URL with /v1 suffix and validate connectivity via curl
Implement document classification before ingestion to route structured vs unstructured content
Create HNSW vector index on PostgreSQL after initial data load
Add fallback routing nodes to all visual workflows with confidence thresholds
Standardize QA generation prompts with few-shot examples and terminology normalization
Rotate default credentials and enforce TLS termination at the reverse proxy
Monitor retrieval confidence scores and log fallback triggers for pipeline tuning

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Internal HR/IT Knowledge Base	LLM QA Extraction	High accuracy on policy documents, low maintenance	Medium (LLM ingestion cost)
Customer Support Bot	LLM QA Extraction + Fallback	Direct question matching reduces hallucination	Medium-High (requires workflow tuning)
Commercial SaaS Product	WeKnora (MIT) or MaxKB (GPLv3)	MIT permits resale and white-labeling; GPLv3 permits commercial use with copyleft obligations	Low (no resale restriction)
Technical Research Archive	Hybrid Chunking + Keyword Search	QA extraction fails on dense, cross-referenced content	Low-Medium (dual-index overhead)
High-Volume Real-Time Chat	Naive Chunking + Aggressive Caching	QA preprocessing adds ingestion latency; indexing speed prioritized	Low (minimal compute)

Configuration Template

# .env.production
MONGO_URI=mongodb://app_user:strong_password@mongo:27017/fastgpt_prod
PGVECTOR_URI=postgresql://app_user:strong_password@pgvector:5432/fastgpt_prod
OPENAI_BASE_URL=http://ollama:11434/v1
OPENAI_API_KEY=prod_ollama_token_12345
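# The embedding model must be one that the endpoint above actually serves;
# text-embedding-3-small is an OpenAI-hosted model, so swap in a locally pulled model when using Ollama.
DEFAULT_EMBEDDING_MODEL=text-embedding-3-small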
QA_EXTRACTION_MODEL=llama3
WORKFLOW_TIMEOUT_MS=5000
VECTOR_SIMILARITY_THRESHOLD=0.75
LOG_LEVEL=info
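Several of the pitfalls above come down to bad configuration values, so it can help to validate the template at startup. The sketch below is one way to do that; the `RagConfig` shape and the function names are hypothetical, and the variable names are taken from the template above rather than from FastGPT's own configuration reference.

```typescript
interface RagConfig {
  baseUrl: string;
  embeddingModel: string;
  similarityThreshold: number;
  timeoutMs: number;
}

type Env = Record<string, string | undefined>;

// Reads a required variable and rejects missing or blank values.
function required(env: Env, key: string): string {
  const value = env[key];
  if (value === undefined || value.trim() === '') throw new Error(`${key} is required`);
  return value.trim();
}

// Validates the settings that most often cause silent failures.
export function loadConfig(env: Env): RagConfig {
  const baseUrl = required(env, 'OPENAI_BASE_URL');
  if (!/\/v1\/?$/.test(baseUrl)) throw new Error('OPENAI_BASE_URL must end with /v1');

  const embeddingModel = required(env, 'DEFAULT_EMBEDDING_MODEL');

  const similarityThreshold = Number(required(env, 'VECTOR_SIMILARITY_THRESHOLD'));
  if (!(similarityThreshold >= 0 && similarityThreshold <= 1)) {
    throw new Error('VECTOR_SIMILARITY_THRESHOLD must be between 0 and 1');
  }

  const timeoutMs = Number(required(env, 'WORKFLOW_TIMEOUT_MS'));
  if (!Number.isInteger(timeoutMs) || timeoutMs <= 0) {
    throw new Error('WORKFLOW_TIMEOUT_MS must be a positive integer');
  }

  return { baseUrl, embeddingModel, similarityThreshold, timeoutMs };
}
```

In a Node.js service this would typically be called once at boot as `loadConfig(process.env)`, so a missing /v1 suffix fails fast instead of surfacing later as 404 errors.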

Quick Start Guide

1. Initialize Infrastructure: Clone the repository, copy .env.example to .env, populate credentials, and run docker compose up -d. Verify services via docker compose ps.
2. Connect LLM Provider: Navigate to localhost:3000, log in with default credentials, and configure the AI model settings. Set provider to OpenAI Compatible, base URL to http://ollama:11434/v1, and model to llama3.
3. Ingest Knowledge Base: Upload documents, select QA Split processing mode, and trigger extraction. Monitor the ingestion queue for JSON validation errors.
4. Build Routing Workflow: Use the visual node editor to create a pipeline: Intent Classifier → QA Retrieval → Confidence Check → Fallback LLM. Set similarity threshold to 0.75.
5. Validate & Iterate: Test with 20+ domain-specific queries. Log low-confidence retrievals, refine prompts, and adjust vector index parameters. Rotate credentials and enable TLS before public exposure.
