Building a Zero-Dependency RAG Assistant for Legacy CMS Platforms

Current Situation Analysis

Legacy content management systems and static sites frequently become operational bottlenecks when customer support volume scales. Front-line staff absorb repetitive inquiries about operating hours, service availability, booking procedures, and policy details. Traditional SaaS chatbot solutions charge $50–$150 monthly, often deliver shallow responses due to generic training data, and require framework migrations or heavy plugin installations that destabilize aging codebases.

Development teams frequently assume that deploying a grounded AI assistant requires a full frontend rewrite, a dedicated backend service, or expensive vector database infrastructure. This perception creates a false barrier to entry. Modern edge computing runtimes, combined with generous free-tier LLM APIs, have fundamentally altered the cost and complexity equation. A full retrieval-augmented generation (RAG) pipeline can now be constructed with zero external dependencies on the client side, deployed to edge networks, and maintained with minimal operational overhead.

The core misunderstanding lies in treating AI integration as a monolithic application rather than a composable data flow. By decoupling content ingestion, vector indexing, query routing, and UI rendering, teams can inject production-grade AI capabilities into any HTML-rendering platform. Cloudflare’s Workers free plan allows roughly 100,000 requests per day, Vectorize’s free tier stores up to 5 million vector dimensions (about 6,500 vectors at 768 dimensions each), and Google’s Gemini free tier covers substantial embedding and generation volumes. For low-to-moderate traffic sites, the entire stack operates within zero-cost boundaries while grounding answers in the site's own content rather than generic training data.

WOW Moment: Key Findings

The following comparison illustrates the operational and economic divergence between traditional chatbot deployment strategies and an edge-native RAG architecture.

Approach	Monthly Cost	Deployment Time	Grounding Accuracy	Maintenance Overhead
SaaS Widget	$50–$150	<1 hour	Low (generic training)	None
Full-Stack AI	$30–$80	2–4 weeks	High	High (framework updates, infra scaling)
Edge-Native RAG	$0–$5	<3 hours	High (site-specific indexing)	Low (scheduled re-indexing)

This finding matters because it decouples AI capability from infrastructure complexity. Teams can achieve enterprise-grade grounding without provisioning dedicated servers, managing container orchestration, or migrating legacy templates. The edge-native approach shifts the operational burden from runtime scaling to scheduled data synchronization, which aligns better with the update cadence of most content sites. It also limits vendor lock-in: the embeddings can be re-upserted into another vector store, and the worker logic relies on standard fetch APIs, so only the Vectorize binding is provider-specific.

Core Solution

The architecture follows a four-stage pipeline: content extraction, vector indexing, edge query routing, and isolated client rendering. Each stage operates independently, enabling parallel development and straightforward debugging.

1. Content Extraction & Semantic Chunking

Raw HTML contains navigation, footers, scripts, and styling that introduce noise into vector embeddings. The ingestion pipeline must strip non-content elements and split text into semantically coherent units. Fixed-character chunking fractures sentences and degrades retrieval quality. A paragraph-aware splitter with configurable overlap preserves contextual boundaries.

// lib/ingest.ts
import * as cheerio from "cheerio";
import { readFileSync, writeFileSync } from "fs";

interface ContentChunk {
  id: string;
  sourceUrl: string;
  text: string;
  metadata: Record<string, string>;
}

export async function extractChunks(seedUrl: string, maxPages: number = 150): Promise<ContentChunk[]> {
  const visited = new Set<string>();
  const queue = [seedUrl];
  const chunks: ContentChunk[] = [];

  while (queue.length > 0 && visited.size < maxPages) {
    const currentUrl = queue.shift()!;
    if (visited.has(currentUrl)) continue;
    visited.add(currentUrl);

    const response = await fetch(currentUrl);
    if (!response.ok) continue;

    const html = await response.text();
    const $ = cheerio.load(html);
    
    $("nav, footer, script, style, .ad-container, .cookie-banner").remove();
    const rawText = $("main, article, .content").text().trim() || $("body").text().trim();
    
    const paragraphs = rawText.split(/\n\s*\n/).filter(p => p.length > 40);
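    paragraphs.forEach((para, idx) => {
      const text = para.replace(/\s+/g, " ").trim();
      chunks.push({
        id: `${currentUrl}#chunk-${idx}`,
        sourceUrl: currentUrl,
        text,
        // The query worker reads both `url` and `text` from metadata, so the chunk text is stored here too.
        metadata: { url: currentUrl, text, section: "main-content" }
      });
    });

    const links = $("a[href]").map((_, el) => $(el).attr("href")).get();
    links.forEach(link => {
      try {
        const target = new URL(link, currentUrl);
        target.hash = ""; // Fragment-only variants point at the same page.
        const absolute = target.toString();
        if (absolute.startsWith(seedUrl) && !visited.has(absolute)) {
          queue.push(absolute);
        }
      } catch {
        // Skip hrefs that cannot be parsed as URLs.
      }
    });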
  }

  return chunks;
}

Architecture Rationale: cheerio provides a lightweight DOM parser that executes faster than jsdom for extraction tasks. Paragraph-level splitting keeps each chunk on a single topic, which tends to produce more focused embeddings than arbitrary cut points. Storing source URLs in metadata enables traceable citations during generation.
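
The extractor above emits one chunk per paragraph and does not implement the configurable overlap mentioned at the start of this section. Below is a minimal sketch of one way to add it, assuming overlap is measured in whole paragraphs. `chunkWithOverlap`, `maxChars`, and `overlapParagraphs` are hypothetical names, not part of the original pipeline.

```typescript
// Hypothetical helper: groups paragraphs into chunks of roughly `maxChars`
// characters, repeating the last `overlapParagraphs` paragraphs of each chunk
// at the start of the next one so context carries across chunk boundaries.
// A single paragraph longer than `maxChars` is kept whole rather than split.
function chunkWithOverlap(
  paragraphs: string[],
  maxChars = 800,
  overlapParagraphs = 1
): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  let currentLen = 0;

  for (const para of paragraphs) {
    const text = para.replace(/\s+/g, " ").trim();
    if (!text) continue;

    // Close the current chunk when adding this paragraph would exceed the budget.
    if (current.length > 0 && currentLen + text.length + 1 > maxChars) {
      chunks.push(current.join("\n"));
      // Seed the next chunk with the tail of the previous one (the overlap).
      current = overlapParagraphs > 0 ? current.slice(-overlapParagraphs) : [];
      currentLen = current.reduce((n, p) => n + p.length + 1, 0);
    }
    current.push(text);
    currentLen += text.length + 1;
  }
  if (current.length > 0) chunks.push(current.join("\n"));
  return chunks;
}
```

Measuring overlap in paragraphs rather than characters keeps sentence boundaries intact, at the cost of less precise control over chunk size.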

2. Vector Indexing Pipeline

Embedding models convert text into high-dimensional vectors. The text-embedding-004 model produces 768-dimensional vectors suited to short-form content retrieval. The listing below issues one request per chunk for clarity; on larger sites, batching requests cuts round trips and helps avoid rate-limit exhaustion.

// lib/vectorize.ts
const EMBEDDING_API = "https://generativelanguage.googleapis.com/v1beta/models/text-embedding-004:embedContent";

export async function generateEmbeddings(chunks: { id: string; text: string }[], apiKey: string) {
  const results: { id: string; vector: number[] }[] = [];
  
  for (const chunk of chunks) {
    const payload = {
      model: "models/text-embedding-004",
      content: { parts: [{ text: chunk.text }] }
    };

    const res = await fetch(`${EMBEDDING_API}?key=${apiKey}`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload)
    });

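    // Surface HTTP failures (such as 429 rate limits) instead of silently skipping chunks.
    if (!res.ok) {
      throw new Error(`Embedding request failed for ${chunk.id}: ${res.status}`);
    }

    const data = await res.json();
    if (data.embedding?.values) {
      results.push({ id: chunk.id, vector: data.embedding.values });
    }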
  }
  return results;
}
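
The loop above sends one request per chunk and has no retry path. Below is a sketch of batching with exponential backoff and jitter. It assumes the Gemini API's `batchEmbedContents` endpoint with its `requests` and `embeddings` payload shape. The batch size of 50, the retry limits, and the helper names (`toBatches`, `backoffDelayMs`, `embedBatch`, `generateEmbeddingsBatched`) are illustrative choices to check against current API limits.

```typescript
type Chunk = { id: string; text: string };
type Embedded = { id: string; vector: number[] };

const BATCH_API =
  "https://generativelanguage.googleapis.com/v1beta/models/text-embedding-004:batchEmbedContents";

// Split a list into consecutive groups of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) batches.push(items.slice(i, i + size));
  return batches;
}

// Exponential backoff with full jitter: a random delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 500,
  capMs = 8000,
  rand: () => number = Math.random
): number {
  return Math.floor(rand() * Math.min(capMs, baseMs * 2 ** attempt));
}

// Embed one batch, retrying on 429 and 5xx responses.
async function embedBatch(batch: Chunk[], apiKey: string, maxRetries = 4): Promise<Embedded[]> {
  const body = JSON.stringify({
    requests: batch.map(c => ({
      model: "models/text-embedding-004",
      content: { parts: [{ text: c.text }] }
    }))
  });

  for (let attempt = 0; ; attempt++) {
    const res = await fetch(`${BATCH_API}?key=${apiKey}`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body
    });
    if (res.ok) {
      const data = (await res.json()) as { embeddings?: { values: number[] }[] };
      // Assumes embeddings are returned in request order.
      return (data.embeddings ?? []).map((e, i) => ({ id: batch[i].id, vector: e.values }));
    }
    const retryable = res.status === 429 || res.status >= 500;
    if (!retryable || attempt >= maxRetries) {
      throw new Error(`Embedding batch failed with status ${res.status}`);
    }
    await new Promise(resolve => setTimeout(resolve, backoffDelayMs(attempt)));
  }
}

async function generateEmbeddingsBatched(
  chunks: Chunk[],
  apiKey: string,
  batchSize = 50
): Promise<Embedded[]> {
  const results: Embedded[] = [];
  for (const batch of toBatches(chunks, batchSize)) {
    results.push(...(await embedBatch(batch, apiKey)));
  }
  return results;
}
```

Passing `rand` as a parameter keeps the delay calculation deterministic under test while defaulting to `Math.random` in normal use.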

Vectors are upserted to Cloudflare Vectorize using the wrangler CLI or programmatic API. The index uses cosine similarity, which performs reliably for semantic matching across short text segments.
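
For the wrangler CLI route, `wrangler vectorize insert` reads a newline-delimited JSON file in which each line carries `id`, `values`, and optional `metadata`. The sketch below converts embedded chunks into that shape. It also hashes each ID, on the assumption that URL-derived IDs can exceed Vectorize's ID length limit (64 bytes in the limits documented at the time of writing; verify the current value). `IndexedChunk`, `shortId`, and `toNdjson` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

type IndexedChunk = { id: string; vector: number[]; sourceUrl: string; text: string };

// Derive a fixed-length ID (32 hex characters) from an arbitrary-length chunk ID.
function shortId(rawId: string): string {
  return createHash("sha256").update(rawId).digest("hex").slice(0, 32);
}

// Serialize chunks as NDJSON: one JSON object per line.
// `url` and `text` go into metadata because the worker reads both at query time.
function toNdjson(chunks: IndexedChunk[]): string {
  return chunks
    .map(c =>
      JSON.stringify({
        id: shortId(c.id),
        values: c.vector,
        metadata: { url: c.sourceUrl, text: c.text }
      })
    )
    .join("\n");
}
```

Writing the result to a file such as `vectors.ndjson` (a placeholder name) and running `wrangler vectorize insert site-content-v1 --file=vectors.ndjson` would then load the index.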

3. Edge Query Router

The worker acts as the orchestration layer. It embeds incoming queries, retrieves relevant chunks, applies safety filters, and constructs a grounded prompt for the LLM.

// src/worker.ts
export interface Env {
  GEMINI_KEY: string;
  VECTORIZE_INDEX: VectorizeIndex;
  CLINIC_NAME: string;
  SUPPORT_CONTACT: string;
}

export default {
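  async fetch(request: Request, env: Env): Promise<Response> {
    // Answer the CORS preflight: browsers send OPTIONS before a cross-origin JSON POST from the widget.
    if (request.method === "OPTIONS") {
      return new Response(null, {
        status: 204,
        headers: {
          "Access-Control-Allow-Origin": "*",
          "Access-Control-Allow-Methods": "POST, OPTIONS",
          "Access-Control-Allow-Headers": "Content-Type"
        }
      });
    }
    if (request.method !== "POST") return jsonResponse({ error: "Method not allowed" }, 405);

    const { query, sessionHistory = [] } = (await request.json()) as {
      query?: unknown;
      sessionHistory?: { role: string; content: string }[];
    };
    if (!query || typeof query !== "string") {
      return jsonResponse({ error: "Missing query parameter" }, 400);
    }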

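    const defer = () => jsonResponse({
      answer: `I cannot verify that information securely. Please contact ${env.SUPPORT_CONTACT} for immediate assistance.`,
      defer: true
    });

    // 1. Deterministic safety check, run first so sensitive queries never reach the embedding or generation APIs.
    // Word boundaries avoid false hits such as "painting", "chestnut", or "persevere".
    const emergencyKeywords = /\b(pain(s|ful)?|emergenc(y|ies)|chest|bleed(ing|s)?|dizz(y|iness)|severe(ly)?|allerg(y|ies|ic))\b/i;
    if (emergencyKeywords.test(query)) return defer();

    // 2. Embed query
    const queryEmbedding = await embedText(query, env.GEMINI_KEY);

    // 3. Retrieve context ("all" returns the full stored metadata, including chunk text)
    const retrieval = await env.VECTORIZE_INDEX.query(queryEmbedding, {
      topK: 5,
      returnMetadata: "all"
    });

    // Keep only matches above the similarity threshold; defer when none qualify.
    const relevant = retrieval.matches.filter(m => m.score >= 0.52);
    if (relevant.length === 0) return defer();

    const contextBlocks = relevant
      .map(m => `[Source: ${m.metadata?.url}]\n${m.metadata?.text}`)
      .join("\n\n");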

    // 4. Grounded generation
    const prompt = `You are a support assistant for ${env.CLINIC_NAME}. 
Rules:
- Answer ONLY using the provided context.
- If the context lacks the answer, state that clearly.
- Maintain a professional, concise tone.
- Never provide medical advice or diagnostic information.

Context:
${contextBlocks || "No relevant context found."}

Conversation History:
${sessionHistory.map((h: { role: string; content: string }) => `${h.role}: ${h.content}`).join("\n")}

User: ${query}
Assistant:`;

    const llmResponse = await fetch(
      `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key=${env.GEMINI_KEY}`,
      {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          contents: [{ parts: [{ text: prompt }] }],
          generationConfig: { temperature: 0.2, maxOutputTokens: 350 }
        })
      }
    );

    const llmData = await llmResponse.json();
    const generatedText = llmData.candidates?.[0]?.content?.parts?.[0]?.text || "Unable to generate response.";

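    return jsonResponse({
      answer: generatedText,
      // Cite only sources that passed the similarity threshold, without duplicates.
      sources: [...new Set(retrieval.matches.filter(m => m.score >= 0.52).map(m => String(m.metadata?.url)))],
      confidence: retrieval.matches[0]?.score || 0
    });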
  }
};

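async function embedText(text: string, key: string): Promise<number[]> {
  const res = await fetch(`https://generativelanguage.googleapis.com/v1beta/models/text-embedding-004:embedContent?key=${key}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "models/text-embedding-004", content: { parts: [{ text }] } })
  });
  if (!res.ok) throw new Error(`Embedding request failed: ${res.status}`);
  const data = (await res.json()) as { embedding?: { values?: number[] } };
  // An empty vector would only fail later inside the Vectorize query, so fail here with a clear message.
  if (!data.embedding?.values?.length) throw new Error("Embedding response contained no values");
  return data.embedding.values;
}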

function jsonResponse(body: any, status = 200) {
  return new Response(JSON.stringify(body), {
    status,
    headers: { "Content-Type": "application/json", "Access-Control-Allow-Origin": "*" }
  });
}

Architecture Rationale: The 0.52 cosine similarity threshold is a starting point that balances recall and precision for short-form content; calibrate it against logged scores from your own site. Lower thresholds retrieve noisy matches; higher thresholds trigger excessive deferrals. Keyword-based deferral acts as a deterministic safety layer before LLM invocation. temperature: 0.2 reduces sampling variance, which favors consistent answers that stay close to the retrieved context. Session history is passed as plain text to avoid complex serialization overhead while maintaining conversational continuity.
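
Choosing the threshold is easier with a small labeled sample than by intuition. The sketch below assumes a hand-labeled log of top-match scores, where `shouldAnswer` records whether the site actually contained the answer. `ScoredSample`, `evaluateThreshold`, and `pickThreshold` are hypothetical names.

```typescript
type ScoredSample = { topScore: number; shouldAnswer: boolean };

// For one candidate threshold, count the two failure modes:
// - falseDeferrals: answerable queries that fell below the threshold
// - falseAccepts: unanswerable queries that passed the threshold
function evaluateThreshold(samples: ScoredSample[], threshold: number) {
  let falseDeferrals = 0;
  let falseAccepts = 0;
  for (const s of samples) {
    const accepted = s.topScore >= threshold;
    if (s.shouldAnswer && !accepted) falseDeferrals++;
    if (!s.shouldAnswer && accepted) falseAccepts++;
  }
  return { threshold, falseDeferrals, falseAccepts };
}

// Pick the candidate with the fewest total errors; ties go to the higher (stricter) threshold.
function pickThreshold(samples: ScoredSample[], candidates: number[]) {
  return candidates
    .map(t => evaluateThreshold(samples, t))
    .sort(
      (a, b) =>
        a.falseDeferrals + a.falseAccepts - (b.falseDeferrals + b.falseAccepts) ||
        b.threshold - a.threshold
    )[0];
}
```

Weighting the two error types equally is itself an assumption; a site where wrong answers are costly may prefer to penalize `falseAccepts` more heavily.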

4. Isolated Client Widget

The frontend must not interfere with host site styles or JavaScript execution. A shadow DOM encapsulation strategy guarantees CSS isolation and prevents event listener collisions.

// public/widget.js
(function() {
  const CONFIG = {
    workerEndpoint: "https://your-worker.your-subdomain.workers.dev",
    theme: { primary: "#0F766E", bg: "#FFFFFF", text: "#1E293B" }
  };

  const container = document.createElement("div");
  const shadow = container.attachShadow({ mode: "open" });

  shadow.innerHTML = `
    <style>
      :host { all: initial; font-family: system-ui, sans-serif; }
      .trigger { position: fixed; bottom: 20px; right: 20px; width: 52px; height: 52px; border-radius: 50%; background: ${CONFIG.theme.primary}; color: white; display: flex; align-items: center; justify-content: center; cursor: pointer; box-shadow: 0 4px 10px rgba(0,0,0,0.2); z-index: 9999; font-weight: 600; }
      .panel { position: fixed; bottom: 84px; right: 20px; width: 340px; height: 460px; background: ${CONFIG.theme.bg}; border-radius: 10px; box-shadow: 0 6px 18px rgba(0,0,0,0.15); display: none; flex-direction: column; overflow: hidden; z-index: 9999; }
      .panel.active { display: flex; }
      .messages { flex: 1; overflow-y: auto; padding: 10px; display: flex; flex-direction: column; gap: 8px; }
      .bubble { padding: 8px 12px; border-radius: 10px; max-width: 85%; font-size: 13px; line-height: 1.4; }
      .bubble.user { align-self: flex-end; background: ${CONFIG.theme.primary}; color: white; }
      .bubble.assistant { align-self: flex-start; background: #F8FAFC; color: ${CONFIG.theme.text}; border: 1px solid #E2E8F0; }
      .input-area { display: flex; padding: 8px; border-top: 1px solid #E2E8F0; }
      .input-area input { flex: 1; padding: 8px; border: 1px solid #CBD5E1; border-radius: 6px; font-size: 13px; outline: none; }
      .input-area button { margin-left: 6px; padding: 8px 14px; background: ${CONFIG.theme.primary}; color: white; border: none; border-radius: 6px; cursor: pointer; }
    </style>
    <div class="trigger" id="trigger">💬</div>
    <div class="panel" id="panel">
      <div class="messages" id="messages"></div>
      <form class="input-area" id="form">
        <input id="input" type="text" placeholder="Ask a question..." required />
        <button type="submit">Send</button>
      </form>
    </div>
  `;

  document.body.appendChild(container);

  const $ = (sel) => shadow.querySelector(sel);
  let history = [];

  $("#trigger").addEventListener("click", () => $("#panel").classList.toggle("active"));
  
  $("#form").addEventListener("submit", async (e) => {
    e.preventDefault();
    const input = $("#input");
    const text = input.value.trim();
    if (!text) return;

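    input.value = "";
    appendMessage("user", text);
    // Snapshot earlier turns before recording this message; the worker receives the new message separately as `query`.
    const priorTurns = history.slice(-4);
    history.push({ role: "user", content: text });

    try {
      const res = await fetch(CONFIG.workerEndpoint, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ query: text, sessionHistory: priorTurns })
      });
      const data = await res.json();
      if (res.ok && data.answer) {
        appendMessage("assistant", data.answer);
        history.push({ role: "assistant", content: data.answer });
      } else {
        // Error responses carry no `answer` field.
        appendMessage("assistant", "Something went wrong. Please try again later.");
      }
    } catch (err) {
      appendMessage("assistant", "Connection failed. Please try again later.");
    }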
  });

  function appendMessage(role, text) {
    const div = document.createElement("div");
    div.className = `bubble ${role}`;
    div.textContent = text;
    $("#messages").appendChild(div);
    $("#messages").scrollTop = $("#messages").scrollHeight;
  }
})();

Architecture Rationale: The widget uses an IIFE to avoid global namespace pollution. Shadow DOM isolates the widget's styles from the host page, which is critical for legacy themes with aggressive CSS resets. History is capped at the last 4 messages (two exchanges) to prevent context window overflow and reduce token consumption. Both user input and model output are rendered via textContent, so neither is parsed as HTML; this closes the XSS vector without requiring a sanitization library.
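
The history cap can also be expressed as a small sliding-window helper that counts exchanges rather than raw messages. This is a sketch under the assumption that history alternates between user and assistant turns; `Turn` and `slidingWindow` are hypothetical names, written in TypeScript for consistency with the other listings.

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Keep only the most recent `maxExchanges` user/assistant pairs.
// If the cut leaves an assistant message without its user prompt at the front,
// that message is dropped too, so the window always starts with a user turn.
function slidingWindow(history: Turn[], maxExchanges = 2): Turn[] {
  const recent = maxExchanges > 0 ? history.slice(-maxExchanges * 2) : [];
  const firstUser = recent.findIndex(t => t.role === "user");
  return firstUser === -1 ? [] : recent.slice(firstUser);
}
```

Starting the window on a user turn avoids sending the model an orphaned reply whose question was truncated away.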

Pitfall Guide

Pitfall	Explanation	Fix
Fixed-Character Chunking	Splitting text at arbitrary byte boundaries fractures sentences and destroys semantic coherence, causing retrieval failures.	Use paragraph-aware splitting with 10–15% overlap. Preserve punctuation boundaries and strip whitespace aggressively.
Ignoring Embedding Drift	Vector indices become stale as site content updates. Unversioned indexes return outdated answers without warning.	Implement index versioning (e.g., `rag-v2-2024`). Schedule monthly re-indexing via CI cron and swap index references atomically.
Over-Confident Retrieval Thresholds	Cosine similarity scores vary by domain and embedding model. Hardcoding `0.7` often triggers false deferrals on niche queries.	Calibrate thresholds empirically. Start at `0.50–0.55` for short-form content. Log score distributions and adjust based on false-positive rates.
Unbounded Conversation History	Passing full chat history inflates context windows, increases latency, and spikes token costs. LLMs degrade with excessive turns.	Truncate to last 3–4 exchanges. Implement a sliding window. Summarize older turns only if conversational continuity is critical.
CSS Namespace Collisions	Injecting widgets into legacy themes causes style bleeding, breaking host layouts or hiding UI elements.	Always use Shadow DOM. Apply `all: initial` to the host element. Avoid global class names; scope everything to the shadow root.
Missing Rate Limit Handling	Sequential API calls trigger `429 Too Many Requests` errors, halting indexing or query processing.	Implement exponential backoff with jitter. Batch embedding requests where API supports it. Queue worker invocations during traffic spikes.
Hard-Coding Safety Rules	Keyword matching misses nuanced medical, financial, or legal queries that require deferral.	Add a lightweight classification step before generation. Use a separate low-cost model or rule-based classifier to flag sensitive domains.
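
The rule-based option in the last row can be sketched as a small classifier that runs before retrieval. The categories and patterns below are illustrative assumptions, not a vetted safety list, and `SensitiveDomain`, `SENSITIVE_PATTERNS`, and `classifySensitive` are hypothetical names.

```typescript
type SensitiveDomain = "medical" | "financial" | "legal";

// Illustrative patterns only; a production list needs domain review.
const SENSITIVE_PATTERNS: Record<SensitiveDomain, RegExp> = {
  medical: /\b(diagnos\w*|symptom\w*|dosage|prescri\w*|medication\w*)\b/i,
  financial: /\b(loan\w*|invest\w*|mortgage\w*|tax(es)?)\b/i,
  legal: /\b(lawsuit\w*|liabilit\w*|sue|attorney\w*|contract\w*)\b/i
};

// Return every sensitive domain the query touches; an empty array means no rule fired.
function classifySensitive(query: string): SensitiveDomain[] {
  return (Object.keys(SENSITIVE_PATTERNS) as SensitiveDomain[]).filter(domain =>
    SENSITIVE_PATTERNS[domain].test(query)
  );
}
```

A non-empty result could route the query to the same deferral response the worker already uses for emergency keywords.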

Production Bundle

Action Checklist

Audit existing site content for structured markup; add semantic tags if missing to improve extraction accuracy.
Configure wrangler.toml with environment variables for API keys and vector index bindings.
Run the ingestion script against a staging environment; validate chunk quality and metadata completeness.
Deploy the worker to a preview route; test query routing, confidence thresholds, and safety deferrals.
Inject the widget script into the production footer; verify shadow DOM isolation across major browsers.
Set up a monthly CI job to re-crawl, re-embed, and upsert vectors; monitor index size and query latency.
Implement logging for retrieval scores, deferral rates, and token consumption to track cost drift.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
<500 queries/month	Edge-Native RAG (Workers + Vectorize)	Free tier covers all operations; zero infra management	$0
500–5,000 queries/month	Edge-Native RAG with paid Vectorize	Predictable scaling; maintains grounding accuracy	$3–$8/mo
>5,000 queries/month	Dedicated vector DB (Pinecone/Weaviate) + Workers	Higher throughput; better concurrency handling	$20–$50/mo
Multi-language support	`text-embedding-004` + Gemini 2.5 Flash	Single pipeline with no translation step; validate retrieval quality per language before launch	$0–$5/mo
Strict compliance (HIPAA/GDPR)	Self-hosted embeddings + encrypted vector store	Data residency control; audit trail requirements	$50–$150/mo

Configuration Template

# wrangler.toml
name = "ai-support-router"
main = "src/worker.ts"
compatibility_date = "2024-06-01"

[vars]
CLINIC_NAME = "Your Organization"
SUPPORT_CONTACT = "support@example.com"

[[vectorize]]
binding = "VECTORIZE_INDEX"
index_name = "site-content-v1"

[observability]
enabled = true

// env.d.ts
interface Env {
  GEMINI_KEY: string;
  VECTORIZE_INDEX: VectorizeIndex;
  CLINIC_NAME: string;
  SUPPORT_CONTACT: string;
}

Quick Start Guide

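Initialize the project: Run npm create cloudflare@latest ai-bot, choose a basic Worker template with TypeScript when prompted, and install dependencies (cheerio, typescript).
Configure credentials: Add GEMINI_KEY to your .dev.vars file for local development and run wrangler secret put GEMINI_KEY for the deployed worker.
Build the index: Execute the ingestion script locally, verify chunk output, then run wrangler vectorize create site-content-v1 --dimensions=768 --metric=cosine followed by the upsert routine (for example, wrangler vectorize insert site-content-v1 --file=vectors.ndjson, where vectors.ndjson is a placeholder file name).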
Deploy & integrate: Run wrangler deploy, copy the worker URL, paste it into the widget configuration object, and append the script tag to your site’s footer. Verify functionality in an incognito window.
