Building a RAG Assistant: I Built a Desktop RAG Chatbot From Scratch, and Here's Everything I Learned
This week I moved past basic tutorials to build a structured, package-based desktop RAG application with dynamic document ingestion, a custom GUI, and a modular Python backend.
When starting out with Generative AI and Retrieval-Augmented Generation (RAG), it's easy to fall into the "what is this and what am I even doing" beginner trap. You run the tutorial, it works, you feel great, and then you have no idea how to build anything real with it.
Inspired by Pixegami's foundational LangChain RAG tutorials, I decided to take things further. I built Cosmic RAG Assistant, a desktop-native RAG chatbot with a complete GUI, dynamic data ingestion pipelines, and a highly structured, modular backend.
Here is exactly how I built it, what I learned about data management, and the engineering principles I applied to make it robust.
What Is RAG, Actually?
Large Language Models like GPT-4 are trained on enormous datasets, but that training has a hard cutoff. They have no knowledge of your private documents, your company's internal wiki, your research notes, or anything you haven't explicitly shown them.
Retrieval-Augmented Generation (RAG) solves this by adding a retrieval step before generation. Instead of asking the model to answer from memory alone, you:
- Pre-process your documents and store them in a searchable vector database
- At query time, search that database for the most semantically relevant chunks of text
- Inject those chunks as context into the LLM's prompt
- Let the model generate a grounded answer using your actual source material
The model is no longer guessing from training data; it's reading your documents and synthesizing an answer. This is the core loop that makes RAG so powerful for knowledge-intensive applications.
User Question
      │
      ▼
[Embed the query] ──► Vector Search ──► Top-K Relevant Chunks
      │
      ▼
[Inject into prompt]
      │
      ▼
LLM generates answer
The Vector Database: What Is Chroma and Why Does It Matter?
Traditional search engines work on keywords. If you search for "inverted fullback," a keyword search won't return a document that says "fullback who stepped into midfield," even though they mean the same thing.
Vector databases solve this with embeddings. An embedding model converts any piece of text into a dense numerical vector (typically 1,536 numbers for OpenAI's text-embedding-3-small). The magic is that semantically similar texts produce vectors that are geometrically close to each other in high-dimensional space.
Chroma is an open-source, locally persistent vector database. In this app, it serves as the core memory layer:
- It stores each document chunk alongside its embedding vector
- When you ask a question, it converts your question to an embedding too, then uses cosine similarity to find the stored chunks whose vectors are closest to your query vector
- It returns the top-K most relevant chunks with relevance scores
This is why the system can answer "how should the fullback be positioned in defense and attack in the inverted fullback role?" even if your document only uses terms like "invert" or "overlap"; the embedding model understands they are semantically equivalent.
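To make "geometrically close" concrete, here is a minimal sketch that embeds two phrasings of the same idea and computes their cosine similarity by hand (it assumes an OPENAI_API_KEY in the environment and the langchain_openai package; the phrases are just examples):

import math
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Two different phrasings of the same footballing idea
vec_a = embeddings.embed_query("inverted fullback")
vec_b = embeddings.embed_query("fullback who stepped into midfield")

# Cosine similarity: dot product divided by the product of the magnitudes
dot = sum(a * b for a, b in zip(vec_a, vec_b))
norm_a = math.sqrt(sum(a * a for a in vec_a))
norm_b = math.sqrt(sum(b * b for b in vec_b))
print(dot / (norm_a * norm_b))  # values near 1.0 mean "semantically similar"

The exact score will vary between runs and models, but related phrasings consistently score far higher than unrelated ones, which is all retrieval needs.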
In the codebase, Chroma is initialized and queried like this:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# COLLECTION_NAME and CHROMA_DIR come from config.py (shown below)
db = Chroma(
    collection_name=COLLECTION_NAME,
    persist_directory=str(CHROMA_DIR),
    embedding_function=OpenAIEmbeddings(),
)
results = db.similarity_search_with_relevance_scores(query_text, k=4)
The persist_directory is critical: it means the vector store survives between app restarts. No re-indexing every time you open the app.
Chunking: The Most Underrated Step in the Entire Pipeline
Before any embedding can happen, your raw documents have to be broken into smaller pieces. This is called chunking, and it is the step that most beginners get wrong.
Here's the problem: LLMs and embedding models have context windows. You can't embed a 200-page PDF as a single unit. And even if you could, a huge blob of text makes terrible retrieval β if you retrieve the whole document every time, your context window fills up with irrelevant content.
The goal of chunking is to cut your documents into the smallest units that are still self-contained and meaningful.
This app uses LangChain's RecursiveCharacterTextSplitter:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)
Here is what each parameter does:
chunk_size=1000: Each chunk is at most 1,000 characters. This is roughly 200–250 words, small enough to be precise in retrieval but large enough to carry full sentences and context.
chunk_overlap=100: Consecutive chunks share 100 characters of content. This is the key to preventing information loss at boundaries. Imagine a critical sentence that falls exactly at the border between chunk 3 and chunk 4. Without overlap, neither chunk contains the complete thought. With 100 characters of overlap, that sentence exists in both, ensuring at least one gets retrieved.
add_start_index=True: Each chunk's metadata includes its character offset in the original document. This enables precise source attribution: you can tell the user not just which file the answer came from, but exactly where in that file.
The "Recursive" Part: The splitter doesn't just blindly cut every 1,000 characters. It works through a hierarchy of natural separators in order: \n\n (paragraphs), then \n (lines), then spaces (words), then individual characters as a last resort. This means it always tries to make cuts at the most natural linguistic boundary possible.
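To see the overlap and the start_index metadata in practice, here is a small self-contained sketch (the sample text is made up purely for illustration):

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)

# A stand-in for the text extracted from one of your documents
sample_text = "Notes on the inverted fullback role.\n\n" + ("More tactical notes follow here. " * 150)

chunks = splitter.create_documents([sample_text])
for chunk in chunks[:3]:
    # Each chunk remembers where it began in the original document
    print(chunk.metadata["start_index"], len(chunk.page_content))

Consecutive chunks start roughly 900 characters apart rather than 1,000, because the last 100 characters of one chunk are repeated at the start of the next.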
Technical Architecture & Directory Design
One of my primary goals was strict separation of concerns. Instead of throwing all scripts in the root folder, the app logic lives within a clean Python package (rag_app/), exposed through single-responsibility entry-point scripts.
RAG-Application/
├── Assets/              # UI media and background assets
├── chroma/              # Persistent local vector store
├── Data/                # Uploaded source documents (sources/ batch folders)
├── rag_app/             # Core application package
│   ├── __init__.py
│   ├── __main__.py
│   ├── config.py        # Environment variables & global constants
│   ├── ingestion.py     # File processing & text splitting
│   ├── rag_service.py   # Embedding generation & vector querying
│   └── ui.py            # Tkinter front-end & chat interface
├── ask_rag.py           # CLI: one-off terminal queries
├── index_sources.py     # Background task: rebuilds Chroma DB
├── run_chat_app.py      # Entry point: launches the desktop app
├── GUI.py               # Legacy compatibility wrapper
└── .env                 # Protected API keys
config.py: The Single Source of Truth
Every path and constant lives here. No magic strings scattered across files.
from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent.parent
DATA_DIR = BASE_DIR / "Data"
SOURCES_DIR = DATA_DIR / "sources"
CHROMA_DIR = BASE_DIR / "chroma"
SUPPORTED_EXTENSIONS = {".pdf", ".md", ".txt"}
COLLECTION_NAME = "rag_documents"
This also means the app is relocatable β move the folder, and all paths still resolve correctly because they are derived from __file__, not hardcoded absolute paths.
ingestion.py: The Data Pipeline
This module owns everything from "user selects files" to "chunks ready to embed." Key design decisions:
Timestamped source folders: every upload batch gets its own folder (source_20260515_125311/). This creates a natural audit trail: you can see exactly when each set of documents was added, and rolling back is as simple as deleting a folder before re-indexing (see the sketch after this list).
Legacy migration: if any files were placed directly in the Data/ root (an older pattern), the app automatically moves them into source_legacy/ on startup. Users don't lose data when the directory structure changes.
Strict type enforcement per batch: you can't mix PDFs and Markdown in a single upload. This prevents loader mismatches and makes debugging ingestion problems straightforward.
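For illustration, a copy_files_to_source() covering the timestamped-folder and type-enforcement decisions might look roughly like this. The function name comes from the app; the body is a hedged sketch, not the repo's exact code:

import shutil
from datetime import datetime
from pathlib import Path

SOURCES_DIR = Path("Data/sources")            # in the app this comes from config.py
SUPPORTED_EXTENSIONS = {".pdf", ".md", ".txt"}

def copy_files_to_source(file_paths: list[str]) -> Path:
    """Copy an upload batch into its own timestamped folder under Data/sources/."""
    extensions = {Path(p).suffix.lower() for p in file_paths}
    if not extensions <= SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file types: {extensions - SUPPORTED_EXTENSIONS}")
    if len(extensions) > 1:
        raise ValueError("Please upload a single file type per batch.")

    # One folder per upload batch, named by timestamp for a natural audit trail
    batch_dir = SOURCES_DIR / f"source_{datetime.now():%Y%m%d_%H%M%S}"
    batch_dir.mkdir(parents=True, exist_ok=False)
    for path in file_paths:
        shutil.copy2(path, batch_dir / Path(path).name)
    return batch_dir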
rag_service.py: The Intelligence Layer
This is where retrieval and generation actually happen. Two functions do all the work:
rebuild_vector_store(): Discovers all documents, splits them into chunks, embeds every chunk via OpenAIEmbeddings, and writes everything to Chroma. The collection is deleted and rebuilt from scratch on each call, which ensures the index is always consistent with what's on disk.
ask_question(): Takes a user query, embeds it, runs similarity search with relevance scores, and filters out weak matches:
if not results or results[0][1] < 0.4:
    return {"answer": "I don't know based on the uploaded documents.", ...}
That 0.4 threshold is important. Without it, similarity search always returns something, and the model will happily hallucinate an answer from weakly related chunks even when the documents contain nothing relevant. The threshold forces honest "I don't know" responses.
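Pulling those pieces together, the whole of ask_question() boils down to roughly the following sketch. It reuses the constants from config.py and the PROMPT_TEMPLATE shown just below; the real function also returns relevance scores to the UI, so treat this as an approximation rather than the repo's exact code:

from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

def ask_question(query_text: str) -> dict:
    # Open the persistent vector store (COLLECTION_NAME / CHROMA_DIR from config.py)
    db = Chroma(
        collection_name=COLLECTION_NAME,
        persist_directory=str(CHROMA_DIR),
        embedding_function=OpenAIEmbeddings(),
    )

    # Retrieve the 4 most relevant chunks, each paired with a relevance score
    results = db.similarity_search_with_relevance_scores(query_text, k=4)
    if not results or results[0][1] < 0.4:
        return {"answer": "I don't know based on the uploaded documents.", "sources": []}

    # Stitch the retrieved chunks into the prompt and generate a grounded answer
    context = "\n\n---\n\n".join(doc.page_content for doc, _score in results)
    prompt = PROMPT_TEMPLATE.format(context=context, question=query_text)
    response = ChatOpenAI().invoke(prompt)

    return {
        "answer": response.content,
        "sources": [doc.metadata.get("source") for doc, _score in results],
    }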
The prompt template keeps the model grounded:
PROMPT_TEMPLATE = """
Use the following context to answer the question.
If the answer is not in the context, say "I don't know".
Context: {context}
Question: {question}
"""
ui.py: The Desktop GUI
Built entirely with Tkinter and ttk; no web framework needed. The app runs natively on any platform Python supports.
The UI flow mirrors the RAG pipeline itself:
- Upload Files: calls copy_files_to_source(), which saves files into a timestamped batch folder
- Index Sources: calls rebuild_vector_store(), which walks all batch folders and rebuilds Chroma
- Ask: calls ask_question(), which renders the answer and source citations in the chat view
Chat bubbles are right-aligned for user messages and left-aligned for assistant responses: standard messenger UX, built entirely with tk.Label and ttk.Frame.
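For a sense of how little code that alignment takes, here is a rough Tkinter sketch of a chat-bubble helper (illustrative only, not the widget code from the repo; the colors and wrap width are arbitrary):

import tkinter as tk
from tkinter import ttk

def add_bubble(chat_frame: ttk.Frame, text: str, is_user: bool) -> None:
    """Append a chat bubble: right-aligned for the user, left-aligned for the assistant."""
    row = ttk.Frame(chat_frame)
    bubble = tk.Label(
        row,
        text=text,
        wraplength=320,
        justify="left",
        padx=10, pady=6,
        bg="#2f6fed" if is_user else "#e8e8e8",
        fg="white" if is_user else "black",
    )
    # Packing to the right or left of the row is what creates the messenger layout
    bubble.pack(side="right" if is_user else "left")
    row.pack(fill="x", padx=8, pady=4)

# Usage inside the chat view (hypothetical widget names):
# add_bubble(chat_area, "How should the fullback be positioned?", is_user=True)
# add_bubble(chat_area, "Based on the uploaded documents...", is_user=False)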
The Full Data Flow, End to End
1. User uploads files via GUI
       │
       ▼
2. Files copied to Data/sources/source_YYYYMMDD_HHMMSS/
3. User clicks "Index Sources"
       │
       ▼
4. ingestion.discover_documents()
   ├─ Recursively scans all source folders
   └─ Loads PDFs via PyPDFLoader, text/md via TextLoader
5. ingestion.split_documents()
   ├─ RecursiveCharacterTextSplitter
   ├─ chunk_size=1000, chunk_overlap=100
   └─ Produces N chunks with start_index metadata
6. OpenAIEmbeddings converts each chunk → 1536-dim vector
7. Chroma stores (chunk_text, vector, metadata) on disk
8. User asks a question
       │
       ▼
9. Query text → OpenAIEmbeddings → query vector
10. Chroma similarity_search_with_relevance_scores(k=4)
    ├─ Returns top 4 chunks by cosine similarity
    └─ Filters out chunks with score < 0.4
11. Chunks injected into PROMPT_TEMPLATE as {context}
12. ChatOpenAI generates grounded answer
13. Answer + source file paths + relevance scores returned to UI
What I'd Do Differently Next Time
Async indexing: Right now the indexing step blocks the UI thread. For large document sets, this freezes the window. Moving rebuild_vector_store() into a background thread with a progress indicator would be the right fix (see the sketch after this list).
Incremental indexing: Every index rebuild deletes and recreates the entire collection. For large corpora, it would be much faster to hash each source batch folder and only embed changed files.
Metadata-filtered search: Chroma supports filtering by metadata at query time. With the start_index and source metadata already in place, you could easily add a document selector to the UI, letting users query only specific uploaded batches.
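For the async indexing item above, a minimal sketch of the threading approach (assuming the existing rebuild_vector_store() and a Tkinter root window with a status label; the names here are placeholders) could look like this:

import threading
import tkinter as tk

def start_indexing(root: tk.Tk, status_label: tk.Label) -> None:
    """Run the rebuild in a worker thread so the Tkinter main loop stays responsive."""
    status_label.config(text="Indexing...")

    def worker():
        rebuild_vector_store()  # the existing blocking call
        # Tkinter widgets should only be touched from the main thread,
        # so hand the UI update back via root.after()
        root.after(0, lambda: status_label.config(text="Indexing complete"))

    threading.Thread(target=worker, daemon=True).start()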
Stack Summary
| Component        | Tool                                          |
|------------------|-----------------------------------------------|
| LLM              | OpenAI GPT (via langchain_openai.ChatOpenAI)  |
| Embeddings       | langchain_openai.OpenAIEmbeddings             |
| Vector DB        | Chroma (local, persistent)                    |
| Document loading | PyPDFLoader, TextLoader (LangChain community) |
| Text splitting   | RecursiveCharacterTextSplitter                |
| GUI              | Tkinter + ttk + Pillow                        |
| Orchestration    | LangChain Core                                |
| Config           | python-dotenv                                 |
If you're learning RAG, the best advice I can give is: stop following tutorials and build something structured from the ground up. The moment you have to think about where your data lives, how it gets there, and what happens when something goes wrong, that's when you actually understand it.
As a famous brand once said: "Just do it!"
The full source is on GitHub. Drop questions in the comments; happy to dig into any part of the implementation.

