Building a RAG Assistant: I Built a Desktop RAG Chatbot From Scratch, and Here's Everything I Learned
This week I moved past basic tutorials to build a structured, package-based desktop RAG application with dynamic document ingestion, a custom GUI, and a modular Python backend.
When starting out with Generative AI and Retrieval-Augmented Generation (RAG), it's easy to fall into the "what is this and what am I even doing" beginner trap. You run the tutorial, it works, you feel great, and then you have no idea how to build anything real with it.
Inspired by Pixegami's foundational LangChain RAG tutorials, I decided to take things further. I built Cosmic RAG Assistant, a desktop-native RAG chatbot with a complete GUI, dynamic data ingestion pipelines, and a highly structured, modular backend.
Here is exactly how I built it, what I learned about data management, and the engineering principles I applied to make it robust.
What Is RAG, Actually?
Large Language Models like GPT-4 are trained on enormous datasets, but that training has a hard cutoff. They have no knowledge of your private documents, your company's internal wiki, your research notes, or anything you haven't explicitly shown them.
Retrieval-Augmented Generation (RAG) solves this by adding a retrieval step before generation. Instead of asking the model to answer from memory alone, you:
- Pre-process your documents and store them in a searchable vector database
- At query time, search that database for the most semantically relevant chunks of text
- Inject those chunks as context into the LLM's prompt
- Let the model generate a grounded answer using your actual source material
The model is no longer guessing from training data; it's reading your documents and synthesizing an answer. This is the core loop that makes RAG so powerful for knowledge-intensive applications.
User Question
      │
      ▼
[Embed the query] ──► Vector Search ──► Top-K Relevant Chunks
      │
      ▼
[Inject into prompt]
      │
      ▼
LLM generates answer
The Vector Database: What Is Chroma and Why Does It Matter?
Traditional search engines work on keywords. If you search for "inverted fullback," a keyword search won't return a document that says "fullback who stepped into midfield," even though they mean the same thing.
Vector databases solve this with embeddings. An embedding model converts any piece of text into a dense numerical vector (typically 1,536 numbers for OpenAI's text-embedding-3-small). The magic is that semantically similar texts produce vectors that are geometrically close to each other in high-dimensional space.
Chroma is an open-source, locally persistent vector database. In this app, it serves as the core memory layer:
- It stores each document chunk alongside its embedding vector
- When you ask a question, it converts your question to an embedding too, then uses cosine similarity to find the stored chunks whose vectors are closest to your query vector
- It returns the top-K most relevant chunks with relevance scores
This is why the system can answer "how should the fullback be positioned in defense and attack in the inverted fullback role?" even if your document only uses terms like "invert" or "overlap"; the embedding model understands they are semantically equivalent.
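To make "geometrically close" concrete, here is a minimal sketch that embeds two phrasings of the same idea and computes their cosine similarity by hand (it assumes an OPENAI_API_KEY in the environment and the langchain_openai package; the phrases are just examples):

import math
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Two different phrasings of the same footballing idea
vec_a = embeddings.embed_query("inverted fullback")
vec_b = embeddings.embed_query("fullback who stepped into midfield")

# Cosine similarity: dot product divided by the product of the magnitudes
dot = sum(a * b for a, b in zip(vec_a, vec_b))
norm_a = math.sqrt(sum(a * a for a in vec_a))
norm_b = math.sqrt(sum(b * b for b in vec_b))
print(dot / (norm_a * norm_b))  # values near 1.0 mean "semantically similar"

The exact score will vary between runs and models, but related phrasings consistently score far higher than unrelated ones, which is all retrieval needs.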
In the codebase, Chroma is initialized and queried like this:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# COLLECTION_NAME and CHROMA_DIR come from config.py (shown below)
db = Chroma(
    collection_name=COLLECTION_NAME,
    persist_directory=str(CHROMA_DIR),
    embedding_function=OpenAIEmbeddings(),
)
results = db.similarity_search_with_relevance_scores(query_text, k=4)
The persist_directory is critical: it means the vector store survives between app restarts. No re-indexing every time you open the app.
Chunking: The Most Underrated Step in the Entire Pipeline
Before any embedding can happen, your raw documents have to be broken into smaller pieces. This is called chunking, and it is the step that most beginners get wrong.
Here's the problem: LLMs and embedding models have context windows. You can't embed a 200-page PDF as a single unit. And even if you could, a huge blob of text makes terrible retrieval β if you retrieve the whole document every time, your context window fills up with irrelevant content.
The goal of chunking is to cut your documents into the smallest units that are still self-contained and meaningful.
This app uses LangChain's RecursiveCharacterTextSplitter:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)
Here is what each parameter does:
chunk_size=1000: Each chunk is at most 1,000 characters. This is roughly 200–250 words, small enough to be precise in retrieval but large enough to carry full sentences and context.
chunk_overlap=100: Consecutive chunks share 100 characters of content. This is the key to preventing information loss at boundaries. Imagine a critical sentence that falls exactly at the border between chunk 3 and chunk 4. Without overlap, neither chunk contains the complete thought. With 100 characters of overlap, that sentence exists in both, ensuring at least one gets retrieved.
add_start_index=True: Each chunk's metadata includes its character offset in the original document. This enables precise source attribution: you can tell the user not just which file the answer came from, but exactly where in that file.
The "Recursive" Part: The splitter doesn't just blindly cut every 1,000 characters. It works through a hierarchy of natural separators in order: \n\n (paragraphs), then \n (lines), then spaces (words), then individual characters as a last resort. This means it always tries to make cuts at the most natural linguistic boundary possible.
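To see the overlap and the start_index metadata in practice, here is a small self-contained sketch (the sample text is made up purely for illustration):

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)

# A stand-in for the text extracted from one of your documents
sample_text = "Notes on the inverted fullback role.\n\n" + ("More tactical notes follow here. " * 150)

chunks = splitter.create_documents([sample_text])
for chunk in chunks[:3]:
    # Each chunk remembers where it began in the original document
    print(chunk.metadata["start_index"], len(chunk.page_content))

Consecutive chunks start roughly 900 characters apart rather than 1,000, because the last 100 characters of one chunk are repeated at the start of the next.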
Technical Architecture & Directory Design
One of my primary goals was strict separation of concerns. Instead of throwing all scripts in the root folder, the app logic lives within a clean Python package (rag_app/), exposed through single-responsibility entry-point scripts.
RAG-Application/
├── Assets/              # UI media and background assets
├── chroma/              # Persistent local vector store
├── Data/                # Uploaded source documents (sources/ batch folders)
├── rag_app/             # Core application package
│   ├── __init__.py
│   ├── __main__.py
│   ├── config.py        # Environment variables & global constants
│   ├── ingestion.py     # File processing & text splitting
│   ├── rag_service.py   # Embedding generation & vector querying
│   └── ui.py            # Tkinter front-end & chat interface
├── ask_rag.py           # CLI: one-off terminal queries
├── index_sources.py     # Background task: rebuilds Chroma DB
├── run_chat_app.py      # Entry point: launches the desktop app
├── GUI.py               # Legacy compatibility wrapper
└── .env                 # Protected API keys
config.py: The Single Source of Truth
Every path and constant lives here. No magic strings scattered across files.
from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent.parent
DATA_DIR = BASE_DIR / "Data"
SOURCES_DIR = DATA_DIR / "sources"
CHROMA_DIR = BASE_DIR / "chroma"
SUPPORTED_EXTENSIONS = {".pdf", ".md", ".txt"}
COLLECTION_NAME = "rag_documents"
This also means the app is relocatable β move the folder, and all paths still resolve correctly because they are derived from __file__, not hardcoded absolute paths.
ingestion.py: The Data Pipeline
This module owns everything from "user selects files" to "chunks ready to embed." Key design decisions:
Timestamped source folders: every upload batch gets its own folder (source_20260515_125311/). This creates a natural audit trail: you can see exactly when each set of documents was added, and rolling back is as simple as deleting a folder before re-indexing (see the sketch after this list).
Legacy migration: if any files were placed directly in the Data/ root (an older pattern), the app automatically moves them into source_legacy/ on startup. Users don't lose data when the directory structure changes.
Strict type enforcement per batch: you can't mix PDFs and Markdown in a single upload. This prevents loader mismatches and makes debugging ingestion problems straightforward.
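For illustration, a copy_files_to_source() covering the timestamped-folder and type-enforcement decisions might look roughly like this. The function name comes from the app; the body is a hedged sketch, not the repo's exact code:

import shutil
from datetime import datetime
from pathlib import Path

SOURCES_DIR = Path("Data/sources")            # in the app this comes from config.py
SUPPORTED_EXTENSIONS = {".pdf", ".md", ".txt"}

def copy_files_to_source(file_paths: list[str]) -> Path:
    """Copy an upload batch into its own timestamped folder under Data/sources/."""
    extensions = {Path(p).suffix.lower() for p in file_paths}
    if not extensions <= SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file types: {extensions - SUPPORTED_EXTENSIONS}")
    if len(extensions) > 1:
        raise ValueError("Please upload a single file type per batch.")

    # One folder per upload batch, named by timestamp for a natural audit trail
    batch_dir = SOURCES_DIR / f"source_{datetime.now():%Y%m%d_%H%M%S}"
    batch_dir.mkdir(parents=True, exist_ok=False)
    for path in file_paths:
        shutil.copy2(path, batch_dir / Path(path).name)
    return batch_dir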
rag_service.py: The Intelligence Layer
This is where retrieval and generation actually happen. Two functions do all the work:
rebuild_vector_store(): Discovers all documents, splits them into chunks, embeds every chunk via OpenAIEmbeddings, and writes everything to Chroma. The collection is deleted and rebuilt from scratch on each call, which ensures the index is always consistent with what's on disk.
ask_question(): Takes a user query, embeds it, runs similarity search with relevance scores, and filters out weak matches:
if not results or results[0][1] < 0.4:
    return {"answer": "I don't know based on the uploaded documents.", ...}
That 0.4 threshold is important. Without it, similarity search always returns something, and the model will happily hallucinate an answer from weakly related chunks even when the documents contain nothing relevant. The threshold forces honest "I don't know" responses.
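Pulling those pieces together, the whole of ask_question() boils down to roughly the following sketch. It reuses the constants from config.py and the PROMPT_TEMPLATE shown just below; the real function also returns relevance scores to the UI, so treat this as an approximation rather than the repo's exact code:

from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

def ask_question(query_text: str) -> dict:
    # Open the persistent vector store (COLLECTION_NAME / CHROMA_DIR from config.py)
    db = Chroma(
        collection_name=COLLECTION_NAME,
        persist_directory=str(CHROMA_DIR),
        embedding_function=OpenAIEmbeddings(),
    )

    # Retrieve the 4 most relevant chunks, each paired with a relevance score
    results = db.similarity_search_with_relevance_scores(query_text, k=4)
    if not results or results[0][1] < 0.4:
        return {"answer": "I don't know based on the uploaded documents.", "sources": []}

    # Stitch the retrieved chunks into the prompt and generate a grounded answer
    context = "\n\n---\n\n".join(doc.page_content for doc, _score in results)
    prompt = PROMPT_TEMPLATE.format(context=context, question=query_text)
    response = ChatOpenAI().invoke(prompt)

    return {
        "answer": response.content,
        "sources": [doc.metadata.get("source") for doc, _score in results],
    }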
The prompt template keeps the model grounded:
PROMPT_TEMPLATE = """
Use the following context to answer the question.
If the answer is not in the context, say "I don't know".
Context: {context}
Question: {question}
"""
ui.py: The Desktop GUI
Built entirely with Tkinter and ttk; no web framework needed. The app runs natively on any platform Python supports.
The UI flow mirrors the RAG pipeline itself:
- Upload Files: calls copy_files_to_source(), which saves files into a timestamped batch folder
- Index Sources: calls rebuild_vector_store(), which walks all batch folders and rebuilds Chroma
- Ask: calls ask_question(), which renders the answer and source citations in the chat view
Chat bubbles are right-aligned for user messages and left-aligned for assistant responses: standard messenger UX, built entirely with tk.Label and ttk.Frame.
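For a sense of how little code that alignment takes, here is a rough Tkinter sketch of a chat-bubble helper (illustrative only, not the widget code from the repo; the colors and wrap width are arbitrary):

import tkinter as tk
from tkinter import ttk

def add_bubble(chat_frame: ttk.Frame, text: str, is_user: bool) -> None:
    """Append a chat bubble: right-aligned for the user, left-aligned for the assistant."""
    row = ttk.Frame(chat_frame)
    bubble = tk.Label(
        row,
        text=text,
        wraplength=320,
        justify="left",
        padx=10, pady=6,
        bg="#2f6fed" if is_user else "#e8e8e8",
        fg="white" if is_user else "black",
    )
    # Packing to the right or left of the row is what creates the messenger layout
    bubble.pack(side="right" if is_user else "left")
    row.pack(fill="x", padx=8, pady=4)

# Usage inside the chat view (hypothetical widget names):
# add_bubble(chat_area, "How should the fullback be positioned?", is_user=True)
# add_bubble(chat_area, "Based on the uploaded documents...", is_user=False)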
The Full Data Flow, End to End
1. User uploads files via GUI
       │
       ▼
2. Files copied to Data/sources/source_YYYYMMDD_HHMMSS/
3. User clicks "Index Sources"
       │
       ▼
4. ingestion.discover_documents()
   ├─ Recursively scans all source folders
   └─ Loads PDFs via PyPDFLoader, text/md via TextLoader
5. ingestion.split_documents()
   ├─ RecursiveCharacterTextSplitter
   ├─ chunk_size=1000, chunk_overlap=100
   └─ Produces N chunks with start_index metadata
6. OpenAIEmbeddings converts each chunk → 1536-dim vector
7. Chroma stores (chunk_text, vector, metadata) on disk
8. User asks a question
       │
       ▼
9. Query text → OpenAIEmbeddings → query vector
10. Chroma similarity_search_with_relevance_scores(k=4)
    ├─ Returns top 4 chunks by cosine similarity
    └─ Filters out chunks with score < 0.4
11. Chunks injected into PROMPT_TEMPLATE as {context}
12. ChatOpenAI generates grounded answer
13. Answer + source file paths + relevance scores returned to UI
What I'd Do Differently Next Time
Async indexing: Right now the indexing step blocks the UI thread. For large document sets, this freezes the window. Moving rebuild_vector_store() into a background thread with a progress indicator would be the right fix (see the sketch after this list).
Incremental indexing: Every index rebuild deletes and recreates the entire collection. For large corpora, it would be much faster to hash each source batch folder and only embed changed files.
Metadata-filtered search: Chroma supports filtering by metadata at query time. With the start_index and source metadata already in place, you could easily add a document selector to the UI, letting users query only specific uploaded batches.
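For the async indexing item above, a minimal sketch of the threading approach (assuming the existing rebuild_vector_store() and a Tkinter root window with a status label; the names here are placeholders) could look like this:

import threading
import tkinter as tk

def start_indexing(root: tk.Tk, status_label: tk.Label) -> None:
    """Run the rebuild in a worker thread so the Tkinter main loop stays responsive."""
    status_label.config(text="Indexing...")

    def worker():
        rebuild_vector_store()  # the existing blocking call
        # Tkinter widgets should only be touched from the main thread,
        # so hand the UI update back via root.after()
        root.after(0, lambda: status_label.config(text="Indexing complete"))

    threading.Thread(target=worker, daemon=True).start()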
Stack Summary
| Component        | Tool                                          |
|------------------|-----------------------------------------------|
| LLM              | OpenAI GPT (via langchain_openai.ChatOpenAI)  |
| Embeddings       | langchain_openai.OpenAIEmbeddings             |
| Vector DB        | Chroma (local, persistent)                    |
| Document loading | PyPDFLoader, TextLoader (LangChain community) |
| Text splitting   | RecursiveCharacterTextSplitter                |
| GUI              | Tkinter + ttk + Pillow                        |
| Orchestration    | LangChain Core                                |
| Config           | python-dotenv                                 |
If you're learning RAG, the best advice I can give is: stop following tutorials and build something structured from the ground up. The moment you have to think about where your data lives, how it gets there, and what happens when something goes wrong, that's when you actually understand it.
As a famous brand once said: "Just do it!"
The full source is on GitHub. Drop questions in the comments; happy to dig into any part of the implementation.

