Difficulty

Beginner

RAG - Complete Practical Guide

By Codcompass Team·2026-05-17·65 min read

Introduction

Retrieval-Augmented Generation (RAG) is one of the main pillars of today's AI field. It is used mainly by large companies to manage and retrieve internal documents more effectively.
In this article I explain some RAG concepts, with code snippets to make them easier to grasp. Along the way I also cover common problems I faced when implementing my own RAG system, and the solutions I found.

What is RAG?

RAG (Retrieval Augmented Generation) is a system design pattern that combines:

Information retrieval (finding relevant knowledge)
Large Language Models (LLMs) (generating responses)

Instead of relying only on what the model learned during training, a RAG system retrieves external knowledge and injects it into the prompt.

Traditional LLM

Question
   ↓
Model Memory (Training Data)
   ↓
Answer

Problems:

knowledge can be outdated
hallucinations happen
cannot access private company data

RAG-based LLM

Question
   ↓
Retrieve Relevant Knowledge
   ↓
Add Context to Prompt
   ↓
LLM Generates Grounded Answer

This makes answers:

more accurate
grounded in documents
customizable
domain-specific

Why RAG?

LLMs are powerful but limited.

Common problems:

1. Hallucinations

The model invents facts.

Example:

Question:
Who founded Company X?

Answer:
John Smith.

The model can give this answer confidently even if John Smith never existed.

2. Knowledge Cutoff

Models only know what they were trained on.

They do not automatically know:

your PDFs
internal documentation
GitHub repositories
recent updates

3. Private Data

Businesses need AI that can work with their own:

internal docs
policies
tickets
codebases

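RAG addresses these problems by retrieving the relevant documents at question time and grounding the answer in them.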

Core Architecture

A RAG system usually contains:

Documents
Chunking system
Embedding model
Vector database
Retriever
Prompt constructor
LLM

Architecture:

Documents
   ↓
Chunking
   ↓
Embeddings
   ↓
Vector Database

User Question
   ↓
Question Embedding
   ↓
Similarity Search
   ↓
Relevant Chunks
   ↓
Prompt Construction
   ↓
LLM
   ↓
Answer

How RAG Works Step by Step

1. Documents

The system starts with raw documents.

Examples:

TXT files
PDFs
Markdown files
HTML pages
GitHub repos

Example text:

RAG systems use vector databases to retrieve
relevant information for LLMs.

2. Chunking

Documents are split into smaller sections.

Why?

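Embedding an entire book as a single vector is ineffective: one vector cannot represent every topic the book covers, and embedding models accept only a limited amount of input text.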

Instead:

Large Document
   ↓
Small Chunks

Example:

Chunk 1 → Intro
Chunk 2 → Embeddings
Chunk 3 → Pinecone

3. Embeddings

Every chunk becomes a vector.

Example:

"RAG systems use retrieval"

becomes:

[0.12, -0.77, 0.48, ...]

4. Store in Vector Database

Vectors are stored in:

Pinecone
Weaviate
Qdrant
Chroma
FAISS

5. User Question

Example:

What are embeddings?

The question becomes a vector too, using the same embedding model as the chunks.

6. Similarity Search

The vector database finds:

Most similar chunks

based on mathematical similarity.

7. Prompt Construction

Retrieved chunks are injected into the prompt.

Example:

Context:
Embeddings are vector representations.

Question:
What are embeddings?

8. LLM Generation

The LLM generates an answer using retrieved context.

Key Concepts and Definitions

1. Embedding

A numerical semantic representation of text.

Example:

"Machine learning"
↓
[0.12, -0.34, ...]

Purpose:

semantic understanding
similarity search

2. Vector

An ordered list of numbers.

Example:

[0.12, -0.55, 0.91]

3. Dimension

The number of values inside a vector.

Example:

768-dimensional vector

means:

768 numbers

Why it matters:

The dimension of your vector DB index must match the dimension of the vectors your embedding model produces.

Example:

nomic-embed-text → 768
Pinecone index → must be 768
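
A dimension mismatch usually only shows up as an error when you insert or query, so it can be worth checking up front. Below is a minimal sketch, assuming a hypothetical helper `check_dimension` and an expected size of 768 (the `nomic-embed-text` size given above). Neither the helper nor the constant is part of Pinecone or Ollama.

```python
EXPECTED_DIM = 768  # assumption: the dimension the index was created with


def check_dimension(vector, expected_dim=EXPECTED_DIM):
    """Raise early if an embedding does not fit the index (hypothetical helper)."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding has {len(vector)} values, index expects {expected_dim}"
        )
    return vector


# Stand-in vectors; real ones would come from the embedding model
check_dimension([0.0] * 768)       # passes silently
try:
    check_dimension([0.0] * 384)   # e.g. a vector from a different model
except ValueError as err:
    print(err)
```

Switching embedding models later typically means creating a new index and re-embedding every document, because old and new vectors are not comparable.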

4. Semantic Search

Search by meaning.

Not exact keywords.

Example:

Question:

How does memory work?

Can retrieve:

Agents retain context using memory systems.

5. Similarity Score

Measures closeness between vectors.

Higher score:

More relevant

Top-K

How many results to retrieve.

Example:

top_k=5

Means:

Return best 5 chunks
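
Picking the best k results from a list of scored chunks can be sketched with the standard library. The chunk IDs and scores below are made up, and `top_k_matches` is a hypothetical helper. In practice the vector database does this selection for you.

```python
import heapq


def top_k_matches(scored_chunks, k=5):
    """Return the k (chunk_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[1])


# Made-up similarity scores, purely for illustration
scores = [
    ("c1", 0.31), ("c2", 0.88), ("c3", 0.12),
    ("c4", 0.75), ("c5", 0.64), ("c6", 0.05),
]
print(top_k_matches(scores, k=3))
# [('c2', 0.88), ('c4', 0.75), ('c5', 0.64)]
```

A larger k gives the LLM more context to work with, but also a longer prompt and more chunks that may be irrelevant.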

6. Metadata

Extra information attached to vectors.

Example:

{
  "text": "Embeddings are vectors",
  "source": "notes.txt",
  "topic": "rag"
}

Embeddings Explained

Embeddings convert text into vectors that capture its meaning.

Texts with similar meanings end up close together.

Example:

"How to build AI agents"

and

"Creating autonomous agents"

become nearby vectors.

Generating Embeddings with Ollama

import ollama


def generate_embedding(text):
    response = ollama.embeddings(
        model="nomic-embed-text",
        prompt=text
    )

    return response["embedding"]

Test:

embedding = generate_embedding(
    "What is RAG?"
)

print(len(embedding))
print(embedding[:10])

The code snippets above come from a RAG project I implemented. You can view the source code here

Vector Databases

A vector database stores embeddings.

Traditional DB:

Search by exact values

Vector DB:

Search by similarity

Common vector DBs:

Pinecone
Qdrant
Weaviate
Chroma
FAISS

Chunking

Chunking means splitting documents into smaller pieces before embedding them.

1. Why Chunking Matters

Bad chunking = bad retrieval.

Example problem:

Chunk 1:
RAG systems use semantic

Chunk 2:
search through vectors

Meaning gets broken.
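
One way to avoid cutting a sentence in half is to split on sentence boundaries first and then group whole sentences into chunks. Below is a minimal sketch, assuming a hypothetical helper `chunk_by_sentence` and a deliberately naive regex for sentence endings (it will mis-split abbreviations such as "e.g."). It is an alternative to the character-based approach, not a replacement for it.

```python
import re


def chunk_by_sentence(text, max_chars=800):
    """Group whole sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept as its own oversized chunk.
    """
    # Naive split: sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "RAG systems use semantic search through vectors. Chunks are embedded one by one."
print(chunk_by_sentence(text, max_chars=60))
```

With the small `max_chars` used here, each sentence ends up in its own chunk instead of being cut mid-phrase.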

2. Character-Based Chunking

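def chunk_text(text,
               chunk_size=800,
               overlap=150):
    # The window advances by chunk_size - overlap, so overlap must be
    # smaller than chunk_size or the loop would never move forward.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and < chunk_size")

    chunks = []
    start = 0

    while start < len(text):

        end = start + chunk_size

        # Slicing past the end is safe: the last chunk is simply shorter
        chunk = text[start:end]
        chunks.append(chunk)

        # Stop once the end of the text is covered, otherwise the next
        # window would only repeat the tail of this chunk
        if end >= len(text):
            break

        start += chunk_size - overlap

    return chunks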

3. Overlap

Overlap preserves context across chunk boundaries: the end of one chunk is repeated at the start of the next.

Example:

Chunk 1 → 0-800
Chunk 2 → 650-1450

Overlap:

150 characters
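
The offsets above follow from simple arithmetic: each chunk starts `chunk_size - overlap` characters after the previous one. Below is a small sketch that computes the spans for a text of a given length, assuming a hypothetical helper `chunk_spans` that stops once a chunk reaches the end of the text.

```python
def chunk_spans(text_length, chunk_size=800, overlap=150):
    """Return the (start, end) character offsets of each chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    spans = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        spans.append((start, end))
        if end == text_length:
            break  # the text is fully covered
        start += step
    return spans


print(chunk_spans(2000))
# [(0, 800), (650, 1450), (1300, 2000)]
```

Characters 650 to 800 appear in both the first and the second chunk, which is the 150-character overlap.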

Similarity Search

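The vector database (Pinecone in my case) compares the query vector with the stored vectors.

Usually using:

Cosine Similarity

Measures the angle between two vectors, ignoring their length.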

Similar meaning:

High cosine score
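
To make the metric concrete, here is a minimal pure-Python sketch of cosine similarity: the dot product of two vectors divided by the product of their lengths. The function name and the choice to return 0.0 for a zero vector are assumptions made for this illustration. A vector database computes this internally with far more efficient code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction  -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # perpendicular   -> 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite        -> -1.0
```

The first example shows why length is ignored: `[2.0, 0.0]` is twice as long as `[1.0, 0.0]` but points the same way, so the score is still 1.0.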

Retrieval Pipeline

Example retrieval:

query_embedding = generate_embedding(query)

results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)

Explanation:

vector=query_embedding

Search using question vector.

top_k=5

Retrieve top 5 results.

include_metadata=True

Return each match's metadata, which is where the original chunk text is stored.

Prompt Augmentation

This is the "augmentation" in RAG.

We inject context.

Example:

context = "\n\n".join(
    match["metadata"]["text"]
    for match in results["matches"]
)
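
The join above uses every match, even weak ones. A possible refinement, sketched below, drops matches under a score threshold and labels each chunk with its source. `build_context` is a hypothetical helper, the 0.5 threshold is an arbitrary assumption to tune on your own data, and the hand-written `matches` list only imitates the shape of the query result used above.

```python
def build_context(matches, min_score=0.5):
    """Join the text of sufficiently similar matches into one context string.

    `matches` is assumed to be a list of dicts with "score" and
    "metadata" -> "text", the shape used in the snippet above.
    """
    kept = [m for m in matches if m["score"] >= min_score]
    # Label each chunk with its source so the answer can be traced back
    return "\n\n".join(
        f'[{m["metadata"].get("source", "unknown")}] {m["metadata"]["text"]}'
        for m in kept
    )


# Hand-written matches standing in for a real query response
matches = [
    {"score": 0.82, "metadata": {"text": "Embeddings are vector representations.", "source": "notes.txt"}},
    {"score": 0.21, "metadata": {"text": "Unrelated paragraph.", "source": "misc.txt"}},
]
print(build_context(matches))
```

If no match passes the threshold the context is empty, which is a useful signal to answer "I don't know" instead of calling the LLM with nothing to ground it.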

Prompt Example

prompt = f"""
You are a helpful assistant.

Answer ONLY using the context.

Context:
{context}

Question:
{query}

Answer:
"""

Generation Phase

Send the prompt to the LLM.
I used Mistral, running locally through Ollama.

response = ollama.chat(
    model="mistral",
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ]
)

print(response["message"]["content"])

Pinecone Concepts

Below are some Pinecone concepts that I used and that I hope you find helpful.

1. Index

Container of vectors.

Equivalent to:

Database table

2. Creating Index

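import os

from pinecone import Pinecone

# Assumption: the API key is stored in an environment variable with this name
API_KEY = os.environ["PINECONE_API_KEY"]

pc = Pinecone(api_key=API_KEY)

# dimension must match the embedding model (768 for nomic-embed-text)
pc.create_index(
    name="rag-demo",
    dimension=768,
    metric="cosine",
    spec={
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    }
)

# Handle used by the upsert, query and delete calls
index = pc.Index("rag-demo")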

3. Upsert

Insert/update vectors.

index.upsert(vectors=vectors)
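
The `vectors` argument is a list of records. One way to build it from chunks is sketched below, using the dictionary record form (`id`, `values`, `metadata`) that the Pinecone Python client accepts. The helper name `build_upsert_records`, the ID scheme, and the stand-in `fake_embed` function are assumptions for illustration. In a real pipeline you would pass the `generate_embedding` function shown earlier instead.

```python
def build_upsert_records(chunks, embed, source):
    """Turn text chunks into id/values/metadata records.

    `embed` is any function mapping a string to a list of floats.
    """
    records = []
    for i, chunk in enumerate(chunks):
        records.append({
            "id": f"{source}-{i}",   # assumption: source name + position as ID
            "values": embed(chunk),
            "metadata": {"text": chunk, "source": source},
        })
    return records


def fake_embed(text):
    """Stand-in embedding so the sketch runs without a model."""
    return [float(len(text)), 0.0, 0.0]


vectors = build_upsert_records(["first chunk", "second chunk"], fake_embed, "notes.txt")
print(vectors[0]["id"], vectors[0]["metadata"])
```

Because upsert overwrites records that share an ID, deterministic IDs like these let you re-run ingestion for the same file without creating duplicates.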

4. Query

Search vectors.

index.query(...)

5. Delete

Delete vectors. The call below removes every vector in the index's default namespace, so use it with care.

index.delete(delete_all=True)

Metadata in RAG

Store useful context.

Example:

metadata={
    "text": chunk,
    "source": "notes.txt",
    "section": "embeddings"
}

Useful later for:

filtering
citations
debugging
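
Filtering is the easiest of these to illustrate. The sketch below filters records locally by exact metadata match. `filter_by_metadata` is a hypothetical helper and the records are made up. Vector databases such as Pinecone can apply metadata filters on the server during a query, so check your database's documentation for its own filter syntax.

```python
def filter_by_metadata(records, **conditions):
    """Keep records whose metadata equals every given key/value pair."""
    return [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in conditions.items())
    ]


# Made-up records for illustration
records = [
    {"id": "a", "metadata": {"source": "notes.txt", "section": "embeddings"}},
    {"id": "b", "metadata": {"source": "notes.txt", "section": "chunking"}},
    {"id": "c", "metadata": {"source": "faq.md", "section": "embeddings"}},
]

print([r["id"] for r in filter_by_metadata(records, section="embeddings")])
# ['a', 'c']
print([r["id"] for r in filter_by_metadata(records, source="notes.txt", section="chunking")])
# ['b']
```

The same `source` field doubles as a citation: it tells the user which file an answer came from, and tells you which file to inspect when an answer is wrong.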

Best Practices

These are some best practices to follow when building your RAG system:

Retrieval quality > model quality
Use metadata
Keep chunks meaningful
Avoid tiny chunks
Re-index after document updates
Use overlap
Start simple before frameworks
Debug retrieval separately from generation
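
Debugging retrieval separately can be as simple as a handful of hand-written questions with the chunk you expect to come back. Below is a minimal sketch of that idea, assuming a hypothetical `hit_rate_at_k` function and a canned `fake_retrieve` standing in for the real embedding and query calls. The chunk IDs are invented.

```python
def hit_rate_at_k(test_cases, retrieve, k=5):
    """Fraction of questions whose expected chunk ID appears in the top k results.

    test_cases: list of (question, expected_chunk_id) pairs written by hand.
    retrieve:   function (question, k) -> list of chunk IDs, best first.
    """
    if not test_cases:
        return 0.0
    hits = sum(
        1 for question, expected_id in test_cases
        if expected_id in retrieve(question, k)[:k]
    )
    return hits / len(test_cases)


# Stand-in retriever with canned results, so no model or database is needed
canned = {
    "What are embeddings?": ["emb-1", "intro-0"],
    "Why use overlap?": ["intro-0", "emb-1"],
}


def fake_retrieve(question, k):
    return canned.get(question, [])[:k]


cases = [("What are embeddings?", "emb-1"), ("Why use overlap?", "chunk-2")]
print(hit_rate_at_k(cases, fake_retrieve, k=2))  # 0.5: one hit, one miss
```

If this number is low, changing the LLM or the prompt will not help. The fix lies in chunking, embeddings, or `top_k`.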

There are some caveats, however: real production RAG systems often add features that my simple personal RAG system does not have, such as:

authentication
streaming
caching
citations
reranking
hybrid search
observability
evaluation pipelines
vector versioning
document syncing

Glossary

| Term | Meaning |
| --- | --- |
| RAG | Retrieval-Augmented Generation |
| Embedding | Numerical representation of text |
| Vector | Ordered list of numbers |
| Dimension | Number of values in vector |
| Chunk | Small document section |
| Metadata | Extra vector information |
| Top-K | Number of retrieved results |
| Similarity Search | Finding closest vectors |
| Cosine Similarity | Vector closeness metric |
| Index | Pinecone vector collection |
| Upsert | Insert/update vector |
| Retrieval | Finding relevant knowledge |
| Generation | Producing final answer |
| Hallucination | Fabricated answer |
| Reranking | Reordering retrieved chunks |
| Hybrid Search | Semantic + keyword retrieval |

Conclusion

Dear reader, I hope my take on RAG helped you understand, even a little, how these systems work under the hood, from embedding to retrieval to generating the final response.
That is the essence of a RAG system.
