3. Embeddings
Every chunk becomes a vector.
Example:
"RAG systems use retrieval"
becomes:
[0.12, -0.77, 0.48, ...]
4. Store in Vector Database
Vectors are stored in:
- Pinecone
- Weaviate
- Qdrant
- Chroma
- FAISS
5. User Question
Example:
What are embeddings?
The question is converted into a vector too, using the same embedding model as the chunks.
6. Similarity Search
The vector database finds the most similar chunks, based on mathematical similarity between the question vector and the stored vectors.
7. Prompt Construction
Retrieved chunks are injected into the prompt.
Example:
Context:
Embeddings are vector representations.
Question:
What are embeddings?
8. LLM Generation
The LLM generates an answer using retrieved context.
Key Concepts and Definitions
1. Embedding
A numerical semantic representation of text.
Example:
"Machine learning"
→
[0.12, -0.34, ...]
Purpose:
- semantic understanding
- similarity search
2. Vector
An ordered list of numbers.
Example:
[0.12, -0.55, 0.91]
3. Dimension
The number of values inside a vector.
Example:
A 768-dimensional vector contains 768 numbers.
Why it matters:
The dimension of your vector DB index must match the dimension of the vectors your embedding model produces.
Example:
nomic-embed-text → 768
Pinecone index → must be 768
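A mismatch usually only surfaces as an error when you upsert or query, so it can help to check the length before sending anything to the database. Here is a minimal sketch of such a check. The names `check_dimension` and `EXPECTED_DIM` are hypothetical, and 768 is simply the value quoted above for nomic-embed-text.

```python
EXPECTED_DIM = 768  # assumption: the index dimension used in this article


def check_dimension(embedding, expected_dim=EXPECTED_DIM):
    """Raise ValueError if the embedding length differs from the index dimension."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"embedding has {len(embedding)} values, index expects {expected_dim}"
        )
    return embedding
```

Calling this right after generating an embedding keeps the failure close to its cause, for example when you swap embedding models but forget to recreate the index.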
4. Semantic Search
Search by meaning, not by exact keywords.
Example:
Question:
How does memory work?
Can retrieve:
Agents retain context using memory systems.
5. Similarity Score
Measures closeness between vectors.
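With a similarity metric such as cosine, a higher score means a more relevant chunk. With a distance metric such as Euclidean, it is the other way around: a lower value means closer.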
6. Top-K
How many results to retrieve.
Example:
top_k=5
Means:
Return the 5 best-matching chunks
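To make the idea concrete, here is a small sketch of top-k selection over chunks that already have a score. The vector database does this for you; `top_k` and the sample scores are made up for illustration.

```python
import heapq


def top_k(scored_chunks, k=5):
    """Return the k (chunk, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[1])


scored = [("chunk A", 0.91), ("chunk B", 0.42), ("chunk C", 0.77)]
print(top_k(scored, k=2))  # [('chunk A', 0.91), ('chunk C', 0.77)]
```

If fewer than k chunks exist, you simply get all of them back.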
7. Metadata
Extra information attached to vectors.
Example:
{
    "text": "Embeddings are vectors",
    "source": "notes.txt",
    "topic": "rag"
}
Embeddings Explained
Embeddings convert text into vectors that capture its meaning.
Texts with similar meanings end up close together.
Example:
"How to build AI agents"
and
"Creating autonomous agents"
become nearby vectors.
Generating Embeddings with Ollama
import ollama

def generate_embedding(text):
    response = ollama.embeddings(
        model="nomic-embed-text",
        prompt=text
    )
    return response["embedding"]
Test:
embedding = generate_embedding("What is RAG?")
print(len(embedding))   # vector dimension, 768 for nomic-embed-text
print(embedding[:10])   # first 10 values
The code snippets above are from a RAG project I implemented. You can view the source code here
Vector Databases
A vector database stores embeddings.
Traditional DB:
Search by exact values
Vector DB:
Search by similarity
Common vector DBs:
- Pinecone
- Qdrant
- Weaviate
- Chroma
- FAISS
Chunking
Chunking is splitting documents into smaller pieces that can be embedded and retrieved individually.
1. Why Chunking Matters
Bad chunking = bad retrieval.
Example problem:
Chunk 1:
RAG systems use semantic
Chunk 2:
search through vectors
Meaning gets broken: "semantic search" is split across two chunks, so neither chunk carries the full idea.
2. Character-Based Chunking
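def chunk_text(text, chunk_size=800, overlap=150):
    """Split text into chunk_size-character chunks that share overlap characters."""
    if overlap >= chunk_size:
        # Without this guard, start would never advance and the loop would not end.
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - overlap
    return chunks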
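Character-based chunking can still cut a word in half at a chunk boundary. Below is a sketch of one possible variant that packs whole words instead. It is not the chunker used in this project: `chunk_words` and `overlap_words` are hypothetical names, and it assumes splitting on whitespace is good enough for your text.

```python
def chunk_words(text, chunk_size=800, overlap_words=20):
    """Pack whole words into chunks of at most chunk_size characters.

    Assumption: whitespace splitting is acceptable for the input text.
    A single word longer than chunk_size becomes its own oversized chunk.
    """
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start
        length = 0
        # Add words while the chunk stays within chunk_size characters.
        while end < len(words):
            extra = len(words[end]) + (1 if end > start else 0)
            if length + extra > chunk_size and end > start:
                break
            length += extra
            end += 1
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        # Step back a few words so neighbouring chunks share context,
        # but always move forward by at least one word.
        start = max(end - overlap_words, start + 1)
    return chunks


print(chunk_words("RAG systems use semantic search through vectors",
                  chunk_size=24, overlap_words=1))
```

With these toy settings, "semantic" appears in both the first and the second chunk, so the phrase "semantic search" survives intact in at least one of them.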
3. Overlap
Overlap repeats the end of one chunk at the start of the next, which preserves context across chunk boundaries.
Example:
Chunk 1 → 0-800
Chunk 2 → 650-1450
Overlap:
150 characters
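The offsets follow from the step size: each chunk starts `chunk_size - overlap` characters after the previous one. A small sketch that computes the spans for a text of a given length; `chunk_spans` is a hypothetical helper that mirrors the loop in `chunk_text`.

```python
def chunk_spans(text_length, chunk_size=800, overlap=150):
    """Return (start, end) character offsets for each chunk."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [
        (start, min(start + chunk_size, text_length))
        for start in range(0, text_length, step)
    ]


print(chunk_spans(2000))
# [(0, 800), (650, 1450), (1300, 2000), (1950, 2000)]
```

Note the last span: it is only 50 characters long and lies entirely inside the previous chunk. A simple loop like this produces such tail chunks, so you may want to drop or merge them.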
Similarity Search
Pinecone compares the query vector with the stored vectors, usually using:
Cosine Similarity
Measures the angle between two vectors rather than their length.
Similar meaning:
High cosine score
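Cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a plain-Python sketch of the calculation, for understanding only; the vector database computes this for you, and far more efficiently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Two vectors pointing the same way score 1, perpendicular vectors score 0, and opposite vectors score -1, regardless of how long each vector is.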
Retrieval Pipeline
Example retrieval:
query_embedding = generate_embedding(query)
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)
Explanation:
vector=query_embedding
Search using the question vector.
top_k=5
Retrieve the top 5 results.
include_metadata=True
Return each match's metadata, which holds the original chunk text.
Prompt Augmentation
This is the "augmentation" in RAG.
We inject context.
Example:
context = "\n\n".join(
    match["metadata"]["text"]
    for match in results["matches"]
)
Prompt Example
prompt = f"""
You are a helpful assistant.
Answer ONLY using the context.
Context:
{context}
Question:
{query}
Answer:
"""
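The two steps above (joining the chunk texts and filling the template) can be wrapped into one function. This is a sketch, assuming each match has the `{"metadata": {"text": ...}}` shape used in the retrieval example. `build_prompt` and the `max_context_chars` budget are hypothetical additions, not part of the original project.

```python
def build_prompt(query, matches, max_context_chars=4000):
    """Join chunk texts from matches into a grounded prompt.

    Assumption: each match looks like {"metadata": {"text": ...}}.
    max_context_chars is an illustrative budget, not a model limit.
    """
    parts = []
    used = 0
    for match in matches:
        text = match["metadata"]["text"]
        if used + len(text) > max_context_chars:
            break  # stop before the context grows past the budget
        parts.append(text)
        used += len(text)
    context = "\n\n".join(parts)
    return (
        "You are a helpful assistant.\n"
        "Answer ONLY using the context.\n"
        f"Context:\n{context}\n"
        f"Question:\n{query}\n"
        "Answer:\n"
    )
```

Capping the context keeps an unusually long chunk from crowding the question out of the prompt. Because matches arrive best first, the cap drops the least relevant chunks.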
Generation Phase
Send the prompt to the LLM.
I used Mistral, running locally through Ollama:
response = ollama.chat(
    model="mistral",
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ]
)
print(response["message"]["content"])
Pinecone Concepts
Below are some Pinecone concepts I used that I hope you find helpful.
1. Index
A container of vectors, roughly equivalent to a database table.
2. Creating Index
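from pinecone import Pinecone

pc = Pinecone(api_key=API_KEY)  # API_KEY: your Pinecone API key, defined elsewhere
pc.create_index(
    name="rag-demo",
    dimension=768,  # must match the embedding model (nomic-embed-text)
    metric="cosine",
    spec={
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    }
)
# Handle used by the upsert, query and delete calls
index = pc.Index("rag-demo")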
3. Upsert
Insert/update vectors.
index.upsert(vectors=vectors)
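The `vectors` list has to be built from your chunks first. Here is a sketch of one way to do that, assuming the dictionary record format with `id`, `values` and `metadata` keys. `build_records` is a hypothetical helper, `embed` stands for any function that maps text to a list of floats (such as `generate_embedding`), and the `"<source>-<position>"` id scheme is just an illustrative convention.

```python
def build_records(chunks, embed, source):
    """Turn text chunks into records with id, values and metadata.

    Assumptions: embed(text) returns a list of floats, and ids of the
    form "<source>-<position>" are unique enough for the index.
    """
    records = []
    for position, chunk in enumerate(chunks):
        records.append(
            {
                "id": f"{source}-{position}",
                "values": embed(chunk),
                "metadata": {"text": chunk, "source": source},
            }
        )
    return records
```

Deterministic ids matter here: upserting the same document again overwrites the old vectors instead of creating duplicates.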
4. Query
Search vectors.
index.query(...)
5. Delete
Delete vectors.
index.delete(delete_all=True)
6. Metadata
Store useful context alongside each vector.
Example:
metadata={
    "text": chunk,
    "source": "notes.txt",
    "section": "embeddings"
}
Useful later for:
- filtering
- citations
- debugging
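As a sketch of the first two uses, the helpers below filter matches by source file and build a citation line. Both function names are hypothetical, and the code assumes matches shaped like `{"metadata": {"source": ...}}`. The filtering happens client-side, after retrieval.

```python
def filter_by_source(matches, source):
    """Keep only matches whose metadata came from the given source file."""
    return [m for m in matches if m["metadata"].get("source") == source]


def format_citations(matches):
    """List the distinct sources behind an answer, in first-seen order."""
    seen = []
    for m in matches:
        label = m["metadata"].get("source", "unknown")
        if label not in seen:
            seen.append(label)
    return "Sources: " + ", ".join(seen)
```

Appending the output of `format_citations` to the generated answer lets a reader check where each claim came from.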
Best Practices
These are some best practices to follow when building your RAG system:
- Retrieval quality > model quality
- Use metadata
- Keep chunks meaningful
- Avoid tiny chunks
- Re-index after document updates
- Use overlap
- Start simple before frameworks
- Debug retrieval separately from generation
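For the last point, one simple way to check retrieval on its own is a hit rate: for a handful of questions where you know which document should come back, count how often it appears in the top k. This is a sketch under assumptions: `retrieve(question, k)` is a hypothetical function returning matches shaped like `{"metadata": {"source": ...}}`, and each test case is a `(question, expected_source)` pair.

```python
def hit_rate(test_cases, retrieve, k=5):
    """Fraction of questions whose expected source appears in the top k.

    Assumptions: retrieve(question, k) returns a list of matches with a
    "source" entry in their metadata; test_cases holds
    (question, expected_source) pairs.
    """
    if not test_cases:
        return 0.0
    hits = 0
    for question, expected_source in test_cases:
        sources = [m["metadata"].get("source") for m in retrieve(question, k)]
        if expected_source in sources:
            hits += 1
    return hits / len(test_cases)
```

If this number is low, no prompt tweak or bigger model will fix the answers, which is why the retrieval step deserves its own test.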
However, there are some considerations: real production RAG systems often add features that my simple personal RAG system lacks, such as:
- authentication
- streaming
- caching
- citations
- reranking
- hybrid search
- observability
- evaluation pipelines
- vector versioning
- document syncing
Glossary
| Term | Meaning |
| --- | --- |
| RAG | Retrieval-Augmented Generation |
| Embedding | Numerical representation of text |
| Vector | Ordered list of numbers |
| Dimension | Number of values in vector |
| Chunk | Small document section |
| Metadata | Extra vector information |
| Top-K | Number of retrieved results |
| Similarity Search | Finding closest vectors |
| Cosine Similarity | Vector closeness metric |
| Index | Pinecone vector collection |
| Upsert | Insert/update vector |
| Retrieval | Finding relevant knowledge |
| Generation | Producing final answer |
| Hallucination | Fabricated answer |
| Reranking | Reordering retrieved chunks |
| Hybrid Search | Semantic + keyword retrieval |
Conclusion
Dear reader, I hope my take on RAG helped you understand, even a little, how these systems work under the hood: from embedding, to retrieval, to generating the final response.
That is the essence of a RAG system.