        try:
            self.index = self.client.get_index(self.index_name)
        except meilisearch.errors.MeilisearchApiError:
            # The index does not exist yet: create it and wait for the task to finish.
            task = self.client.create_index(self.index_name, {"primaryKey": "doc_id"})
            self.client.wait_for_task(task.task_uid)
            self.index = self.client.get_index(self.index_name)

        # Settings are applied on every start-up; re-applying identical values is harmless.
        self.index.update_settings({
            "searchableAttributes": ["headline", "body_text", "keywords"],
            "filterableAttributes": ["domain", "format_type", "version"],
            "rankingRules": [
                "words", "typo", "proximity", "attribute", "sort", "exactness"
            ],
            "typoTolerance": {
                "enabled": True,
                "minWordSizeForTypos": {"oneTypo": 5, "twoTypos": 9}
            }
        })

    def ingest(self, records: List[Dict]) -> None:
        for record in records:
            if "doc_id" not in record:
                # Derive a stable ID from the content so re-ingesting the same text reuses the same key.
                record["doc_id"] = hashlib.sha256(record["body_text"].encode()).hexdigest()[:12]

        # Send the whole batch once, after every record has an ID.
        task = self.index.add_documents(records, primary_key="doc_id")
        self.client.wait_for_task(task.task_uid)
        print(f"Successfully indexed {len(records)} records.")
```

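**Why this structure**: The `_ensure_index` method makes setup idempotent, so repeated deployments can re-run it safely: an existing index is reused and a missing one is created. The typo tolerance thresholds are set explicitly (`oneTypo: 5`, `twoTypos: 9`), which matches Meilisearch's documented defaults. Query words shorter than five characters, including most technical acronyms, must match exactly, while longer words stay resilient to common misspellings.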
### Step 2: Query Execution & Filtering
```python
class RetrievalEngine:
    def __init__(self, index: KnowledgeIndex, default_k: int = 5):
        self.index = index
        self.default_k = default_k

    def fetch_context(self, query: str, k: Optional[int] = None,
                      domain_filter: Optional[str] = None) -> List[Dict]:
        limit = k or self.default_k
        params = {
            "limit": limit,
            "attributesToRetrieve": ["doc_id", "headline", "body_text", "domain"],
            "attributesToHighlight": ["body_text"],
            "highlightPreTag": "<mark>",
            "highlightPostTag": "</mark>"
        }
        if domain_filter:
            # The value is interpolated as-is, so callers should pass trusted domain names only.
            params["filter"] = f"domain = '{domain_filter}'"
        # KnowledgeIndex exposes the underlying Meilisearch index as `.index`.
        response = self.index.index.search(query, params)
        return response.get("hits", [])
```

**Why this structure**: Separating retrieval from indexing lets each scale independently. The filter performs an exact string match on `domain`, which works because `domain` is declared in `filterableAttributes`; Meilisearch resolves such filters from its index structures rather than by scanning documents. Returning only the attributes the prompt needs reduces payload size and speeds up prompt assembly.
### Step 3: Context Assembly & Prompt Engineering
```python
class PromptAssembler:
    SYSTEM_INSTRUCTION = (
        "You are a technical reference assistant. Base your response strictly on the provided sources. "
        "Do not introduce external knowledge. Cite each source using its bracketed index. "
        "If the sources lack sufficient information, state that explicitly."
    )

    @classmethod
    def compile(cls, user_query: str, context_docs: List[Dict]) -> List[Dict]:
        formatted_sources = []
        for idx, doc in enumerate(context_docs, start=1):
            # Cap each source at 1,100 characters to bound the prompt size.
            truncated_body = doc["body_text"][:1100]
            formatted_sources.append(f"[{idx}] {doc['headline']}\n{truncated_body}")
        context_block = "\n\n---\n\n".join(formatted_sources)
        return [
            {"role": "system", "content": cls.SYSTEM_INSTRUCTION},
            {"role": "user", "content": f"Reference Material:\n{context_block}\n\n---\n\nUser Query: {user_query}"}
        ]
```

**Why this structure**: Explicit system instructions reduce model drift. Truncating each source to 1,100 characters caps its share of the token budget, although a character cut can end mid-sentence and does not correspond to a fixed token count (see pitfall 1). The `---` delimiter gives the model clear boundaries between sources.
### Step 4: Streaming Generation
```python
from typing import Dict, Generator, List

from openai import OpenAI


class GenerationClient:
    def __init__(self, api_key: str, base_url: str, model_id: str):
        self.sdk = OpenAI(api_key=api_key, base_url=base_url)
        self.model_id = model_id

    def stream_response(self, messages: List[Dict]) -> Generator[str, None, None]:
        response_stream = self.sdk.chat.completions.create(
            model=self.model_id,
            messages=messages,
            stream=True,
            temperature=0.15,
            max_tokens=900
        )
        for chunk in response_stream:
            # Some chunks carry no choices (for example a trailing usage chunk); skip them.
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta
            if delta.content:
                yield delta.content
```

**Why this structure**: A low temperature (0.15) prioritizes factual consistency over creativity. Streaming reduces perceived latency because tokens reach the user as they are generated rather than after the full completion. The generator pattern integrates with FastAPI, WebSocket, or CLI interfaces without changes to the client class.
### Step 5: Retrieval Validation
```python
def validate_retrieval(engine: RetrievalEngine, benchmark: List[Dict], k: int = 5) -> float:
    if not benchmark:
        raise ValueError("benchmark must contain at least one query")
    correct = 0
    for entry in benchmark:
        results = engine.fetch_context(entry["query"], k=k)
        retrieved_ids = {hit["doc_id"] for hit in results}
        # A query counts as a hit when its target appears anywhere in the top k.
        if entry["target_id"] in retrieved_ids:
            correct += 1
    hit_rate = correct / len(benchmark)
    print(f"Hit Rate @{k}: {hit_rate:.2%} ({correct}/{len(benchmark)})")
    return hit_rate


# Benchmark dataset
VALIDATION_SET = [
    {"query": "NIS 2 compliance thresholds for small enterprises", "target_id": "nis2-sme-041"},
    {"query": "ISO 27001 control implementation checklist", "target_id": "iso27k-impl-012"},
    {"query": "authorized penetration testing scope definition", "target_id": "pentest-scope-008"}
]
```

**Why this structure**: Validation is isolated from runtime logic so that benchmark queries and their expected targets never leak into the serving path. A fixed benchmark makes results comparable across runs, which enables regression testing when tuning ranking rules or chunking strategies.
## Pitfall Guide
### 1. Unbounded Context Truncation
**Explanation**: Blindly slicing text at fixed character counts ignores tokenization variance across models. This causes silent context loss or prompt overflow.

**Fix**: Implement token-aware slicing using `tiktoken` or the target model's tokenizer. Reserve 20% of the context window for the model's response and system instructions.
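
A minimal sketch of token-aware slicing, assuming a tokenizer that exposes `encode` and `decode` callables. The names `context_budget` and `truncate_to_tokens`, and the whitespace tokenizer used as a stand-in, are illustrative and not part of the pipeline above.

```python
from typing import Callable, List, Sequence


def context_budget(context_window: int, reserve_percent: int = 20) -> int:
    """Tokens available for retrieved sources after reserving a share of the
    window for the system instructions and the model's response."""
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be between 0 and 99")
    return context_window * (100 - reserve_percent) // 100


def truncate_to_tokens(
    text: str,
    max_tokens: int,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
) -> str:
    """Cut `text` on a token boundary instead of a character offset."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    return decode(tokens[:max_tokens])


# Stand-in tokenizer so the sketch runs without extra dependencies.
def whitespace_encode(text: str) -> List[str]:
    return text.split()


def whitespace_decode(tokens: Sequence[str]) -> str:
    return " ".join(tokens)


if __name__ == "__main__":
    budget = context_budget(8000)
    print(budget)
    print(truncate_to_tokens("alpha beta gamma delta", 2, whitespace_encode, whitespace_decode))
```

With `tiktoken` installed, the `encode` and `decode` methods of `tiktoken.get_encoding("cl100k_base")` could be passed in place of the whitespace stand-ins; which encoding matches the target model is something to confirm for your deployment.
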
### 2. Ignoring Query Normalization
**Explanation**: User queries contain stop words, casing inconsistencies, and conversational filler that degrade BM25 scoring.

**Fix**: Pre-process queries with a lightweight normalization pipeline: lowercase conversion, stop-word removal, and synonym expansion for domain-specific acronyms.
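
The three normalization steps could be sketched as follows. The stop-word list and the acronym map are small hypothetical examples; a real deployment would maintain its own.

```python
import re
from typing import Dict, FrozenSet, List

# Hypothetical defaults for illustration only.
STOP_WORDS: FrozenSet[str] = frozenset({
    "a", "an", "the", "is", "are", "what", "how", "do", "i", "for", "of",
    "to", "please", "can", "you", "tell", "me", "about",
})
ACRONYM_SYNONYMS: Dict[str, List[str]] = {
    "mfa": ["multi-factor authentication"],
    "sso": ["single sign-on"],
}


def normalize_query(
    query: str,
    stop_words: FrozenSet[str] = STOP_WORDS,
    synonyms: Dict[str, List[str]] = ACRONYM_SYNONYMS,
) -> str:
    # 1. Lowercase and tokenize, keeping internal hyphens and dots ("iso-27001", "v1.2").
    tokens = re.findall(r"[a-z0-9]+(?:[-.][a-z0-9]+)*", query.lower())
    # 2. Drop stop words, but never return an empty query.
    kept = [t for t in tokens if t not in stop_words] or tokens
    # 3. Append the expansion right after each known acronym.
    expanded: List[str] = []
    for token in kept:
        expanded.append(token)
        expanded.extend(synonyms.get(token, []))
    return " ".join(expanded)


if __name__ == "__main__":
    print(normalize_query("Can you tell me how to configure MFA?"))
```

Meilisearch also offers `stopWords` and `synonyms` as index settings, so part of this work may be better done on the engine side; the client-side version is shown because it can be unit-tested in isolation.
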
### 3. Hardcoded Filter Logic
**Explanation**: Embedding filter strings directly into retrieval functions creates maintenance debt and prevents dynamic query composition.

**Fix**: Abstract filter construction into a builder pattern. Validate filter syntax against Meilisearch's filter grammar before execution to catch malformed expressions early.
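
One possible shape for such a builder is sketched below. `FilterBuilder` is a hypothetical helper, not part of the Meilisearch SDK. It checks attribute names against an allowlist and quotes values, which covers only a small part of the filter grammar, and the backslash escaping of quotes is an assumption to verify against the filter expression reference.

```python
from typing import Iterable, List


class FilterBuilder:
    """Composes a filter expression from validated clauses joined with AND."""

    def __init__(self, filterable_attributes: Iterable[str]):
        self._allowed = frozenset(filterable_attributes)
        self._clauses: List[str] = []

    @staticmethod
    def _quote(value: str) -> str:
        # Assumption: backslash-escaping protects quotes inside a quoted value.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"

    def _check(self, attribute: str) -> None:
        # Reject attributes that were never declared filterable on the index.
        if attribute not in self._allowed:
            raise ValueError(f"'{attribute}' is not a filterable attribute")

    def equals(self, attribute: str, value: str) -> "FilterBuilder":
        self._check(attribute)
        self._clauses.append(f"{attribute} = {self._quote(value)}")
        return self

    def any_of(self, attribute: str, values: Iterable[str]) -> "FilterBuilder":
        self._check(attribute)
        quoted = ", ".join(self._quote(v) for v in values)
        if not quoted:
            raise ValueError("any_of requires at least one value")
        self._clauses.append(f"{attribute} IN [{quoted}]")
        return self

    def build(self) -> str:
        return " AND ".join(self._clauses)


if __name__ == "__main__":
    builder = FilterBuilder(["domain", "format_type", "version"])
    print(builder.equals("domain", "security").any_of("version", ["1.0", "2.0"]).build())
```

The resulting string could be assigned to `params["filter"]` in `fetch_context`, replacing the inline f-string.
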
### 4. Caching Raw LLM Outputs
**Explanation**: Caching full model responses assumes prompt stability. Minor context changes invalidate cached answers, leading to stale or contradictory outputs.

**Fix**: Cache retrieval results with short TTLs (5 to 15 minutes). For LLM outputs, cache only when the prompt hash, model version, and temperature remain identical. Use Redis with structured keys.
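
A sketch of a structured cache key built from the prompt, model version, and temperature, together with an in-memory TTL cache standing in for Redis so the example runs without a server. The key layout `llm:<model>:<digest>` and the `TTLCache` class are illustrative choices, not an established convention.

```python
import hashlib
import json
import time
from typing import Callable, Dict, List, Optional, Tuple


def llm_cache_key(messages: List[Dict[str, str]], model_id: str, temperature: float) -> str:
    """Any change to the prompt, model, or temperature yields a different key."""
    payload = json.dumps(
        {"messages": messages, "model": model_id, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"llm:{model_id}:{digest}"


class TTLCache:
    """In-memory stand-in for a key-value store with per-entry expiry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries are evicted lazily on read.
            del self._store[key]
            return None
        return value


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=600)
    key = llm_cache_key([{"role": "user", "content": "What is NIS 2?"}], "model-a", 0.15)
    cache.set(key, "cached answer")
    print(key[:20], cache.get(key))
```

With a Redis client, the same key could be written with an expiry and read back with a plain get; the in-memory class only mirrors that behavior for testing.
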
### 5. Skipping Retrieval Evaluation
**Explanation**: Deploying without measuring hit rate assumes lexical matching will perform adequately. This leads to silent degradation as corpora evolve.

**Fix**: Maintain a golden dataset of 50 to 100 query-target pairs. Run automated hit rate checks in CI/CD. Track Mean Reciprocal Rank (MRR) alongside hit rate to measure ranking quality.
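
Hit rate only records whether the target appears in the top k, while MRR also rewards ranking it higher. A minimal sketch, assuming a `search` callable that returns document IDs in ranked order (a hypothetical adapter around the retrieval engine):

```python
from typing import Callable, Dict, List, Sequence


def mean_reciprocal_rank(
    benchmark: Sequence[Dict[str, str]],
    search: Callable[[str], List[str]],
) -> float:
    """Average of 1/rank of the target document; a miss contributes 0."""
    if not benchmark:
        raise ValueError("benchmark must contain at least one query")
    total = 0.0
    for entry in benchmark:
        ranked_ids = search(entry["query"])
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id == entry["target_id"]:
                total += 1.0 / rank
                break
    return total / len(benchmark)


if __name__ == "__main__":
    # Canned rankings in place of a live index.
    canned = {"q1": ["a", "b"], "q2": ["x", "b", "c"], "q3": ["x"]}
    benchmark = [
        {"query": "q1", "target_id": "a"},  # rank 1 -> 1.0
        {"query": "q2", "target_id": "b"},  # rank 2 -> 0.5
        {"query": "q3", "target_id": "z"},  # miss   -> 0.0
    ]
    print(mean_reciprocal_rank(benchmark, lambda q: canned[q]))
```

Against the engine from Step 2, the adapter could be `lambda q: [hit["doc_id"] for hit in engine.fetch_context(q, k=5)]`.
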
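### 6. Overlooking Typo Tolerance Thresholds
**Explanation**: Typo settings that are too permissive let short technical terms (e.g., "API", "DNS") match near-identical words, causing false matches. Meilisearch's default thresholds (`oneTypo: 5`, `twoTypos: 9`) already require exact matches for words shorter than five characters, so the risk arises mainly when these values are lowered or when longer identifiers differ by a single character.

**Fix**: Keep `minWordSizeForTypos` at 5+ for one typo and 8+ for two typos. Disable typo tolerance on exact-match attributes like version numbers or IDs (the `disableOnAttributes` setting).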
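
A sketch of a typo tolerance payload that combines size thresholds with exact-match exemptions. `build_typo_settings` is a hypothetical helper; the range check reflects my understanding of the allowed values and should be confirmed against the Meilisearch settings reference, and the attribute and word names in the usage example are placeholders.

```python
from typing import Any, Dict, Iterable


def build_typo_settings(
    one_typo: int,
    two_typos: int,
    exact_attributes: Iterable[str] = (),
    exact_words: Iterable[str] = (),
) -> Dict[str, Any]:
    """Return a settings payload in the shape passed to `update_settings`."""
    # Assumed constraint: 0 <= oneTypo <= twoTypos <= 255.
    if not 0 <= one_typo <= two_typos <= 255:
        raise ValueError("expected 0 <= one_typo <= two_typos <= 255")
    return {
        "typoTolerance": {
            "enabled": True,
            "minWordSizeForTypos": {"oneTypo": one_typo, "twoTypos": two_typos},
            # Searchable attributes whose values must match exactly.
            "disableOnAttributes": list(exact_attributes),
            # Individual terms that must never be typo-corrected.
            "disableOnWords": list(exact_words),
        }
    }


if __name__ == "__main__":
    print(build_typo_settings(5, 9, exact_attributes=["keywords"], exact_words=["nis2"]))
```

The returned dictionary could be passed to `self.index.update_settings(...)` in the same way as the settings in Step 1.
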
### 7. Neglecting Chunk Boundary Integrity
**Explanation**: Splitting documents at arbitrary character boundaries severs sentences and breaks logical flow, degrading retrieval relevance.

**Fix**: Chunk at semantic boundaries (paragraphs, sections) with 10% overlap. Store `chunk_index` and `parent_doc_id` to enable full-document reconstruction when needed.
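
A minimal sketch of paragraph-level chunking with overlap. It assumes paragraphs are separated by blank lines and measures size in characters; the `doc_id` format and the function names are hypothetical. Overlap is carried as whole trailing paragraphs that fit the overlap budget, so a chunk can exceed `max_chars` when a single paragraph is larger than the limit.

```python
import re
from typing import Dict, List, Union

Chunk = Dict[str, Union[str, int]]


def _overlap_tail(paragraphs: List[str], budget: int) -> List[str]:
    """Trailing whole paragraphs whose combined length fits the overlap budget."""
    tail: List[str] = []
    used = 0
    for para in reversed(paragraphs):
        if used + len(para) > budget:
            break
        tail.insert(0, para)
        used += len(para)
    return tail


def chunk_document(parent_doc_id: str, text: str,
                   max_chars: int = 1000, overlap_percent: int = 10) -> List[Chunk]:
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    overlap_budget = max_chars * overlap_percent // 100
    bodies: List[str] = []
    current: List[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            # Close the chunk at a paragraph boundary and seed the next one with overlap.
            bodies.append("\n\n".join(current))
            current = _overlap_tail(current, overlap_budget)
        current.append(para)
    if current:
        bodies.append("\n\n".join(current))
    return [
        {
            "doc_id": f"{parent_doc_id}-{index:03d}",
            "parent_doc_id": parent_doc_id,
            "chunk_index": index,
            "body_text": body,
        }
        for index, body in enumerate(bodies)
    ]


if __name__ == "__main__":
    sample = "A" * 40 + "\n\n" + "B" * 8 + "\n\n" + "C" * 40
    for chunk in chunk_document("doc-1", sample, max_chars=60, overlap_percent=20):
        print(chunk["chunk_index"], len(chunk["body_text"]))
```

Sorting chunks by `chunk_index` within a `parent_doc_id` restores document order; overlapping paragraphs would need to be deduplicated when stitching the full text back together.
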
## Production Bundle
### Action Checklist
### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Internal technical docs with consistent terminology | BM25 + Meilisearch | Lexical matching outperforms semantic drift; sub-10ms latency | Low (CPU-only, no embedding pipeline) |
| Open-domain customer support with colloquial phrasing | Vector RAG + Embeddings | Semantic similarity handles vocabulary divergence | High (GPU inference, vector storage, model versioning) |
| Compliance/regulatory corpus requiring exact clause matching | BM25 + Strict Filtering | Precision outweighs recall; faceted filters enforce scope | Low (deterministic indexing, minimal compute) |
| Multi-lingual knowledge base with translation gaps | Hybrid BM25 + Cross-Encoder | BM25 handles source language; re-ranker bridges semantic gaps | Medium (re-ranker adds ~30ms/query, CPU-friendly) |
### Configuration Template
```yaml
# meilisearch_config.yaml
index:
  name: "technical_knowledge_base"
  primary_key: "doc_id"
  searchable:
    - "headline"
    - "body_text"
    - "keywords"
  filterable:
    - "domain"
    - "format_type"
    - "version"
  ranking_rules:
    - "words"
    - "typo"
    - "proximity"
    - "attribute"
    - "sort"
    - "exactness"
  typo_tolerance:
    enabled: true
    min_word_size_for_typos:
      one_typo: 5
      two_typos: 9
runtime:
  default_k: 5
  context_truncation_chars: 1100
  llm_temperature: 0.15
  llm_max_tokens: 900
  cache_ttl_seconds: 600
```
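
The template uses snake_case keys, while the settings payload in Step 1 uses Meilisearch's camelCase names. A sketch of the mapping, assuming the YAML has already been parsed into a dictionary (for example with a YAML loader, which is not imported here); `to_meilisearch_settings` is a hypothetical helper.

```python
from typing import Any, Dict


def to_meilisearch_settings(config: Dict[str, Any]) -> Dict[str, Any]:
    """Translate the `index` section of the template into a settings payload."""
    index_cfg = config["index"]
    typo_cfg = index_cfg["typo_tolerance"]
    sizes = typo_cfg["min_word_size_for_typos"]
    return {
        "searchableAttributes": list(index_cfg["searchable"]),
        "filterableAttributes": list(index_cfg["filterable"]),
        "rankingRules": list(index_cfg["ranking_rules"]),
        "typoTolerance": {
            "enabled": bool(typo_cfg["enabled"]),
            "minWordSizeForTypos": {
                "oneTypo": int(sizes["one_typo"]),
                "twoTypos": int(sizes["two_typos"]),
            },
        },
    }
```

The `runtime` section is not sent to Meilisearch; its values would feed the constructor arguments and constants used in Steps 2 to 4.
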
### Quick Start Guide
- **Launch Meilisearch**: Run `docker run -d -p 7700:7700 getmeili/meilisearch:latest` to start the retrieval backend.
- **Install Dependencies**: Execute `pip install meilisearch openai httpx tiktoken` to provision the Python SDKs and tokenizer.
- **Initialize Index**: Instantiate `KnowledgeIndex` with your host, API key, and index name. The class handles idempotent setup and schema configuration.
- **Ingest & Validate**: Load your JSONL corpus via `ingest()`, then run `validate_retrieval()` against your golden dataset to confirm the hit rate exceeds 85%.
- **Deploy Stream Endpoint**: Wrap `stream_response()` in a FastAPI or Flask route. Pass user queries through `fetch_context()` → `compile()` → generator, and pipe tokens to the client.
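
The last step chains retrieval, prompt assembly, and generation. A framework-neutral sketch of that chain, with the three stages injected as callables so it can be exercised without a running index or model; `answer_stream` is a hypothetical name.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Message = Dict[str, str]


def answer_stream(
    query: str,
    fetch_context: Callable[[str], List[Dict]],
    compile_prompt: Callable[[str, List[Dict]], List[Message]],
    stream_response: Callable[[List[Message]], Iterable[str]],
) -> Iterator[str]:
    """Retrieve sources, assemble the prompt, then relay tokens as they arrive."""
    context_docs = fetch_context(query)
    messages = compile_prompt(query, context_docs)
    yield from stream_response(messages)
```

With the classes from this guide, the call could look like `answer_stream(q, engine.fetch_context, PromptAssembler.compile, client.stream_response)`, and the resulting generator could be handed to a web framework's streaming response type.
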