Futbol Report: building a multi-model LLM comparison on AWS Lambda
Architecting Deterministic Multi-Model Evaluation Pipelines on Serverless Infrastructure
Current Situation Analysis
Engineering teams increasingly need to evaluate multiple large language models against identical prompts and grounding data. The goal is rarely to declare a single "winner," but to understand behavioral variance: format compliance, context filtering, output length, and hallucination resistance under production-like constraints.
This problem is systematically overlooked because most evaluation setups are either static benchmark suites (which lack live context) or ad-hoc scripts that run sequentially without state management. Developers assume that sending the same prompt to different models yields comparable outputs. In reality, live search APIs return different result sets within minutes due to ranking volatility. Without locking context per execution cycle, cross-model comparisons become statistically meaningless.
Serverless deployment introduces additional friction. AWS Lambda defaults to a 3-second timeout, which immediately terminates pipelines requiring sequential API calls. Python runtime mismatches break compiled C extensions at import time. Console uploads can appear to succeed while the deployed code stays unchanged, forcing engineers to rebuild and re-upload packages repeatedly. These aren't edge cases; they are standard infrastructure realities that derail evaluation pipelines before they produce a single useful report.
Data from production runs confirms the scale of the issue: live search queries return non-deterministic results within 30-minute windows. Model output length correlates directly with parameter count and pricing tier. Format adherence varies independently of raw capability. Without a deterministic context lock, unified inference routing, and proper state expiration, multi-model evaluation remains an exercise in noise.
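One way to put a number on that volatility is to compare the URL sets returned for the same query at two points in time. The sketch below uses Jaccard similarity; the function name and the sample URLs are hypothetical and are not part of the pipeline described here.

```python
def result_overlap(first: list[str], second: list[str]) -> float:
    """Jaccard similarity between two lists of result URLs (1.0 = identical sets)."""
    a, b = set(first), set(second)
    if not a and not b:
        return 1.0  # two empty result sets are trivially identical
    return len(a & b) / len(a | b)


# Hypothetical URLs returned for one query at two different times.
run_at_noon = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
run_at_half_past = ["https://example.com/b", "https://example.com/c", "https://example.com/d"]

overlap = result_overlap(run_at_noon, run_at_half_past)
```

An overlap well below 1.0 for the same query suggests that models queried at different times would be reading different evidence.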
WOW Moment: Key Findings
Running identical search context through four distinct models reveals consistent behavioral patterns that benchmark scores rarely capture. The following table summarizes observed metrics across a controlled evaluation cycle:
| Model | Context Utilization | Format Compliance | Hallucination Resistance | Cost-to-Detail Ratio |
|---|---|---|---|---|
| Claude | High (~90%) | Strict | High | Moderate |
| Kimi | High (~85%) | Moderate-High | High | Moderate |
| Qwen | Medium (~70%) | Moderate | High | Low-Moderate |
| Gemma | Low (~40%) | Low | High | Very High |
Why this matters: Raw capability scores don't dictate production suitability. A cheaper model may collapse context into concise summaries, which is ideal for notification digests but unacceptable for analytical reports. Format compliance often correlates with instruction-following training rather than parameter count. Hallucination resistance, however, responds uniformly to explicit grounding constraints across all architectures. This finding enables engineers to select models based on workload topology rather than marketing benchmarks.
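Format compliance and output length can be measured mechanically once the expected structure is known. The following is a minimal sketch, assuming the prompt asks for a fixed set of section headings; the heading names and the compliance_metrics helper are illustrative assumptions, not the scoring method behind the table above.

```python
import re

# Hypothetical section headings the prompt might request; not from the original pipeline.
REQUIRED_SECTIONS = ["Fixtures", "Transfers", "Managers", "Standings"]


def compliance_metrics(report: str, required: list[str] = REQUIRED_SECTIONS) -> dict:
    """Returns the share of required headings present and the word count."""
    found = [
        name for name in required
        # A heading counts if a line starts with optional '#' marks followed by the name.
        if re.search(rf"^\s*#*\s*{re.escape(name)}\b", report, re.IGNORECASE | re.MULTILINE)
    ]
    return {
        "format_compliance": len(found) / len(required),
        "missing_sections": [n for n in required if n not in found],
        "word_count": len(report.split()),
    }


sample = "## Fixtures\nTwo matches this week.\n## Transfers\nOne signing confirmed."
metrics = compliance_metrics(sample)
```

Running the same check over every model's output for one locked context gives comparable numbers per run, which is easier to track over time than qualitative labels.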
Core Solution
The pipeline follows a deterministic fetch-lock-infer-store-render cycle. Each component is chosen for statelessness, predictable latency, and minimal operational overhead.
Architecture Overview
EventBridge Scheduler (cron)
  ↓
AWS Lambda (Python 3.12)
  ├── Context Fetcher (Brave Search)
  ├── Context Locker (In-Memory Payload)
  └── Inference Router (OpenRouter)
  ↓
Redis (Vercel)
  ├── Report Store (90-day TTL)
  └── Vote Index (No TTL)
  ↓
Next.js (Vercel)
  ├── SSR Comparison Page
  └── Vote API Route
Step-by-Step Implementation
1. Context Acquisition & Locking
Live search APIs must be called once per execution cycle. The results are serialized into a single payload and held in memory before any model receives it. This eliminates ranking volatility as a variable.
import asyncio
import json
import os
from datetime import datetime, timezone

import httpx

SEARCH_QUERIES = [
    "recent football fixtures",
    "transfer market updates",
    "managerial changes",
    "league standings"
]

async def acquire_context() -> dict:
    """Fetches all search queries and locks them into a single context payload."""
    async with httpx.AsyncClient() as client:
        tasks = [
            client.get(
                "https://api.search.brave.com/res/v1/web/search",
                headers={"X-Subscription-Token": os.environ["BRAVE_API_KEY"]},
                params={"q": q, "count": 5}
            )
            for q in SEARCH_QUERIES
        ]
        # All queries run concurrently, exactly once per execution cycle.
        responses = await asyncio.gather(*tasks)
    locked_context = {
        "run_id": datetime.now(timezone.utc).isoformat(),
        "queries": SEARCH_QUERIES,
        # Non-200 responses are dropped, so this list can be shorter than "queries".
        "results": [r.json() for r in responses if r.status_code == 200]
    }
    return locked_context
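A locked payload is only useful if it can be shown that every model received the same bytes. One possible approach, sketched here under the assumption that the payload is JSON-serializable, is to store a digest of its canonical JSON form alongside each report. The context_fingerprint helper and the sample payloads are hypothetical.

```python
import hashlib
import json


def context_fingerprint(context: dict) -> str:
    """SHA-256 digest of the canonical JSON form of a context payload."""
    # Sorted keys and fixed separators make the digest independent of dict ordering.
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical payloads shaped like the dict returned by acquire_context().
locked = {"run_id": "2025-01-01T00:00:00+00:00", "queries": ["league standings"], "results": []}
reordered = {"results": [], "queries": ["league standings"], "run_id": "2025-01-01T00:00:00+00:00"}

same = context_fingerprint(locked) == context_fingerprint(reordered)
changed = context_fingerprint({**locked, "results": [{"url": "https://example.com"}]})
```

Two reports carrying the same fingerprint were generated from identical context; differing fingerprints flag a comparison that should be discarded.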
2. Unified Inference Routing
OpenRouter abstracts provider-specific authentication and endpoint routing. Adding or swapping a model requires modifying a single configuration dictionary rather than managing multiple SDKs.
# Model slugs change over time; verify each one against the OpenRouter model list.
MODEL_ROUTING = {
    "claude": "anthropic/claude-3.5-sonnet",
    "kimi": "moonshotai/kimi-k2",
    "qwen": "qwen/qwen-2.5-72b-instruct",
    "gemma": "google/gemma-2-27b-it"
}

SYSTEM_PROMPT = """
You are a sports analyst. Use ONLY facts present in the provided search results.
If data is sparse, prioritize transfer news and managerial updates.
Follow the requested output structure exactly.
"""

async def run_inference(context: dict) -> dict:
    """Sends identical context to all configured models concurrently."""
    # httpx defaults to a 5-second timeout, which is too short for most completions.
    # 120 seconds is an assumed value; tune it to the observed model latency.
    async with httpx.AsyncClient(timeout=120.0) as client:
        tasks = []
        for alias, model_id in MODEL_ROUTING.items():
            payload = {
                "model": model_id,
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": json.dumps(context, indent=2)}
                ],
                "temperature": 0.2,
                "max_tokens": 2048
            }
            tasks.append(
                client.post(
                    "https://openrouter.ai/api/v1/chat/completions",
                    headers={
                        "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
                        "HTTP-Referer": os.environ.get("SITE_URL", ""),
                        "X-Title": "Multi-Model Evaluator"
                    },
                    json=payload
                )
            )
        responses = await asyncio.gather(*tasks)
    # Models that returned a non-200 status are omitted from the result.
    return {
        alias: r.json()["choices"][0]["message"]["content"]
        for alias, r in zip(MODEL_ROUTING.keys(), responses)
        if r.status_code == 200
    }
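Without return_exceptions, asyncio.gather raises the first exception it encounters, so a single timed-out model discards the reports from every other model. The sketch below shows one way to keep partial results; fake_model_call is a hypothetical stand-in for the HTTP request and makes no network calls.

```python
import asyncio


async def fake_model_call(alias: str, fail: bool = False) -> str:
    """Stand-in for an HTTP call to a model endpoint (hypothetical, no network)."""
    await asyncio.sleep(0.01)
    if fail:
        raise TimeoutError(f"{alias} timed out")
    return f"report from {alias}"


async def run_all(aliases: list[str], failing: set[str]) -> tuple[dict, dict]:
    """Runs all calls concurrently and separates successes from failures."""
    results = await asyncio.gather(
        *(fake_model_call(a, fail=a in failing) for a in aliases),
        return_exceptions=True,  # a failed call yields its exception instead of raising
    )
    reports, errors = {}, {}
    for alias, result in zip(aliases, results):
        if isinstance(result, Exception):
            errors[alias] = repr(result)
        else:
            reports[alias] = result
    return reports, errors


reports, errors = asyncio.run(run_all(["claude", "kimi", "qwen", "gemma"], failing={"qwen"}))
```

Recording the errors per alias also makes it visible on the comparison page why a given model has no report for a run.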
3. State Management with TTL Strategy
Redis is optimal for this access pattern: small payloads, timestamped keys, and pure lookups. Reports expire after 90 days to control memory costs. Vote indices and run metadata persist indefinitely.
import os

import redis

CACHE_CLIENT = redis.Redis(
    host=os.environ["REDIS_HOST"],
    port=int(os.environ.get("REDIS_PORT", 6379)),
    password=os.environ.get("REDIS_PASSWORD"),
    decode_responses=True
)

def persist_run(run_id: str, reports: dict) -> None:
    """Stores model outputs with a 90-day expiration. Vote index remains permanent."""
    for model, content in reports.items():
        key = f"report:{run_id}:{model}"
        CACHE_CLIENT.setex(key, 7776000, content)  # 90 days in seconds
    # Run index for navigation
    CACHE_CLIENT.sadd("run_index", run_id)
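The 90-day constant and the key layout can be checked without a live Redis instance. This sketch derives the TTL from a timedelta and records writes in a small in-memory stand-in; RecordingClient and report_key are hypothetical helpers that mirror only the two commands used above.

```python
from datetime import timedelta

REPORT_TTL_SECONDS = int(timedelta(days=90).total_seconds())  # 90 * 24 * 3600


def report_key(run_id: str, model: str) -> str:
    """Builds the key layout used by persist_run: report:<run_id>:<model>."""
    return f"report:{run_id}:{model}"


class RecordingClient:
    """Hypothetical in-memory stand-in exposing the two Redis commands used above."""

    def __init__(self) -> None:
        self.expiring: dict[str, tuple[int, str]] = {}
        self.sets: dict[str, set[str]] = {}

    def setex(self, name: str, time: int, value: str) -> None:
        self.expiring[name] = (time, value)

    def sadd(self, name: str, *values: str) -> None:
        self.sets.setdefault(name, set()).update(values)


client = RecordingClient()
for model, content in {"claude": "report A", "gemma": "report B"}.items():
    client.setex(report_key("run-1", model), REPORT_TTL_SECONDS, content)
client.sadd("run_index", "run-1")
```

Note that the run index has no expiry, so after 90 days it will list runs whose reports are gone; the reader of the index has to tolerate missing keys.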
4. Server-Side Rendering & Voting
The comparison page is static between runs. Next.js server components fetch directly from Redis, eliminating client-side latency. Voting uses a lightweight API route with server-side deduplication.
// app/comparison/page.tsx
import { createClient } from 'redis'
import { notFound } from 'next/navigation'

const redis = createClient({
  url: process.env.REDIS_URL
})

export default async function ComparisonPage({ searchParams }: { searchParams: { run?: string } }) {
  const runId = searchParams.run
  if (!runId) notFound()

  // The client lives at module scope, so connect only when no socket is open yet.
  if (!redis.isOpen) await redis.connect()

  const models = ['claude', 'kimi', 'qwen', 'gemma']
  const reports = await Promise.all(
    models.map(m => redis.get(`report:${runId}:${m}`))
  )
  const validReports = reports.filter(Boolean)
  if (validReports.length === 0) notFound()

  return (
    <main className="grid grid-cols-1 md:grid-cols-2 gap-6 p-8">
      {models.map((model, i) => (
        <article key={model} className="border rounded-lg p-4">
          <h2 className="text-xl font-bold mb-2 capitalize">{model}</h2>
          <div className="prose whitespace-pre-wrap">{reports[i]}</div>
        </article>
      ))}
    </main>
  )
}
Architecture Rationale
- OpenRouter over direct APIs: Reduces authentication overhead, standardizes rate limiting, and enables instant model swaps without redeploying infrastructure.
- Redis over DynamoDB: Key-value lookups with per-key TTLs are native to Redis. DynamoDB also supports TTL, but it must be enabled per table on a designated attribute, and expired items are deleted asynchronously rather than at the exact expiry time.
- SSR over CSR: Data mutates every 72 hours. Client-side fetching adds unnecessary latency and complicates SEO. Server components render once per request.
- Concurrent inference: Sequential model calls extend Lambda runtime unnecessarily. asyncio.gather reduces total execution time by ~75% for four models with similar latency, while staying within memory limits.
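The ~75% figure follows from simple arithmetic: sequential time is the sum of the per-model latencies, while concurrent time is bounded by the slowest call. The sketch below works through that calculation; the latencies are hypothetical and the helper ignores network and scheduling overhead.

```python
def concurrency_saving(latencies_s: list[float]) -> float:
    """Fraction of wall-clock time saved by running all calls concurrently.

    Sequential time is the sum of latencies; concurrent time is the slowest call.
    """
    sequential = sum(latencies_s)
    concurrent = max(latencies_s)
    return 1 - concurrent / sequential


# Hypothetical per-model latencies in seconds.
equal = concurrency_saving([30.0, 30.0, 30.0, 30.0])
skewed = concurrency_saving([10.0, 10.0, 10.0, 90.0])
```

With four equally slow models the saving is 75%, but one slow outlier shrinks it sharply, so the real gain depends on the latency spread across the chosen models.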
Pitfall Guide
| Pitfall | Explanation | Fix |
|---|---|---|
| Runtime Version Drift | AWS may default to Python 3.14 while your build targets 3.12. Compiled C extensions like pydantic_core fail to load with No module named 'pydantic_core._pydantic_core'. | Pin runtime explicitly in deployment config. Build packages against the exact target version using Docker or CI runners matching the Lambda environment. |
| Cross-Platform Binary Contamination | Packaging tools like uv sometimes cache macOS wheels when targeting Linux. Lambda's Linux runtime cannot load non-ELF binaries. | Use pip install --platform manylinux2014_x86_64 --python-version 3.12 --only-binary=:all: --target ./package to force Linux-compatible artifacts. Verify with find ./package -name "*.so" -exec file {} +. |
| The 3-Second Timeout Trap | Lambda's default timeout terminates pipelines requiring multiple API calls. Error manifests as Task timed out with no stack trace. | Set timeout to 900 seconds (Lambda maximum). Monitor actual duration via CloudWatch and adjust memory accordingly, as CPU scales with allocated RAM. |
| Console Upload Caching | AWS Console sometimes matches SHA-256 hashes of previously uploaded ZIPs, refusing to update code even after file selection. | Use aws lambda update-function-code --function-name <name> --zip-file fileb://function.zip via CLI. It bypasses console caching and provides immediate feedback. |
| Context Volatility | Live search APIs return different rankings within minutes. Sending queries sequentially to different models breaks comparison validity. | Fetch all queries once, serialize into a single payload, and distribute that exact payload to every model. Never call search APIs per-model. |
| Secret Exposure via Echo | Pasting CLI output or console logs containing environment variables leaks API keys. Rotation becomes mandatory and costly. | Never echo secrets. Use AWS Secrets Manager or SSM Parameter Store. Integrate gitleaks or trufflehog into CI to scan commits before merge. |
| Sequential Inference Bottlenecks | Calling models one-by-one multiplies latency. A 4-model pipeline can exceed 4 minutes sequentially, risking timeout or throttling. | Use async concurrency (asyncio.gather or concurrent.futures). Ensure Lambda memory is scaled to 512MB+ to handle parallel HTTP connections. |
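For the timeout pitfall, raising the Lambda limit helps, but outbound HTTP calls can still outlive the function. One option is to derive the HTTP timeout from the time the invocation has left, which the Lambda context object reports through get_remaining_time_in_millis(). The helper, the safety margin, and FakeContext below are illustrative assumptions.

```python
def http_timeout_seconds(remaining_ms: int, safety_margin_s: float = 10.0, floor_s: float = 1.0) -> float:
    """Derives a per-request timeout that ends before the Lambda itself is killed."""
    budget = remaining_ms / 1000 - safety_margin_s
    return max(budget, floor_s)


class FakeContext:
    """Hypothetical stand-in for the Lambda context object passed to the handler."""

    def __init__(self, remaining_ms: int) -> None:
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self) -> int:
        return self._remaining_ms


# With the 3-second default there is no usable budget; with 900 seconds there is.
default_config = http_timeout_seconds(FakeContext(3_000).get_remaining_time_in_millis())
raised_config = http_timeout_seconds(FakeContext(900_000).get_remaining_time_in_millis())
```

A request that times out inside the function raises an ordinary exception with a stack trace, which is easier to diagnose than the bare Task timed out message.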
Production Bundle
Action Checklist
- Pin Python runtime to 3.12 in template.yaml or serverless.yml and verify with python --version in CI
- Replace sequential model calls with async concurrency to reduce runtime by ~75%
- Implement server-side vote deduplication using IP hashing or session tokens instead of localStorage
- Wire gitleaks and osv-scanner into GitHub Actions to prevent secret leaks and CVE propagation
- Add full-page content extraction (Firecrawl or readability parsers) to replace thin search snippets
- Configure CloudWatch Alarms for Lambda duration > 600s and error rate > 1%
- Rotate API keys quarterly and store them in AWS Secrets Manager with automatic rotation policies
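The vote deduplication item can be sketched as follows, assuming one vote per IP address per run. The salt, the VoteLedger class, and the in-memory set are hypothetical stand-ins; in production the set and counters would live in Redis.

```python
import hashlib


def voter_fingerprint(ip: str, run_id: str, salt: str) -> str:
    """Hashes the voter's IP with a per-deployment salt so raw IPs are never stored."""
    return hashlib.sha256(f"{salt}:{run_id}:{ip}".encode("utf-8")).hexdigest()


class VoteLedger:
    """In-memory stand-in for a Redis set of voters plus a counter per model."""

    def __init__(self, salt: str) -> None:
        self.salt = salt
        self.seen: set[str] = set()
        self.counts: dict[str, int] = {}

    def cast(self, ip: str, run_id: str, model: str) -> bool:
        """Records the vote and returns True, or returns False for a repeat voter."""
        fp = voter_fingerprint(ip, run_id, self.salt)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        self.counts[model] = self.counts.get(model, 0) + 1
        return True


ledger = VoteLedger(salt="example-salt")
first = ledger.cast("203.0.113.7", "run-1", "claude")
repeat = ledger.cast("203.0.113.7", "run-1", "kimi")
next_run = ledger.cast("203.0.113.7", "run-2", "kimi")
```

In Redis, SADD returns 1 when the member is new and 0 when it already exists, so the membership check and the insert can be a single atomic command.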
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Evaluating 2-4 models with identical prompts | OpenRouter unified routing | Single auth, standardized rate limits, instant model swaps | Moderate (aggregated billing) |
| Evaluating 10+ models or custom fine-tunes | Direct provider APIs | Avoids OpenRouter markup, enables provider-specific features | Higher (multiple accounts/billing) |
| State requires complex queries or filtering | DynamoDB with TTL streams | Supports secondary indexes, conditional writes, audit trails | Higher (read/write capacity) |
| State requires simple key lookups + expiration | Redis with SETEX | Native TTL, sub-millisecond reads, lower operational overhead | Lower (managed Redis or self-hosted) |
| Comparison page updates daily | Next.js SSR + Redis | Eliminates client latency, improves SEO, reduces bandwidth | Neutral (Vercel compute scales) |
| Comparison page updates hourly | Client-side fetching + SWR | Reduces server load, enables real-time updates | Higher (CDN + client compute) |
Configuration Template
# sam-template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  EvaluationFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/
      Handler: handler.main
      Runtime: python3.12
      Timeout: 900
      MemorySize: 1024
      Environment:
        Variables:
          BRAVE_API_KEY: !Ref BraveApiKey
          OPENROUTER_API_KEY: !Ref OpenRouterApiKey
          REDIS_HOST: !Ref RedisHost
          REDIS_PORT: 6379
          SITE_URL: !Ref SiteUrl
      Policies:
        - AWSLambdaBasicExecutionRole
      Events:
        ScheduleRun:
          Type: Schedule
          Properties:
            Schedule: rate(3 days)
            Enabled: true

Parameters:
  BraveApiKey:
    Type: String
    NoEcho: true
  OpenRouterApiKey:
    Type: String
    NoEcho: true
  RedisHost:
    Type: String
  SiteUrl:
    Type: String
    Default: "https://your-domain.com"
Quick Start Guide
- Initialize the project: Run sam init with the Python 3.12 runtime. Create a src/ directory for the Lambda handler and requirements.txt, and keep template.yaml at the project root so that CodeUri: src/ resolves correctly.
- Install dependencies: Execute pip install -r requirements.txt -t src/package/ using the manylinux2014_x86_64 platform flag. Verify all .so files target Linux.
- Deploy infrastructure: Run sam build && sam deploy --guided. Provide API keys when prompted. The stack will create the Lambda function, EventBridge schedule, and IAM roles automatically.
- Verify execution: Trigger the function manually via aws lambda invoke --function-name <stack-name>-EvaluationFunction-<id> output.json. Check CloudWatch Logs for context locking and concurrent inference completion.
- Connect frontend: Deploy the Next.js application to Vercel. Configure REDIS_URL and SITE_URL in environment variables. The comparison page will automatically render reports as they populate Redis.