Architecting Deterministic Multi-Model Evaluation Pipelines on Serverless Infrastructure

Current Situation Analysis

Engineering teams increasingly need to evaluate multiple large language models against identical prompts and grounding data. The goal is rarely to declare a single "winner," but to understand behavioral variance: format compliance, context filtering, output length, and hallucination resistance under production-like constraints.

This problem is systematically overlooked because most evaluation setups are either static benchmark suites (which lack live context) or ad-hoc scripts that run sequentially without state management. Developers assume that sending the same prompt to different models yields comparable outputs. In reality, live search APIs return different result sets within minutes due to ranking volatility. Without locking context per execution cycle, cross-model comparisons become statistically meaningless.

Serverless deployment introduces additional friction. AWS Lambda defaults to a 3-second timeout, which immediately terminates pipelines requiring sequential API calls. Python runtime mismatches break compiled C-extensions at import time. Console uploads can leave stale code in place when the new ZIP's SHA-256 hash matches a previously uploaded package, forcing engineers to rebuild packages repeatedly. These aren't edge cases; they are standard infrastructure realities that derail evaluation pipelines before they produce a single useful report.

Data from production runs confirms the scale of the issue: live search queries return non-deterministic results within 30-minute windows. Model output length correlates directly with parameter count and pricing tier. Format adherence varies independently of raw capability. Without a deterministic context lock, unified inference routing, and proper state expiration, multi-model evaluation remains an exercise in noise.

WOW Moment: Key Findings

Running identical search context through four distinct models reveals consistent behavioral patterns that benchmark scores rarely capture. The following table summarizes observed metrics across a controlled evaluation cycle:

Model	Context Utilization	Format Compliance	Hallucination Resistance	Cost-to-Detail Ratio
Claude	High (~90%)	Strict	High	Moderate
Kimi	High (~85%)	Moderate-High	High	Moderate
Qwen	Medium (~70%)	Moderate	High	Low-Moderate
Gemma	Low (~40%)	Low	High	Very High

Why this matters: Raw capability scores don't dictate production suitability. A cheaper model may collapse context into concise summaries, which is ideal for notification digests but unacceptable for analytical reports. Format compliance often correlates with instruction-following training rather than parameter count. Hallucination resistance, however, responds uniformly to explicit grounding constraints across all architectures. This finding enables engineers to select models based on workload topology rather than marketing benchmarks.
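
One way to make these behavioral metrics concrete is to score each report against the requested structure. The sketch below is only an illustration under stated assumptions: the section headings in `REQUIRED_SECTIONS` and the function name `score_report` are hypothetical, and the scoring is a plain substring check, not the method used to produce the table above.

```python
from dataclasses import dataclass

# Hypothetical section headings the prompt asks each model to produce.
REQUIRED_SECTIONS = ("Fixtures", "Transfers", "Managers", "Standings")


@dataclass
class ReportScore:
    word_count: int
    sections_found: int
    format_compliance: float  # fraction of required sections present, 0.0 to 1.0


def score_report(text: str, required: tuple[str, ...] = REQUIRED_SECTIONS) -> ReportScore:
    """Count words and check which required headings appear (case-insensitive)."""
    lowered = text.lower()
    found = sum(1 for section in required if section.lower() in lowered)
    return ReportScore(
        word_count=len(text.split()),
        sections_found=found,
        format_compliance=found / len(required) if required else 1.0,
    )


sample = "Fixtures\nTwo matches listed.\n\nTransfers\nOne signing reported."
score = score_report(sample)
print(score)
```

Running the same scorer over every model's output for a single run gives comparable length and structure numbers, provided all models received the same locked context.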

Core Solution

The pipeline follows a deterministic fetch-lock-infer-store-render cycle. Each component is chosen for statelessness, predictable latency, and minimal operational overhead.
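
The cycle can be sketched as a small orchestrator that a Lambda handler drives. This is a minimal illustration, not the original handler: `run_cycle`, the `fake_*` stages, and the in-memory `STORE` are hypothetical stand-ins so the block runs on its own. Rendering is left out because it happens in the frontend, not in Lambda.

```python
import asyncio

STORE: dict = {}  # stand-in for the report store


async def run_cycle(fetch, infer, store) -> dict:
    """Run one fetch -> lock -> infer -> store cycle."""
    context = await fetch()          # fetch once; the returned dict is the lock
    reports = await infer(context)   # every model receives the same context
    store(context["run_id"], reports)
    return {"run_id": context["run_id"], "models": sorted(reports)}


async def fake_fetch() -> dict:
    # Hypothetical stage: a real fetcher would call the search API here.
    return {"run_id": "2024-01-01T00:00:00+00:00", "results": ["locked snippet"]}


async def fake_infer(context: dict) -> dict:
    # Hypothetical stage: a real router would call each model here.
    count = len(context["results"])
    return {name: f"{name} saw {count} result(s)" for name in ("claude", "kimi")}


def fake_store(run_id: str, reports: dict) -> None:
    STORE[run_id] = reports


def main(event, lambda_context):
    """Lambda entry point: handlers are synchronous, so drive the loop here."""
    return asyncio.run(run_cycle(fake_fetch, fake_infer, fake_store))


summary = main({}, None)
print(summary)
```

Passing the stages in as arguments keeps the orchestrator testable without network access; swapping the fakes for real stage functions would not change `run_cycle`.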

Architecture Overview

EventBridge Scheduler (cron)
        ↓
   AWS Lambda (Python 3.12)
        ├── Context Fetcher (Brave Search)
        ├── Context Locker (In-Memory Payload)
        └── Inference Router (OpenRouter)
        ↓
   Redis (Vercel)
        ├── Report Store (90-day TTL)
        └── Vote Index (No TTL)
        ↓
   Next.js (Vercel)
        ├── SSR Comparison Page
        └── Vote API Route

Step-by-Step Implementation

1. Context Acquisition & Locking

Live search APIs must be called once per execution cycle. The results are serialized into a single payload and held in memory before any model receives it. This eliminates ranking volatility as a variable.

import asyncio  # required for asyncio.gather below
import httpx
import json
import os
from datetime import datetime, timezone

SEARCH_QUERIES = [
    "recent football fixtures",
    "transfer market updates",
    "managerial changes",
    "league standings"
]

async def acquire_context() -> dict:
    """Fetches all search queries and locks them into a single context payload."""
    async with httpx.AsyncClient() as client:
        tasks = [
            client.get(
                "https://api.search.brave.com/res/v1/web/search",
                headers={"X-Subscription-Token": os.environ["BRAVE_API_KEY"]},
                params={"q": q, "count": 5}
            )
            for q in SEARCH_QUERIES
        ]
        responses = await asyncio.gather(*tasks)
        
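    # Pair each query with its response so a failed request cannot shift results
    locked_context = {
        "run_id": datetime.now(timezone.utc).isoformat(),
        "queries": SEARCH_QUERIES,
        "results": [
            {"query": q, "data": r.json()}
            for q, r in zip(SEARCH_QUERIES, responses)
            if r.status_code == 200
        ]
    }
    return locked_context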
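
To check that every model really received the same locked payload, one option is to fingerprint the context and log the digest alongside each report. This is a sketch under assumptions: the `fingerprint` helper is hypothetical and not part of the original pipeline, and the sample dictionaries are invented for illustration.

```python
import hashlib
import json


def fingerprint(context: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON form of the context."""
    # sort_keys and fixed separators make the digest independent of key order
    canonical = json.dumps(
        context, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


locked = {
    "run_id": "2024-05-01T06:00:00+00:00",
    "queries": ["league standings"],
    "results": [{"title": "Table"}],
}
# Same content, different insertion order: the digest should not change.
reordered = {
    "results": [{"title": "Table"}],
    "queries": ["league standings"],
    "run_id": "2024-05-01T06:00:00+00:00",
}
# A changed result set (for example, a second search call) must change it.
drifted = {**locked, "results": [{"title": "Table (updated)"}]}

print(fingerprint(locked) == fingerprint(reordered))
print(fingerprint(locked) == fingerprint(drifted))
```

Storing the digest next to each report makes it possible to confirm later that two outputs being compared came from the same context.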

2. Unified Inference Routing

OpenRouter abstracts provider-specific authentication and endpoint routing. Adding or swapping a model requires modifying a single configuration dictionary rather than managing multiple SDKs.

# Model slugs change over time; verify each one against OpenRouter's model list.
MODEL_ROUTING = {
    "claude": "anthropic/claude-3.5-sonnet",
    "kimi": "moonshotai/kimi-k2",
    "qwen": "qwen/qwen-2.5-72b-instruct",
    "gemma": "google/gemma-2-27b-it"
}

SYSTEM_PROMPT = """
You are a sports analyst. Use ONLY facts present in the provided search results.
If data is sparse, prioritize transfer news and managerial updates.
Follow the requested output structure exactly.
"""

async def run_inference(context: dict) -> dict:
    """Sends identical context to all configured models concurrently."""
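    # httpx defaults to a 5-second timeout, which is too short for most completions
    async with httpx.AsyncClient(timeout=httpx.Timeout(120.0)) as client: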
        tasks = []
        for alias, model_id in MODEL_ROUTING.items():
            payload = {
                "model": model_id,
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": json.dumps(context, indent=2)}
                ],
                "temperature": 0.2,
                "max_tokens": 2048
            }
            tasks.append(
                client.post(
                    "https://openrouter.ai/api/v1/chat/completions",
                    headers={
                        "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
                        "HTTP-Referer": os.environ.get("SITE_URL", ""),
                        "X-Title": "Multi-Model Evaluator"
                    },
                    json=payload
                )
            )
        
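        # return_exceptions=True keeps one failed provider from aborting the whole run
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        return {
            alias: r.json()["choices"][0]["message"]["content"]
            for alias, r in zip(MODEL_ROUTING.keys(), responses)
            if isinstance(r, httpx.Response) and r.status_code == 200
        }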

3. State Management with TTL Strategy

Redis is optimal for this access pattern: small payloads, timestamped keys, and pure lookups. Reports expire after 90 days to control memory costs. Vote indices and run metadata persist indefinitely.

import redis
import os

CACHE_CLIENT = redis.Redis(
    host=os.environ["REDIS_HOST"],
    port=int(os.environ.get("REDIS_PORT", 6379)),
    password=os.environ.get("REDIS_PASSWORD"),
    decode_responses=True
)

def persist_run(run_id: str, reports: dict) -> None:
    """Stores model outputs with a 90-day expiration. Vote index remains permanent."""
    for model, content in reports.items():
        key = f"report:{run_id}:{model}"
        CACHE_CLIENT.setex(key, 7776000, content)  # 90 days in seconds
        
    # Run index for navigation
    CACHE_CLIENT.sadd("run_index", run_id)
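
The TTL literal and the key layout can be made explicit with two small helpers. This is a sketch: `report_key` and `parse_report_key` are hypothetical names, and only the `report:{run_id}:{model}` layout and the 90-day figure come from the code above. Because the run ID is an ISO-8601 timestamp, it contains colons itself, which matters when parsing a key back into its parts.

```python
from datetime import timedelta

# 90 days expressed in seconds, matching the literal passed to SETEX above.
REPORT_TTL_SECONDS = int(timedelta(days=90).total_seconds())


def report_key(run_id: str, model: str) -> str:
    """Build the key used for one model's report in a given run."""
    return f"report:{run_id}:{model}"


def parse_report_key(key: str) -> tuple[str, str]:
    """Recover (run_id, model) from a report key.

    ISO-8601 run IDs contain colons, so split the prefix from the left
    and the model alias from the right instead of splitting on every colon.
    """
    prefix, rest = key.split(":", 1)
    if prefix != "report":
        raise ValueError(f"not a report key: {key!r}")
    run_id, model = rest.rsplit(":", 1)
    return run_id, model


key = report_key("2024-05-01T06:00:00+00:00", "claude")
print(REPORT_TTL_SECONDS)     # 7776000
print(parse_report_key(key))  # ('2024-05-01T06:00:00+00:00', 'claude')
```

This parsing only works while model aliases contain no colons, which holds for the four aliases used here.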

4. Server-Side Rendering & Voting

The comparison page is static between runs. Next.js server components fetch directly from Redis, which avoids an extra client-side fetch after the page loads. Voting uses a lightweight API route with server-side deduplication.

// app/comparison/page.tsx
import { createClient } from 'redis'
import { notFound } from 'next/navigation'

const redis = createClient({
  url: process.env.REDIS_URL
})

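export default async function ComparisonPage({ searchParams }: { searchParams: { run?: string } }) {
  // The client is module-level and reused across requests; connecting twice throws
  if (!redis.isOpen) await redis.connect()
  const runId = searchParams.run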
  
  if (!runId) notFound()
  
  const models = ['claude', 'kimi', 'qwen', 'gemma']
  const reports = await Promise.all(
    models.map(m => redis.get(`report:${runId}:${m}`))
  )
  
  const validReports = reports.filter(Boolean)
  if (validReports.length === 0) notFound()
  
  return (
    <main className="grid grid-cols-1 md:grid-cols-2 gap-6 p-8">
      {models.map((model, i) => (
        <article key={model} className="border rounded-lg p-4">
          <h2 className="text-xl font-bold mb-2 capitalize">{model}</h2>
          <div className="prose whitespace-pre-wrap">{reports[i]}</div>
        </article>
      ))}
    </main>
  )
}

Architecture Rationale

OpenRouter over direct APIs: Reduces authentication overhead, standardizes rate limiting, and enables instant model swaps without redeploying infrastructure.
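Redis over DynamoDB: Key-value lookups with TTLs are native to Redis, and keys expire at the requested time. DynamoDB also supports TTL, but it must be enabled per table on a numeric epoch-timestamp attribute, and expired items are deleted asynchronously rather than at the exact expiry time.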
SSR over CSR: Data mutates every 72 hours. Client-side fetching adds unnecessary latency and complicates SEO. Server components render once per request.
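Concurrent inference: Sequential model calls extend Lambda runtime unnecessarily. With four models of similar latency, asyncio.gather reduces total inference time by up to ~75%, because the run takes as long as the slowest call rather than the sum of all four, while staying within memory limits.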
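
The ~75% figure follows from simple arithmetic: four calls of equal latency take four units sequentially and one unit concurrently. The sketch below simulates this with `asyncio.sleep` in place of real model calls; `fake_model_call` and the 0.1-second latency are assumptions chosen only to keep the demonstration fast.

```python
import asyncio
import time


async def fake_model_call(latency: float) -> str:
    """Simulate one inference request with a fixed latency."""
    await asyncio.sleep(latency)
    return "ok"


async def sequential(latencies: list[float]) -> list[str]:
    # Each call waits for the previous one to finish.
    return [await fake_model_call(latency) for latency in latencies]


async def concurrent(latencies: list[float]) -> list[str]:
    # All calls are in flight at once; total time tracks the slowest call.
    return await asyncio.gather(*(fake_model_call(latency) for latency in latencies))


def timed(runner, latencies: list[float]) -> float:
    start = time.perf_counter()
    asyncio.run(runner(latencies))
    return time.perf_counter() - start


LATENCIES = [0.1, 0.1, 0.1, 0.1]
seq_s = timed(sequential, LATENCIES)
con_s = timed(concurrent, LATENCIES)
print(f"sequential: {seq_s:.2f}s, concurrent: {con_s:.2f}s")
print(f"saving: {1 - con_s / seq_s:.0%}")
```

With unequal latencies the saving is `1 - max(latencies) / sum(latencies)`, so one slow provider reduces the benefit.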

Pitfall Guide

Pitfall	Explanation	Fix
Runtime Version Drift	AWS may default to Python 3.14 while your build targets 3.12. Compiled C-extensions like `pydantic_core` fail to load with `No module named 'pydantic_core._pydantic_core'`.	Pin runtime explicitly in deployment config. Build packages against the exact target version using Docker or CI runners matching the Lambda environment.
Cross-Platform Binary Contamination	Packaging tools like `uv` sometimes cache macOS wheels when targeting Linux. Lambda rejects non-ELF binaries.	Use `pip install -r requirements.txt --platform manylinux2014_x86_64 --implementation cp --python-version 3.12 --only-binary=:all: --target ./package` to force Linux-compatible artifacts. Verify with `find ./package -name "*.so" -exec file {} +`.
The 3-Second Timeout Trap	Lambda's default timeout terminates pipelines requiring multiple API calls. Error manifests as `Task timed out` with no stack trace.	Set timeout to 900 seconds (Lambda maximum). Monitor actual duration via CloudWatch and adjust memory accordingly, as CPU scales with allocated RAM.
Console Upload Caching	AWS Console sometimes matches SHA-256 hashes of previously uploaded ZIPs, refusing to update code even after file selection.	Use `aws lambda update-function-code --function-name <name> --zip-file fileb://function.zip` via CLI. It bypasses console caching and provides immediate feedback.
Context Volatility	Live search APIs return different rankings within minutes. Sending queries sequentially to different models breaks comparison validity.	Fetch all queries once, serialize into a single payload, and distribute that exact payload to every model. Never call search APIs per-model.
Secret Exposure via Echo	Pasting CLI output or console logs containing environment variables leaks API keys. Rotation becomes mandatory and costly.	Never echo secrets. Use AWS Secrets Manager or SSM Parameter Store. Integrate `gitleaks` or `trufflehog` into CI to scan commits before merge.
Sequential Inference Bottlenecks	Calling models one-by-one multiplies latency. A 4-model pipeline can exceed 4 minutes sequentially, risking timeout or throttling.	Use async concurrency (`asyncio.gather` or `concurrent.futures`). Ensure Lambda memory is scaled to 512MB+ to handle parallel HTTP connections.
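
The binary check from the Cross-Platform Binary Contamination row can also be scripted for CI. This is a minimal sketch, assuming a hypothetical helper named `find_non_elf_extensions`; it inspects only the four-byte ELF magic number, so it catches macOS (Mach-O) extensions but does not verify the CPU architecture. The temporary files below are fabricated stand-ins for real wheels.

```python
import tempfile
from pathlib import Path

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF binary


def find_non_elf_extensions(package_dir: str) -> list[str]:
    """Return relative paths of .so files that do not start with the ELF magic."""
    offenders = []
    for path in sorted(Path(package_dir).rglob("*.so")):
        with path.open("rb") as handle:
            if handle.read(4) != ELF_MAGIC:
                offenders.append(path.relative_to(package_dir).as_posix())
    return offenders


# Demonstration with fabricated files: one ELF header, one Mach-O header.
with tempfile.TemporaryDirectory() as tmp:
    ext_dir = Path(tmp) / "package" / "fake_ext"
    ext_dir.mkdir(parents=True)
    (ext_dir / "linux_ext.so").write_bytes(ELF_MAGIC + b"\x02\x01\x01")
    (ext_dir / "macos_ext.so").write_bytes(b"\xcf\xfa\xed\xfe" + b"\x07\x00")
    offenders = find_non_elf_extensions(str(Path(tmp) / "package"))

print(offenders)  # ['fake_ext/macos_ext.so']
```

A CI step could fail the build whenever the returned list is non-empty.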

Production Bundle

Action Checklist

Pin Python runtime to 3.12 in template.yaml or serverless.yml and verify with python --version in CI
Replace sequential model calls with async concurrency to reduce inference time by up to ~75% for four models
Implement server-side vote deduplication using IP hashing or session tokens instead of localStorage
Wire gitleaks and osv-scanner into GitHub Actions to prevent secret leaks and CVE propagation
Add full-page content extraction (Firecrawl or readability parsers) to replace thin search snippets
Configure CloudWatch Alarms for Lambda duration > 600s and error rate > 1%
Rotate API keys quarterly and store them in AWS Secrets Manager with automatic rotation policies
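
The vote deduplication item in the checklist above can be sketched as follows. The logic is shown in Python for consistency with the Lambda code, although the vote route itself lives in the Next.js app. Everything here is an assumption for illustration: the key names `voters:{run_id}` and `votes:{run_id}:{model}`, the `cast_vote` helper, and the `InMemorySetStore` class, which stands in for Redis and mirrors only the fact that `SADD` returns the number of newly added members.

```python
import hashlib
import hmac


class InMemorySetStore:
    """Stand-in for Redis sets so the sketch runs without a server."""

    def __init__(self) -> None:
        self._sets: dict[str, set[str]] = {}

    def sadd(self, key: str, member: str) -> int:
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0  # already present, nothing added
        members.add(member)
        return 1

    def scard(self, key: str) -> int:
        return len(self._sets.get(key, ()))


def voter_token(ip: str, secret: bytes) -> str:
    """Keyed hash of the IP so raw addresses are never stored."""
    return hmac.new(secret, ip.encode("utf-8"), hashlib.sha256).hexdigest()


def cast_vote(store, run_id: str, model: str, ip: str, secret: bytes) -> bool:
    """Record one vote per voter per run; return False for a repeat vote."""
    token = voter_token(ip, secret)
    if store.sadd(f"voters:{run_id}", token) == 0:
        return False
    store.sadd(f"votes:{run_id}:{model}", token)
    return True


store = InMemorySetStore()
secret = b"example-secret"  # hypothetical; load from a secret store in practice
first = cast_vote(store, "run-1", "claude", "203.0.113.7", secret)
repeat = cast_vote(store, "run-1", "kimi", "203.0.113.7", secret)
other = cast_vote(store, "run-1", "kimi", "203.0.113.8", secret)
print(first, repeat, other)
```

Using the set insertion itself as the deduplication check avoids a separate read before the write, which would leave a race window between two simultaneous requests.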

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Evaluating 2-4 models with identical prompts	OpenRouter unified routing	Single auth, standardized rate limits, instant model swaps	Moderate (aggregated billing)
Evaluating 10+ models or custom fine-tunes	Direct provider APIs	Avoids OpenRouter markup, enables provider-specific features	Higher (multiple accounts/billing)
State requires complex queries or filtering	DynamoDB with TTL and Streams	Supports secondary indexes, conditional writes, audit trails	Higher (read/write capacity)
State requires simple key lookups + expiration	Redis with `SETEX`	Native TTL, sub-millisecond reads, lower operational overhead	Lower (managed Redis or self-hosted)
Comparison page updates daily	Next.js SSR + Redis	Eliminates client latency, improves SEO, reduces bandwidth	Neutral (Vercel compute scales)
Comparison page updates hourly	Client-side fetching + SWR	Reduces server load, enables real-time updates	Higher (CDN + client compute)

Configuration Template

# sam-template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  EvaluationFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/
      Handler: handler.main
      Runtime: python3.12
      Timeout: 900
      MemorySize: 1024
      Environment:
        Variables:
          BRAVE_API_KEY: !Ref BraveApiKey
          OPENROUTER_API_KEY: !Ref OpenRouterApiKey
          REDIS_HOST: !Ref RedisHost
          REDIS_PORT: 6379
          SITE_URL: !Ref SiteUrl
      Policies:
        - AWSLambdaBasicExecutionRole
      Events:
        ScheduleRun:
          Type: Schedule
          Properties:
            Schedule: rate(3 days)
            Enabled: true

Parameters:
  BraveApiKey:
    Type: String
    NoEcho: true
  OpenRouterApiKey:
    Type: String
    NoEcho: true
  RedisHost:
    Type: String
  SiteUrl:
    Type: String
    Default: "https://your-domain.com"
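
Because the function depends on several environment variables from this template, it may help to fail fast at cold start when one is missing instead of failing midway through a run. The sketch below is an assumption, not part of the original handler: `missing_env` is a hypothetical helper, and the required names are taken from the template and the Python code above.

```python
from typing import Mapping

# Variables the pipeline cannot run without, as listed in the template.
REQUIRED_ENV = ("BRAVE_API_KEY", "OPENROUTER_API_KEY", "REDIS_HOST")


def missing_env(env: Mapping[str, str], required: tuple[str, ...] = REQUIRED_ENV) -> list[str]:
    """Return the required variable names that are absent or empty."""
    return [name for name in required if not env.get(name)]


# Example with a plain dict; in the handler this would be os.environ.
example_env = {"BRAVE_API_KEY": "abc", "OPENROUTER_API_KEY": "", "REDIS_PORT": "6379"}
missing = missing_env(example_env)
print(missing)  # ['OPENROUTER_API_KEY', 'REDIS_HOST']
```

Raising an error that lists the missing names, without printing any values, keeps secrets out of CloudWatch Logs.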

Quick Start Guide

Initialize the project: Run sam init with the Python 3.12 runtime. Keep template.yaml at the project root, create a src/ directory to match CodeUri, and place the Lambda handler and requirements.txt inside it.
Install dependencies: Execute pip install -r requirements.txt --platform manylinux2014_x86_64 --only-binary=:all: -t src/package/ so that pip downloads Linux wheels. Verify all .so files target Linux.
Deploy infrastructure: Run sam build && sam deploy --guided. Provide API keys when prompted. The stack will create the Lambda function, EventBridge schedule, and IAM roles automatically.
Verify execution: Trigger the function manually via aws lambda invoke --function-name <stack-name>-EvaluationFunction-<id> output.json. Check CloudWatch Logs for context locking and concurrent inference completion.
Connect frontend: Deploy the Next.js application to Vercel. Configure REDIS_URL and SITE_URL in environment variables. The comparison page will automatically render reports as they populate Redis.

Futbol Report — building a multi-model LLM comparison on AWS Lambda