Building a Serverless AI Model Evaluation Platform on AWS

Orchestrating Multi-Model AI Benchmarks with AWS Step Functions and Bedrock

Current Situation Analysis

Engineering teams evaluating Large Language Models (LLMs) face a scalability bottleneck when comparing outputs across multiple foundation models. The standard workflow—manually pasting prompts into model playgrounds, copying responses, and subjectively judging quality—is unrepeatable and unscalable. As organizations adopt multiple models (e.g., Llama 3, Claude, Nova) for different use cases, the need for automated, quantitative benchmarking becomes critical.

This problem is often overlooked because developers prioritize single-model integration over comparative analysis. However, model performance varies significantly by domain, and "best" is context-dependent. Without automation, teams cannot run the volume of experiments required to make data-driven model selection decisions.

The financial implications are also non-trivial. Foundation model APIs charge per token. A single benchmark run invoking three models on a long-form article can incur costs between $0.10 and $0.50. While this appears negligible, uncontrolled automated loops or lack of validation can cause costs to spike rapidly. Furthermore, latency compounds when models are invoked sequentially, making manual or poorly architected pipelines impractical for high-throughput evaluation.
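A back-of-the-envelope estimate makes the per-run figure easier to reason about. The sketch below multiplies token counts by per-1,000-token prices and sums across models. The helper names (`estimateCallCostUsd`, `estimateRunCostUsd`) and the shape of the pricing object are assumptions for illustration; any prices passed in should come from the current Bedrock pricing page rather than from this example.

```typescript
// Hypothetical pricing shape: USD per 1,000 tokens. Fill in real Bedrock prices.
interface ModelPricing {
  inputPer1k: number;
  outputPer1k: number;
}

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Cost of one model call: tokens / 1000 * price, summed over input and output.
function estimateCallCostUsd(usage: TokenUsage, pricing: ModelPricing): number {
  return (
    (usage.inputTokens / 1000) * pricing.inputPer1k +
    (usage.outputTokens / 1000) * pricing.outputPer1k
  );
}

// Cost of one benchmark run: the sum over every model invoked.
function estimateRunCostUsd(
  calls: Array<{ usage: TokenUsage; pricing: ModelPricing }>,
): number {
  return calls.reduce(
    (total, call) => total + estimateCallCostUsd(call.usage, call.pricing),
    0,
  );
}
```

Feeding the token counts reported by each model response into a helper like this gives a per-run cost that can be logged next to the results.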

WOW Moment: Key Findings

The architectural choice for parallel execution affects both cost and latency. Moving from sequential invocations to parallel calls inside a single Lambda execution reduces wall-clock latency from the sum of the model response times to roughly that of the slowest model, and cuts Lambda invocations from N to one. Additionally, adopting a unified API interface reduces code complexity and maintenance overhead when swapping models.

Execution Strategy	Latency Profile	Lambda Invocation Cost	Error Resilience	Implementation Complexity
Sequential Lambdas	Sum of all model times	High (N invocations)	Low (Fail-fast)	Low
Step Functions Parallel	Max of model times	Medium (N invocations)	High	Medium
In-Lambda Parallelism	Max of model times	Low (1 invocation)	Medium (Timeout risk)	Medium
In-Lambda `allSettled`	Max of model times	Low (1 invocation)	High (Partial success)	Medium

Why this matters: Using Promise.allSettled within a single Lambda function provides the latency benefits of parallelism while minimizing invocation costs. Unlike Promise.all, which fails the entire batch if one model errors, allSettled allows partial results, ensuring that a timeout from one model does not discard successful inferences from others. This approach also enables shared memory for the input payload, avoiding repeated S3 reads or network transfers for each model call.
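The difference between the two combinators can be seen with simulated model calls. In this sketch, `fakeModelCall` and `runAllSettled` are hypothetical stand-ins for real Bedrock calls, and the delays are arbitrary.

```typescript
// Simulated model call: resolves after delayMs, or rejects if shouldFail is set.
function fakeModelCall(modelId: string, delayMs: number, shouldFail = false): Promise<string> {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      if (shouldFail) reject(new Error(`${modelId} timed out`));
      else resolve(`${modelId} output`);
    }, delayMs);
  });
}

interface Outcome {
  modelId: string;
  output?: string;
  error?: string;
}

// Runs every call concurrently and keeps one outcome per model,
// whether that call succeeded or failed.
async function runAllSettled(
  calls: Array<{ modelId: string; run: () => Promise<string> }>,
): Promise<Outcome[]> {
  const settled = await Promise.allSettled(calls.map((c) => c.run()));
  return settled.map((res, i) =>
    res.status === 'fulfilled'
      ? { modelId: calls[i].modelId, output: res.value }
      : { modelId: calls[i].modelId, error: String(res.reason?.message ?? res.reason) },
  );
}
```

With one succeeding and one failing call, `runAllSettled` returns an output for the first and an error for the second, whereas `Promise.all` over the same calls rejects and the successful output is lost.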

Core Solution

The solution is a serverless orchestration pipeline built on AWS Step Functions, AWS Lambda, and Amazon Bedrock. The system accepts an evaluation request, validates inputs, runs parallel inferences across specified models, scores outputs using a dedicated judge model, and generates a structured report.

Architecture Overview

API Gateway: Exposes a REST endpoint for triggering evaluations.
Step Functions: Orchestrates the workflow, managing state transitions and error handling.
Lambda Functions: Execute discrete steps: validation, parallel inference, scoring, and report generation.
Amazon Bedrock: Provides access to foundation models via the Converse API.
Amazon S3: Stores intermediate checkpoints and final reports.

Implementation Details

We use TypeScript for the Lambda functions to leverage strong typing and the AWS SDK v3. The pipeline is defined as a Step Functions state machine.

Step 1: Input Validation

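Validation occurs before any model invocation to prevent wasted costs. The schema checks for required fields, bounds the number of models, and caps the article length in characters as a proxy for token limits. Model IDs are only checked to be strings here, so validating them against a list of enabled models is a sensible addition.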

import { z } from 'zod';

const EvaluationSchema = z.object({
  article: z.string().min(100).max(50000),
  models: z.array(z.string()).min(1).max(5),
  prompt: z.string().min(10),
  scoringCriteria: z.object({
    accuracy: z.number().min(1).max(10),
    engagement: z.number().min(1).max(10),
    structure: z.number().min(1).max(10),
  }),
});

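export const handler = async (event: any) => {
  try {
    // API Gateway proxy events deliver the body as a JSON string, while a
    // Step Functions task receives the payload object directly.
    const payload =
      typeof event.body === 'string' ? JSON.parse(event.body) : (event.body ?? event);
    const validated = EvaluationSchema.parse(payload);
    return { statusCode: 200, body: JSON.stringify(validated) };
  } catch (error) {
    if (error instanceof z.ZodError) {
      return { statusCode: 400, body: JSON.stringify({ errors: error.issues }) };
    }
    if (error instanceof SyntaxError) {
      // JSON.parse failed: the request body was not valid JSON.
      return { statusCode: 400, body: JSON.stringify({ errors: ['Body is not valid JSON'] }) };
    }
    throw error;
  }
};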
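Checks for model IDs and token budgets can run as a separate pre-flight step with no extra dependencies. The following is a minimal sketch: the contents of `ALLOWED_MODELS`, the four-characters-per-token heuristic, and the `preflightCheck` helper are all assumptions, and a real tokenizer would give more accurate counts.

```typescript
// Hypothetical allow-list; populate it with the model IDs enabled in your account.
const ALLOWED_MODELS = new Set<string>([
  'meta.llama3-70b-instruct-v1:0',
  'anthropic.claude-3-haiku-20240307-v1:0',
]);

// Rough heuristic: about 4 characters per token for English text.
// This is an approximation, not a tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface PreflightResult {
  ok: boolean;
  errors: string[];
}

// Collects every problem instead of stopping at the first, so the caller
// can report them all in a single 400 response.
function preflightCheck(article: string, models: string[], maxInputTokens: number): PreflightResult {
  const errors: string[] = [];
  for (const id of models) {
    if (!ALLOWED_MODELS.has(id)) errors.push(`Unknown model ID: ${id}`);
  }
  const tokens = estimateTokens(article);
  if (tokens > maxInputTokens) {
    errors.push(`Estimated ${tokens} tokens exceeds limit of ${maxInputTokens}`);
  }
  return { ok: errors.length === 0, errors };
}
```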

Step 2: Parallel Model Invocation

This Lambda uses Promise.allSettled to invoke multiple Bedrock models concurrently. We use the Bedrock Converse API, which provides a unified interface across all supported models, simplifying code maintenance.

import { BedrockRuntimeClient, ConverseCommand } from '@aws-sdk/client-bedrock-runtime';

const client = new BedrockRuntimeClient({ region: process.env.AWS_REGION });

interface ModelResult {
  modelId: string;
  output?: string;
  usage?: { inputTokens: number; outputTokens: number };
  error?: string;
}

export const handler = async (event: any): Promise<ModelResult[]> => {
  const { article, models, prompt } = event;

  const promises = models.map(async (modelId: string) => {
    try {
      const command = new ConverseCommand({
        modelId,
        messages: [{ role: 'user', content: [{ text: article }] }],
        system: [{ text: prompt }],
      });

      const response = await client.send(command);
      
      return {
        modelId,
        output: response.output?.message?.content?.[0]?.text || '',
        usage: {
          inputTokens: response.usage?.inputTokens || 0,
          outputTokens: response.usage?.outputTokens || 0,
        },
      };
    } catch (err: any) {
      return { modelId, error: err.message };
    }
  });

  const results = await Promise.allSettled(promises);

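  // The inner try/catch already converts errors into results, so the rejected
  // branch is a safety net; stringify the reason to match ModelResult.error.
  return results.map((res, index) => {
    if (res.status === 'fulfilled') return res.value;
    return { modelId: models[index], error: String(res.reason?.message ?? res.reason) };
  });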
};

Rationale:

Converse API: Eliminates the need for model-specific request/response parsing. Switching from meta.llama3-70b-instruct-v1:0 to anthropic.claude-3-sonnet-20240229-v1:0 requires only changing the modelId.
Promise.allSettled: Ensures that if one model times out or returns an error, the results from other models are preserved. This is critical for benchmarking reliability.
Single Lambda Execution: Reduces cost compared to invoking separate Lambdas per model and allows the input article to reside in memory, avoiding redundant I/O.
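Because partial results are expected, the next step has to separate usable outputs from failures before scoring. One way to do that is sketched below; `buildScoringInput` and the `failures` field are hypothetical, while `outputs` and `criteria` match the fields the scoring handler reads from its event.

```typescript
interface ModelResult {
  modelId: string;
  output?: string;
  usage?: { inputTokens: number; outputTokens: number };
  error?: string;
}

interface ScoringInput {
  outputs: ModelResult[];  // successful inferences, to be scored
  failures: ModelResult[]; // failed or empty inferences, kept for the report
  criteria: Record<string, number>;
}

// Splits inference results so the judge is only invoked for models
// that actually produced text.
function buildScoringInput(
  results: ModelResult[],
  criteria: Record<string, number>,
): ScoringInput {
  const isUsable = (r: ModelResult) =>
    !r.error && typeof r.output === 'string' && r.output.length > 0;
  return {
    outputs: results.filter(isUsable),
    failures: results.filter((r) => !isUsable(r)),
    criteria,
  };
}
```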

Step 3: Scoring with a Judge Model

Outputs are scored by a separate "judge" model to avoid self-evaluation bias. The judge evaluates each output against the defined criteria.

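import { BedrockRuntimeClient, ConverseCommand } from '@aws-sdk/client-bedrock-runtime';

const client = new BedrockRuntimeClient({ region: process.env.AWS_REGION });

// Full Bedrock model ID for Claude 3 Haiku; confirm it is enabled in your region.
const SCORING_MODEL = 'anthropic.claude-3-haiku-20240307-v1:0';

const scoringPrompt = (criteria: any) => `
  Evaluate the following summary based on these criteria:
  - Accuracy (${criteria.accuracy}/10): Faithfulness to source.
  - Engagement (${criteria.engagement}/10): Listener appeal.
  - Structure (${criteria.structure}/10): Organization for audio.
  Return only JSON: { "accuracy": number, "engagement": number, "structure": number }
`;

export const handler = async (event: any) => {
  const { outputs, criteria } = event;

  const scoringPromises = outputs.map(async (output: any) => {
    // Skip models whose inference failed; there is nothing to score.
    if (!output.output) {
      return { modelId: output.modelId, scores: null, error: output.error ?? 'No output to score' };
    }

    const command = new ConverseCommand({
      modelId: SCORING_MODEL,
      // Instructions go in the system prompt and the candidate text in a single
      // user message, because Bedrock can reject consecutive user turns.
      system: [{ text: scoringPrompt(criteria) }],
      messages: [{ role: 'user', content: [{ text: output.output }] }],
    });

    const response = await client.send(command);
    const scoreText = response.output?.message?.content?.[0]?.text || '{}';

    try {
      const scores = JSON.parse(scoreText);
      return { modelId: output.modelId, scores };
    } catch {
      return { modelId: output.modelId, scores: null, error: 'Invalid JSON from judge' };
    }
  });

  // Unwrap the settled results so the next step receives plain, serializable objects.
  const settled = await Promise.allSettled(scoringPromises);
  return settled.map((res, index) =>
    res.status === 'fulfilled'
      ? res.value
      : { modelId: outputs[index].modelId, scores: null, error: String(res.reason?.message ?? res.reason) },
  );
};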

Rationale:

Separate Judge Model: Using a distinct model (e.g., Claude Haiku) for scoring reduces the bias that occurs when a model evaluates its own output. The effect is strongest when the judge is not one of the models being benchmarked.
JSON Enforcement: The prompt requests JSON output, which is parsed programmatically. Error handling captures malformed responses.
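Judge models sometimes wrap the JSON in extra prose or return values outside the requested range, so a stricter parser can help. The sketch below is one possible approach: `parseJudgeScores` is a hypothetical helper, and the 1 to 10 range is an assumption based on the "/10" scale used in the scoring prompt.

```typescript
interface Scores {
  accuracy: number;
  engagement: number;
  structure: number;
}

// Takes the text between the first "{" and the last "}" in the judge's reply,
// parses it, and checks that all three scores are numbers from 1 to 10.
// Returns null when no valid score object can be recovered.
function parseJudgeScores(text: string): Scores | null {
  const start = text.indexOf('{');
  const end = text.lastIndexOf('}');
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(text.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof parsed !== 'object' || parsed === null) return null;

  const obj = parsed as Record<string, unknown>;
  const keys = ['accuracy', 'engagement', 'structure'] as const;
  for (const key of keys) {
    const value = obj[key];
    if (typeof value !== 'number' || !Number.isFinite(value) || value < 1 || value > 10) {
      return null;
    }
  }
  return {
    accuracy: obj.accuracy as number,
    engagement: obj.engagement as number,
    structure: obj.structure as number,
  };
}
```

Returning `null` rather than throwing keeps a single malformed judge reply from failing the whole scoring batch.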

Step 4: Report Generation

The final step aggregates results and generates an HTML report stored in S3.

import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3Client = new S3Client({});

export const handler = async (event: any) => {
  const { results, experimentId } = event;
  
  const html = generateHtmlReport(results);
  
  await s3Client.send(new PutObjectCommand({
    Bucket: process.env.REPORT_BUCKET,
    Key: `reports/${experimentId}/comparison.html`,
    Body: html,
    ContentType: 'text/html',
  }));

  return { reportUrl: `s3://${process.env.REPORT_BUCKET}/reports/${experimentId}/comparison.html` };
};
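The handler above relies on a `generateHtmlReport` helper that the article does not show. A minimal sketch of what it could look like follows; the `ReportRow` shape and the table layout are assumptions, not the original implementation. Text fields are escaped because model IDs and error messages end up inside HTML.

```typescript
interface ReportRow {
  modelId: string;
  scores?: { accuracy: number; engagement: number; structure: number } | null;
  error?: string;
}

// Escapes characters with special meaning in HTML so that model IDs and
// error text cannot inject markup into the report.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Renders one table row per model: three score cells, or a single
// error cell spanning the score columns.
function generateHtmlReport(results: ReportRow[]): string {
  const rows = results
    .map((r) => {
      const cells = r.scores
        ? `<td>${r.scores.accuracy}</td><td>${r.scores.engagement}</td><td>${r.scores.structure}</td>`
        : `<td colspan="3">${escapeHtml(r.error ?? 'No scores')}</td>`;
      return `<tr><td>${escapeHtml(r.modelId)}</td>${cells}</tr>`;
    })
    .join('\n');

  return [
    '<!DOCTYPE html>',
    '<html><head><meta charset="utf-8"><title>Model Comparison</title></head><body>',
    '<table>',
    '<tr><th>Model</th><th>Accuracy</th><th>Engagement</th><th>Structure</th></tr>',
    rows,
    '</table>',
    '</body></html>',
  ].join('\n');
}
```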

Pitfall Guide

Unsecured API Endpoints
- Explanation: Exposing a Bedrock invocation endpoint without authentication or rate limiting allows unauthorized users to generate unlimited API calls, resulting in unexpected charges.
- Fix: Implement API Keys, Usage Plans with quotas, and rate limiting at API Gateway. Monitor usage with CloudWatch alarms.
Sequential Model Invocation
- Explanation: Invoking models one after another increases latency linearly with the number of models and increases cost, either through extra Lambda invocations or through longer billed duration in a single function.
- Fix: Use parallel execution within a single Lambda function via Promise.allSettled or ThreadPoolExecutor (Python).
Self-Evaluation Bias
- Explanation: Using the same model to generate and score outputs can lead to inflated scores due to model-specific preferences or lack of objectivity.
- Fix: Use a distinct, smaller model (e.g., Claude Haiku) as a judge to evaluate outputs from larger models.
Ignoring Token Usage Metadata
- Explanation: Failing to capture input and output token counts prevents accurate cost tracking and anomaly detection.
- Fix: Parse the usage field from every Bedrock Converse API response and store it for billing and analysis.
Lambda Timeout Cascades
- Explanation: In a parallel execution model, a single slow model can cause the entire Lambda function to timeout, discarding all results.
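- Fix: Implement per-request timeouts that are shorter than the Lambda timeout, and use Promise.allSettled so that a timed-out call is recorded as a failure while the other results are kept. Promise.allSettled alone does not bound how long a call can run. Configure the Lambda timeout slightly higher than the per-request timeout.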
Premature Database Adoption
- Explanation: Introducing a database like RDS or DynamoDB early adds complexity and cost when simple file storage suffices for initial workloads.
- Fix: Start with S3 for storing experiment data and reports. Migrate to DynamoDB only when query patterns (e.g., filtering by user or date) require indexed access.
Lack of Input Validation
- Explanation: Malformed inputs can trigger expensive model invocations that fail or produce garbage results.
- Fix: Validate all inputs (schema, token limits, model IDs) before invoking Bedrock. Fail fast to save costs.
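The per-request timeout recommended for the Lambda timeout pitfall can be implemented with a small wrapper around each model call. The following is a sketch under assumptions: `withTimeout` is a hypothetical helper, and the timeout value is chosen by the caller.

```typescript
// Rejects with a timeout error if `task` does not settle within `ms` milliseconds.
// The AbortSignal lets the task cancel its underlying request when the timer fires.
async function withTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  ms: number,
  label: string,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`${label} timed out after ${ms} ms`));
    }, ms);
  });

  try {
    // Whichever settles first wins: the task or the timer.
    return await Promise.race([task(controller.signal), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

With the AWS SDK v3, the signal can typically be forwarded to the request, for example `withTimeout((signal) => client.send(command, { abortSignal: signal }), 25000, modelId)`, so the HTTP call is cancelled as well. Each wrapped call then rejects on its own schedule, and Promise.allSettled records it as a failure alongside the successful results.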

Production Bundle

Action Checklist

Enable Billing Alarms: Configure CloudWatch alarms for Bedrock costs at $10 and $25 thresholds with SNS notifications.
Secure API Endpoints: Apply API Keys, Usage Plans, and rate limiting to all evaluation endpoints.
Implement Validation: Add schema validation and token limit checks before model invocation.
Use Converse API: Standardize on the Bedrock Converse API for unified model access and token metadata.
Parallelize Inference: Use Promise.allSettled within Lambda for parallel model calls to optimize latency and cost.
Capture Token Usage: Parse and store inputTokens and outputTokens from every response for billing and analysis.
Store Checkpoints: Write intermediate results to S3 to enable retries without re-invoking models.
Deploy Judge Model: Configure a separate model for scoring to avoid self-evaluation bias.
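The checkpoint item can be sketched as a small wrapper that looks for a stored result before doing expensive work. Everything here is illustrative: `CheckpointStore`, `MemoryStore`, and `withCheckpoint` are hypothetical names, and an S3-backed store would implement the same two methods with GetObject and PutObject.

```typescript
// Minimal storage contract for checkpoints.
interface CheckpointStore {
  get(key: string): Promise<string | undefined>;
  put(key: string, value: string): Promise<void>;
}

// In-memory store, suitable for local testing only.
class MemoryStore implements CheckpointStore {
  private data = new Map<string, string>();

  async get(key: string): Promise<string | undefined> {
    return this.data.get(key);
  }

  async put(key: string, value: string): Promise<void> {
    this.data.set(key, value);
  }
}

// Returns the stored result for `key` if one exists; otherwise runs `compute`,
// stores its JSON-serialized result, and returns it.
async function withCheckpoint<T>(
  store: CheckpointStore,
  key: string,
  compute: () => Promise<T>,
): Promise<T> {
  const cached = await store.get(key);
  if (cached !== undefined) return JSON.parse(cached) as T;

  const result = await compute();
  await store.put(key, JSON.stringify(result));
  return result;
}
```

Keying checkpoints by experiment ID and model ID means a retried workflow only re-invokes the models whose results are missing.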

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
High Concurrency, Low Latency	In-Lambda Parallelism	Single invocation reduces overhead; parallelism minimizes wall-clock time.	Low (1 invocation per batch)
Complex Error Handling	Step Functions Parallel	Native retry logic and error handling per branch; easier to debug.	Medium (N invocations)
Simple Storage Needs	Amazon S3	Cheap, durable, and sufficient for sequential read/write patterns.	Very Low
Query-Heavy Workloads	DynamoDB	Fast, indexed queries for experiment history and user sessions.	Medium (Read/Write capacity)
Model Swapping	Converse API	Unified interface allows changing models without code changes.	Neutral

Configuration Template

The following CDK snippet defines the Step Functions state machine for the evaluation pipeline.

import * as cdk from 'aws-cdk-lib';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as path from 'path';

export class EvaluationPipelineStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const validateFn = new lambda.Function(this, 'ValidateFunction', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'validate.handler',
      code: lambda.Code.fromAsset(path.join(__dirname, 'lambda')),
    });

    const invokeFn = new lambda.Function(this, 'InvokeFunction', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'invoke.handler',
      code: lambda.Code.fromAsset(path.join(__dirname, 'lambda')),
      timeout: cdk.Duration.seconds(60),
    });

    const scoreFn = new lambda.Function(this, 'ScoreFunction', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'score.handler',
      code: lambda.Code.fromAsset(path.join(__dirname, 'lambda')),
      timeout: cdk.Duration.seconds(30),
    });

    const reportFn = new lambda.Function(this, 'ReportFunction', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'report.handler',
      code: lambda.Code.fromAsset(path.join(__dirname, 'lambda')),
    });

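    // Each handler's return value becomes the next state's input, so the
    // handlers' input and output shapes must line up from step to step.
    const definition = new sfn.Pass(this, 'Start', {
      result: sfn.Result.fromObject({ stage: 'validate' }),
      // Without resultPath, the Pass result would replace the execution input.
      resultPath: '$.meta',
    })
      .next(new tasks.LambdaInvoke(this, 'Validate', {
        lambdaFunction: validateFn,
        payload: sfn.TaskInput.fromJsonPathAt('$'),
        // Forward only the handler's return value, not the invoke metadata wrapper.
        outputPath: '$.Payload',
      }))
      .next(new tasks.LambdaInvoke(this, 'InvokeModels', {
        lambdaFunction: invokeFn,
        payload: sfn.TaskInput.fromJsonPathAt('$'),
        outputPath: '$.Payload',
      }))
      .next(new tasks.LambdaInvoke(this, 'ScoreOutputs', {
        lambdaFunction: scoreFn,
        payload: sfn.TaskInput.fromJsonPathAt('$'),
        outputPath: '$.Payload',
      }))
      .next(new tasks.LambdaInvoke(this, 'GenerateReport', {
        lambdaFunction: reportFn,
        payload: sfn.TaskInput.fromJsonPathAt('$'),
        outputPath: '$.Payload',
      }));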

    new sfn.StateMachine(this, 'EvaluationStateMachine', {
      definition,
      timeout: cdk.Duration.minutes(5),
    });
  }
}

Quick Start Guide

Deploy Infrastructure: Use the provided CDK template to deploy the Lambda functions, Step Functions state machine, and S3 buckets.
Configure Bedrock Access: Ensure the Lambda execution roles have bedrock:InvokeModel permissions for the required models.

Trigger Evaluation: Send a POST request to the API Gateway endpoint with the evaluation payload:

curl -X POST https://<api-id>.execute-api.<region>.amazonaws.com/prod/evaluate \
  -H "Content-Type: application/json" \
  -H "x-api-key: <your-api-key>" \
  -d '{
    "article": "Long text content...",
    "models": ["meta.llama3-70b-instruct-v1:0", "anthropic.claude-3-sonnet-20240229-v1:0"],
    "prompt": "Summarize this as a podcast script.",
    "scoringCriteria": { "accuracy": 10, "engagement": 10, "structure": 10 }
  }'

Monitor Progress: Check CloudWatch Logs for Lambda execution and Step Functions console for workflow status.
Retrieve Report: Once the workflow completes, download the HTML report from the S3 bucket URL returned in the response.
