Building a Serverless AI Model Evaluation Platform on AWS
Orchestrating Multi-Model AI Benchmarks with AWS Step Functions and Bedrock
Current Situation Analysis
Engineering teams evaluating Large Language Models (LLMs) face a scalability bottleneck when comparing outputs across multiple foundation models. The standard workflow—manually pasting prompts into model playgrounds, copying responses, and subjectively judging quality—is unrepeatable and unscalable. As organizations adopt multiple models (e.g., Llama 3, Claude, Nova) for different use cases, the need for automated, quantitative benchmarking becomes critical.
This problem is often overlooked because developers prioritize single-model integration over comparative analysis. However, model performance varies significantly by domain, and "best" is context-dependent. Without automation, teams cannot run the volume of experiments required to make data-driven model selection decisions.
The financial implications are also non-trivial. Foundation model APIs charge per token. A single benchmark run invoking three models on a long-form article can incur costs between $0.10 and $0.50. While this appears negligible, uncontrolled automated loops or lack of validation can cause costs to spike rapidly. Furthermore, latency compounds when models are invoked sequentially, making manual or poorly architected pipelines impractical for high-throughput evaluation.
WOW Moment: Key Findings
The architectural choice for parallel execution dramatically impacts both cost and latency. Moving from sequential invocations to a parallelized, single-execution model yields measurable gains. Additionally, adopting a unified API interface reduces code complexity and maintenance overhead when swapping models.
| Execution Strategy | Latency Profile | Invocation Cost | Error Resilience | Implementation Complexity |
|---|---|---|---|---|
| Sequential Lambdas | Sum of all model times | High (N invocations) | Low (Fail-fast) | Low |
| Step Functions Parallel | Max of model times | Medium (N invocations) | High | Medium |
| In-Lambda Parallelism | Max of model times | Low (1 invocation) | Medium (Timeout risk) | Medium |
| In-Lambda allSettled | Max of model times | Low (1 invocation) | High (Partial success) | Medium |
Why this matters: Using Promise.allSettled within a single Lambda function provides the latency benefits of parallelism while minimizing invocation costs. Unlike Promise.all, which fails the entire batch if one model errors, allSettled allows partial results, ensuring that a timeout from one model does not discard successful inferences from others. This approach also enables shared memory for the input payload, avoiding repeated S3 reads or network transfers for each model call.
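To make the contrast concrete, the following is a minimal sketch that uses a stand-in function instead of real Bedrock calls. The helper `fakeModel` and the model names are hypothetical and exist only for illustration; the point is the difference in what each combinator returns when one call fails.

```typescript
// Stand-in for a model call; `fakeModel` is hypothetical and only simulates success or failure.
function fakeModel(modelId: string, shouldFail: boolean): Promise<string> {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      if (shouldFail) reject(new Error(`${modelId} timed out`));
      else resolve(`${modelId} summary`);
    }, 10);
  });
}

// Promise.all rejects as soon as any call rejects, so the successful output is lost.
async function runWithAll(): Promise<string[]> {
  try {
    return await Promise.all([fakeModel('model-a', false), fakeModel('model-b', true)]);
  } catch {
    return [];
  }
}

// Promise.allSettled waits for every call and reports each outcome separately.
async function runWithAllSettled(): Promise<string[]> {
  const settled = await Promise.allSettled([
    fakeModel('model-a', false),
    fakeModel('model-b', true),
  ]);
  return settled
    .filter((r): r is PromiseFulfilledResult<string> => r.status === 'fulfilled')
    .map((r) => r.value);
}
```

With one failing call, `runWithAll` ends up with nothing to report, while `runWithAllSettled` still returns the output of the model that succeeded.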
Core Solution
The solution is a serverless orchestration pipeline built on AWS Step Functions, AWS Lambda, and Amazon Bedrock. The system accepts an evaluation request, validates inputs, runs parallel inferences across specified models, scores outputs using a dedicated judge model, and generates a structured report.
Architecture Overview
- API Gateway: Exposes a REST endpoint for triggering evaluations.
- Step Functions: Orchestrates the workflow, managing state transitions and error handling.
- Lambda Functions: Execute discrete steps: validation, parallel inference, scoring, and report generation.
- Amazon Bedrock: Provides access to foundation models via the Converse API.
- Amazon S3: Stores intermediate checkpoints and final reports.
Implementation Details
We use TypeScript for the Lambda functions to leverage strong typing and the AWS SDK v3. The pipeline is defined as a Step Functions state machine.
Step 1: Input Validation
Validation occurs before any model invocation to prevent wasted costs. We check for required fields, valid model IDs, and token limits.
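import { z } from 'zod';

// Schema for an evaluation request; limits are enforced before any model is invoked.
const EvaluationSchema = z.object({
  article: z.string().min(100).max(50000),
  models: z.array(z.string()).min(1).max(5),
  prompt: z.string().min(10),
  scoringCriteria: z.object({
    accuracy: z.number().min(1).max(10),
    engagement: z.number().min(1).max(10),
    structure: z.number().min(1).max(10),
  }),
});

export const handler = async (event: any) => {
  try {
    // API Gateway proxy events carry the body as a JSON string; other callers may pass an object.
    const body = typeof event.body === 'string' ? JSON.parse(event.body) : event.body;
    const validated = EvaluationSchema.parse(body);
    return { statusCode: 200, body: JSON.stringify(validated) };
  } catch (error) {
    if (error instanceof z.ZodError) {
      // `issues` lists each failed check with its path and message.
      return { statusCode: 400, body: JSON.stringify({ errors: error.issues }) };
    }
    throw error;
  }
};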
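The schema above checks field shapes and character lengths, but the model ID and token limit checks mentioned in the text still need their own logic. The following is a rough sketch of one way to add them. `ALLOWED_MODELS`, `MAX_INPUT_TOKENS`, `estimateTokens`, and `preflight` are all hypothetical names, and the four-characters-per-token ratio is only a coarse heuristic for English text, not a real tokenizer.

```typescript
// Hypothetical allow-list; replace with the model IDs enabled in your account.
const ALLOWED_MODELS = new Set<string>([
  'meta.llama3-70b-instruct-v1:0',
  'anthropic.claude-3-sonnet',
]);

// Assumed budget; choose a value that fits the smallest context window you benchmark.
const MAX_INPUT_TOKENS = 8000;

// Rough heuristic (about 4 characters per token), not an actual tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface PreflightRequest {
  article: string;
  prompt: string;
  models: string[];
}

// Returns a list of human-readable problems; an empty list means the request may proceed.
function preflight(req: PreflightRequest): string[] {
  const problems: string[] = [];
  for (const id of req.models) {
    if (!ALLOWED_MODELS.has(id)) problems.push(`Unknown model ID: ${id}`);
  }
  const tokens = estimateTokens(req.article) + estimateTokens(req.prompt);
  if (tokens > MAX_INPUT_TOKENS) {
    problems.push(`Estimated ${tokens} input tokens exceeds ${MAX_INPUT_TOKENS}`);
  }
  return problems;
}
```

A check like this could run right after schema validation, so that a request naming an unavailable model is rejected before it reaches Bedrock.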
Step 2: Parallel Model Invocation
This Lambda uses Promise.allSettled to invoke multiple Bedrock models concurrently. We use the Bedrock Converse API, which provides a unified interface across all supported models, simplifying code maintenance.
import { BedrockRuntimeClient, ConverseCommand } from '@aws-sdk/client-bedrock-runtime';

// Create the client once per container so warm invocations reuse it.
const client = new BedrockRuntimeClient({ region: process.env.AWS_REGION });

interface ModelResult {
  modelId: string;
  output?: string;
  usage?: { inputTokens: number; outputTokens: number };
  error?: string;
}

export const handler = async (event: any): Promise<ModelResult[]> => {
  const { article, models, prompt } = event;

  // Start every model call immediately; each promise resolves to a ModelResult.
  const promises = models.map(async (modelId: string): Promise<ModelResult> => {
    try {
      const command = new ConverseCommand({
        modelId,
        messages: [{ role: 'user', content: [{ text: article }] }],
        system: [{ text: prompt }],
      });
      const response = await client.send(command);
      return {
        modelId,
        output: response.output?.message?.content?.[0]?.text || '',
        usage: {
          inputTokens: response.usage?.inputTokens || 0,
          outputTokens: response.usage?.outputTokens || 0,
        },
      };
    } catch (err: any) {
      // Convert failures into data so one model's error does not reject the batch.
      return { modelId, error: err?.message ?? String(err) };
    }
  });

  const results = await Promise.allSettled(promises);
  return results.map((res, index) => {
    if (res.status === 'fulfilled') return res.value;
    // Defensive fallback: res.reason is not guaranteed to be a string.
    return { modelId: models[index], error: String(res.reason) };
  });
};
Rationale:
- Converse API: Eliminates the need for model-specific request/response parsing. Switching from meta.llama3-70b to anthropic.claude-3 requires only changing the modelId.
- Promise.allSettled: Ensures that if one model times out or returns an error, the results from other models are preserved. This is critical for benchmarking reliability.
- Single Lambda Execution: Reduces cost compared to invoking separate Lambdas per model and allows the input article to reside in memory, avoiding redundant I/O.
Step 3: Scoring with a Judge Model
Outputs are scored by a separate "judge" model to avoid self-evaluation bias. The judge evaluates each output against the defined criteria.
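import { BedrockRuntimeClient, ConverseCommand } from '@aws-sdk/client-bedrock-runtime';

const client = new BedrockRuntimeClient({ region: process.env.AWS_REGION });

// Abbreviated ID as given in the article; Bedrock expects the full versioned model ID.
const SCORING_MODEL = 'anthropic.claude-3-haiku';

const scoringPrompt = (criteria: any) => `
Evaluate the following summary based on these criteria:
- Accuracy (${criteria.accuracy}/10): Faithfulness to source.
- Engagement (${criteria.engagement}/10): Listener appeal.
- Structure (${criteria.structure}/10): Organization for audio.
Return JSON: { "accuracy": number, "engagement": number, "structure": number }
`;

export const handler = async (event: any) => {
  const { outputs, criteria } = event;

  const scoringPromises = outputs.map(async (output: any) => {
    // Models that failed during inference have nothing to score.
    if (!output.output) {
      return { modelId: output.modelId, scores: null, error: output.error ?? 'No output to score' };
    }
    const command = new ConverseCommand({
      modelId: SCORING_MODEL,
      // The rubric goes in the system prompt and the candidate text is the only user turn;
      // two consecutive user messages can be rejected because turns must alternate.
      system: [{ text: scoringPrompt(criteria) }],
      messages: [{ role: 'user', content: [{ text: output.output }] }],
    });
    const response = await client.send(command);
    const scoreText = response.output?.message?.content?.[0]?.text || '{}';
    try {
      const scores = JSON.parse(scoreText);
      return { modelId: output.modelId, scores };
    } catch {
      return { modelId: output.modelId, scores: null, error: 'Invalid JSON from judge' };
    }
  });

  // Each entry is a settled-result object ({ status, value } or { status, reason }).
  return Promise.allSettled(scoringPromises);
};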
Rationale:
- Separate Judge Model: Using a distinct model (e.g., Claude Haiku) for scoring reduces the bias that occurs when a model evaluates its own output.
- JSON Enforcement: The prompt requests JSON output, which is parsed programmatically. Error handling captures malformed responses.
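Judge models sometimes wrap the requested JSON in extra prose, which makes a bare JSON.parse fail even though the scores are present. The sketch below shows one possible way to make parsing more tolerant. The function name `parseJudgeScores` is hypothetical, and the 1 to 10 range check is an assumption based on the criteria scale used in the request schema.

```typescript
interface JudgeScores {
  accuracy: number;
  engagement: number;
  structure: number;
}

// Takes the text between the first '{' and the last '}' in the judge's reply,
// parses it, and checks that each score is a number in the assumed 1..10 range.
// Returns null when no usable JSON object is found.
function parseJudgeScores(reply: string): JudgeScores | null {
  const start = reply.indexOf('{');
  const end = reply.lastIndexOf('}');
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(reply.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof parsed !== 'object' || parsed === null) return null;

  const record = parsed as Record<string, unknown>;
  const keys = ['accuracy', 'engagement', 'structure'] as const;
  for (const key of keys) {
    const value = record[key];
    if (typeof value !== 'number' || value < 1 || value > 10) return null;
  }
  return {
    accuracy: record['accuracy'] as number,
    engagement: record['engagement'] as number,
    structure: record['structure'] as number,
  };
}
```

Returning null rather than throwing keeps the behavior in line with the scoring handler, which records a per-model error instead of failing the whole batch.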
Step 4: Report Generation
The final step aggregates results and generates an HTML report stored in S3.
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3Client = new S3Client({});

export const handler = async (event: any) => {
  const { results, experimentId } = event;
  // generateHtmlReport is a helper that renders the results as an HTML string.
  const html = generateHtmlReport(results);
  const key = `reports/${experimentId}/comparison.html`;

  await s3Client.send(new PutObjectCommand({
    Bucket: process.env.REPORT_BUCKET,
    Key: key,
    Body: html,
    ContentType: 'text/html',
  }));

  // This is an S3 URI, not a browser-accessible HTTPS link.
  return { reportUrl: `s3://${process.env.REPORT_BUCKET}/${key}` };
};
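The handler relies on a generateHtmlReport helper whose body is not given. A minimal sketch of what such a helper might look like follows. The `ReportRow` shape and the `escapeHtml` helper are assumptions made for illustration; the actual result shape depends on how the scoring step's output is merged with the inference results.

```typescript
// Assumed shape of one report row; the article does not define it.
interface ReportRow {
  modelId: string;
  output?: string;
  scores?: { accuracy: number; engagement: number; structure: number } | null;
  error?: string;
}

// Model output is untrusted text, so escape it before embedding it in HTML.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Renders one table row per model, showing either its output or its error.
function generateHtmlReport(results: ReportRow[]): string {
  const rows = results.map((r) => {
    const scores = r.scores
      ? `${r.scores.accuracy} / ${r.scores.engagement} / ${r.scores.structure}`
      : 'n/a';
    const body = r.error ? `Error: ${escapeHtml(r.error)}` : escapeHtml(r.output ?? '');
    return `<tr><td>${escapeHtml(r.modelId)}</td><td>${scores}</td><td>${body}</td></tr>`;
  });
  return [
    '<!DOCTYPE html>',
    '<html><head><meta charset="utf-8"><title>Model comparison</title></head><body>',
    '<table>',
    '<tr><th>Model</th><th>Accuracy / Engagement / Structure</th><th>Output</th></tr>',
    ...rows,
    '</table>',
    '</body></html>',
  ].join('\n');
}
```

Escaping matters here because the report is served as text/html and the model output could contain markup.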
Pitfall Guide
Unsecured API Endpoints
- Explanation: Exposing a Bedrock invocation endpoint without authentication or rate limiting allows unauthorized users to generate unlimited API calls, resulting in unexpected charges.
- Fix: Implement API Keys, Usage Plans with quotas, and rate limiting at API Gateway. Monitor usage with CloudWatch alarms.
Sequential Model Invocation
- Explanation: Invoking models one after another increases latency linearly with the number of models and increases cost due to multiple Lambda invocations.
- Fix: Use parallel execution within a single Lambda function via Promise.allSettled or ThreadPoolExecutor (Python).
Self-Evaluation Bias
- Explanation: Using the same model to generate and score outputs can lead to inflated scores due to model-specific preferences or lack of objectivity.
- Fix: Use a distinct, smaller model (e.g., Claude Haiku) as a judge to evaluate outputs from larger models.
Ignoring Token Usage Metadata
- Explanation: Failing to capture input and output token counts prevents accurate cost tracking and anomaly detection.
- Fix: Parse the usage field from every Bedrock Converse API response and store it for billing and analysis.
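Once token counts are captured, turning them into a cost estimate is simple arithmetic. The sketch below illustrates the calculation. The `PRICES` table, its model names, and its per-1,000-token rates are placeholders invented for this example and are not real Bedrock prices; substitute the current rates for the models you use.

```typescript
// Per-1,000-token prices in USD. Placeholder values, not real Bedrock pricing.
interface Price {
  inputPer1k: number;
  outputPer1k: number;
}

const PRICES: Record<string, Price> = {
  'example.model-a': { inputPer1k: 0.003, outputPer1k: 0.015 },
  'example.model-b': { inputPer1k: 0.001, outputPer1k: 0.002 },
};

interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Returns the estimated cost in USD, or null when no price is configured for the model.
function estimateCost(modelId: string, usage: Usage): number | null {
  const price = PRICES[modelId];
  if (!price) return null;
  return (
    (usage.inputTokens / 1000) * price.inputPer1k +
    (usage.outputTokens / 1000) * price.outputPer1k
  );
}
```

Storing the estimate alongside each result makes it possible to compare models on cost per run as well as on quality scores.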
Lambda Timeout Cascades
- Explanation: In a parallel execution model, a single slow model can cause the entire Lambda function to time out, discarding all results.
- Fix: Implement per-request timeouts and use Promise.allSettled to handle partial failures gracefully. Configure the Lambda timeout slightly higher than the maximum expected model latency.
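A per-request timeout can be sketched with Promise.race, as shown below. The helper names `withTimeout`, `delay`, and `timeoutDemo` are hypothetical. Note that this approach only stops waiting for the slow call and does not cancel the underlying request; passing an abort signal to the SDK call would be one way to cancel it as well.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds.
// This only stops waiting; it does not cancel the underlying request.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} exceeded ${ms} ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    // Clear the timer so it does not keep the event loop busy after the race settles.
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Stand-in for a model call that resolves after `ms` milliseconds.
function delay<T>(ms: number, value: T): Promise<T> {
  return new Promise((resolve) => setTimeout(() => resolve(value), ms));
}

// One fast call and one slow call, each limited to 50 ms.
async function timeoutDemo(): Promise<string[]> {
  const settled = await Promise.allSettled([
    withTimeout(delay(5, 'fast output'), 50, 'fast-model'),
    withTimeout(delay(200, 'slow output'), 50, 'slow-model'),
  ]);
  return settled.map((r) => r.status);
}
```

Combined with Promise.allSettled, the slow call is reported as rejected while the fast call's output is kept, and the function returns well before the Lambda timeout.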
Premature Database Adoption
- Explanation: Introducing a database like RDS or DynamoDB early adds complexity and cost when simple file storage suffices for initial workloads.
- Fix: Start with S3 for storing experiment data and reports. Migrate to DynamoDB only when query patterns (e.g., filtering by user or date) require indexed access.
Lack of Input Validation
- Explanation: Malformed inputs can trigger expensive model invocations that fail or produce garbage results.
- Fix: Validate all inputs (schema, token limits, model IDs) before invoking Bedrock. Fail fast to save costs.
Production Bundle
Action Checklist
- Enable Billing Alarms: Configure CloudWatch alarms for Bedrock costs at $10 and $25 thresholds with SNS notifications.
- Secure API Endpoints: Apply API Keys, Usage Plans, and rate limiting to all evaluation endpoints.
- Implement Validation: Add schema validation and token limit checks before model invocation.
- Use Converse API: Standardize on the Bedrock Converse API for unified model access and token metadata.
- Parallelize Inference: Use Promise.allSettled within Lambda for parallel model calls to optimize latency and cost.
- Capture Token Usage: Parse and store inputTokens and outputTokens from every response for billing and analysis.
- Store Checkpoints: Write intermediate results to S3 to enable retries without re-invoking models.
- Deploy Judge Model: Configure a separate model for scoring to avoid self-evaluation bias.
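The checkpoint item in the checklist above can be sketched as a small read-through helper. To keep the example runnable without AWS access, it uses a hypothetical `CheckpointStore` interface with an in-memory implementation; an S3-backed version would implement the same two methods with object reads and writes. All names here are illustrative assumptions.

```typescript
// Minimal storage interface; an S3-backed version would wrap object reads and writes.
interface CheckpointStore {
  get(key: string): Promise<string | undefined>;
  put(key: string, value: string): Promise<void>;
}

// In-memory store used here so the sketch runs without AWS access.
class MemoryStore implements CheckpointStore {
  private data = new Map<string, string>();

  async get(key: string): Promise<string | undefined> {
    return this.data.get(key);
  }

  async put(key: string, value: string): Promise<void> {
    this.data.set(key, value);
  }
}

// Returns the checkpointed value if present; otherwise runs `compute`, saves the result, and returns it.
async function withCheckpoint<T>(
  store: CheckpointStore,
  key: string,
  compute: () => Promise<T>,
): Promise<T> {
  const cached = await store.get(key);
  if (cached !== undefined) return JSON.parse(cached) as T;
  const fresh = await compute();
  await store.put(key, JSON.stringify(fresh));
  return fresh;
}
```

Keying checkpoints by experiment and model (for example `experimentId/modelId`) would let a retried run skip models that already returned a result.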
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High Concurrency, Low Latency | In-Lambda Parallelism | Single invocation reduces overhead; parallelism minimizes wall-clock time. | Low (1 invocation per batch) |
| Complex Error Handling | Step Functions Parallel | Native retry logic and error handling per branch; easier to debug. | Medium (N invocations) |
| Simple Storage Needs | Amazon S3 | Cheap, durable, and sufficient for sequential read/write patterns. | Very Low |
| Query-Heavy Workloads | DynamoDB | Fast, indexed queries for experiment history and user sessions. | Medium (Read/Write capacity) |
| Model Swapping | Converse API | Unified interface allows changing models without code changes. | Neutral |
Configuration Template
The following CDK snippet defines the Step Functions state machine for the evaluation pipeline.
import * as cdk from 'aws-cdk-lib';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as path from 'path';

export class EvaluationPipelineStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const validateFn = new lambda.Function(this, 'ValidateFunction', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'validate.handler',
      code: lambda.Code.fromAsset(path.join(__dirname, 'lambda')),
    });
    const invokeFn = new lambda.Function(this, 'InvokeFunction', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'invoke.handler',
      code: lambda.Code.fromAsset(path.join(__dirname, 'lambda')),
      timeout: cdk.Duration.seconds(60),
    });
    const scoreFn = new lambda.Function(this, 'ScoreFunction', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'score.handler',
      code: lambda.Code.fromAsset(path.join(__dirname, 'lambda')),
      timeout: cdk.Duration.seconds(30),
    });
    const reportFn = new lambda.Function(this, 'ReportFunction', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'report.handler',
      code: lambda.Code.fromAsset(path.join(__dirname, 'lambda')),
    });

    // Write the stage marker under $.meta so the Pass state does not overwrite the request payload.
    const start = new sfn.Pass(this, 'Start', {
      result: sfn.Result.fromObject({ stage: 'validate' }),
      resultPath: '$.meta',
    });

    // outputPath '$.Payload' forwards only the Lambda's return value to the next state,
    // dropping the invocation metadata that LambdaInvoke adds by default.
    const step = (id: string, fn: lambda.IFunction) =>
      new tasks.LambdaInvoke(this, id, {
        lambdaFunction: fn,
        payload: sfn.TaskInput.fromJsonPathAt('$'),
        outputPath: '$.Payload',
      });

    const definition = start
      .next(step('Validate', validateFn))
      .next(step('InvokeModels', invokeFn))
      .next(step('ScoreOutputs', scoreFn))
      .next(step('GenerateReport', reportFn));

    new sfn.StateMachine(this, 'EvaluationStateMachine', {
      definitionBody: sfn.DefinitionBody.fromChainable(definition),
      timeout: cdk.Duration.minutes(5),
    });
  }
}
Quick Start Guide
- Deploy Infrastructure: Use the provided CDK template to deploy the Lambda functions, Step Functions state machine, and S3 buckets.
- Configure Bedrock Access: Ensure the Lambda execution roles have bedrock:InvokeModel permissions for the required models.
- Trigger Evaluation: Send a POST request to the API Gateway endpoint with the evaluation payload:
curl -X POST https://<api-id>.execute-api.<region>.amazonaws.com/prod/evaluate \
  -H "Content-Type: application/json" \
  -H "x-api-key: <your-api-key>" \
  -d '{
    "article": "Long text content...",
    "models": ["meta.llama3-70b-instruct-v1:0", "anthropic.claude-3-sonnet"],
    "prompt": "Summarize this as a podcast script.",
    "scoringCriteria": { "accuracy": 10, "engagement": 10, "structure": 10 }
  }'
- Monitor Progress: Check CloudWatch Logs for Lambda execution and Step Functions console for workflow status.
- Retrieve Report: Once the workflow completes, download the HTML report from the S3 bucket URL returned in the response.
