# Building an Agent Evaluation Pipeline with Google ADK

*Beyond Pass/Fail: Structuring Agent Quality Gates with Google ADK*
## Current Situation Analysis
Traditional software engineering relies on deterministic contracts: given input X, the system must produce output Y. Unit tests, integration suites, and regression pipelines thrive on this predictability. Generative agents shatter that contract. Because large language models operate probabilistically, identical prompts can yield divergent reasoning paths, tool selections, and final outputs. Forcing deterministic test frameworks onto agent architectures creates false confidence. Teams ship features that pass local validation but fail unpredictably in production, masking behavioral drift behind surface-level correctness.
This gap is frequently overlooked because engineering teams conflate output accuracy with system reliability. An agent might return the correct weather forecast while internally invoking three irrelevant APIs, hallucinating intermediate states, or violating rate limits. Without process-aware validation, quality plateaus. Development cycles devolve into reactive patching rather than systematic improvement.
Industry analysis consistently points to a single root cause: the absence of structured evaluation pipelines. When teams skip agent-specific evals, three predictable failure modes emerge:
- **Regression cascades:** Fixing one behavioral flaw inadvertently triggers new failure modes elsewhere in the reasoning chain.
- **Blind spots in effectiveness:** Teams rely on subjective validation ("it feels right") rather than quantifiable performance metrics across task distributions.
- **Prompt inflation:** Engineers attempt to hardcode edge-case handling directly into system prompts, resulting in brittle, unmaintainable instruction sets that degrade model performance.
Robust evaluation is not a testing afterthought. It is the foundational feedback loop that transforms agent development from experimental prototyping into production-grade engineering. Google's Agent Development Kit (ADK) addresses this by decoupling evaluation into two distinct axes: trajectory validation (how the agent reaches a conclusion) and response validation (what the agent ultimately delivers). Mastering both axes is the difference between shipping a demo and shipping a reliable system.
## WOW Moment: Key Findings
Agent evaluation is not a single metric. It is a multi-dimensional scoring matrix. The critical insight is that trajectory analysis and response analysis measure fundamentally different properties, and combining them reveals failure modes that either axis misses in isolation.
| Evaluation Axis | Cost Profile | Latency Impact | Precision Level | Primary Use Case |
|---|---|---|---|---|
| Trajectory Matching | Near-zero (deterministic) | Milliseconds | High (structural) | Tool sequencing, API compliance, reasoning path validation |
| Token Overlap (ROUGE-1) | Near-zero (deterministic) | Milliseconds | Medium (lexical) | Strict formatting requirements, exact phrase matching |
| Semantic Judge (LLM-as-Judge) | API-dependent (per-call) | Seconds | High (contextual) | Open-ended responses, paraphrase tolerance, intent alignment |
| Rubric-Based Judge | API-dependent (per-call) | Seconds | High (qualitative) | Summarization, creative generation, multi-property validation |
**Why this matters:** Trajectory evaluation catches behavioral drift that output metrics completely miss. An agent can score 1.0 on semantic similarity while violating security policies, calling deprecated endpoints, or entering infinite tool loops. Conversely, perfect trajectory matching guarantees nothing about the final answer's usefulness. Production systems require a hybrid scoring strategy: trajectory gates for safety and compliance, combined with semantic or rubric-based judges for quality and user intent. This dual-axis approach enables automated quality gates that scale with model updates and prompt iterations.
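To make the dual-axis strategy concrete, here is a minimal sketch of a hybrid gate in plain Python. The floor values mirror the thresholds used later in this guide; the function and parameter names are illustrative, not part of the ADK API.

```python
# Minimal sketch of a dual-axis quality gate (illustrative, not an ADK API).
def passes_quality_gate(trajectory_score: float, semantic_score: float) -> bool:
    """Trajectory acts as a hard safety gate; semantics measure usefulness."""
    TRAJECTORY_FLOOR = 0.85  # structural/compliance gate (cheap, deterministic)
    SEMANTIC_FLOOR = 0.90    # intent-alignment gate (judge-model based)
    # Both axes must clear their floor: a perfect answer reached through a
    # forbidden tool path fails, and a compliant path that produces a
    # useless answer fails too.
    return trajectory_score >= TRAJECTORY_FLOOR and semantic_score >= SEMANTIC_FLOOR


print(passes_quality_gate(0.92, 0.95))  # True: both axes clear their floors
print(passes_quality_gate(0.40, 1.00))  # False: output is fine, path is not
```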
## Core Solution
Building an evaluation pipeline with ADK requires three distinct phases: constructing the test dataset, defining the scoring criteria, and orchestrating execution. The architecture prioritizes decoupling, async execution, and CI-native integration.
### Step 1: Construct the Evaluation Dataset
ADK supports two input formats: `.test.json` for single-turn sessions and `.evalset.json` for multi-turn conversations. For production agents, `.evalset.json` is the standard. It captures conversation history, expected responses, and initial session state.
```json
{
  "eval_set_id": "logistics_qa_v1",
  "name": "Supply Chain Assistant Evaluation",
  "description": "Multi-turn validation for inventory lookup and routing logic",
  "eval_cases": [
    {
      "eval_id": "inventory_check_with_context",
      "conversation": [
        {
          "user_content": {
            "parts": [{"text": "Check stock for SKU-8842 at warehouse B."}]
          },
          "final_response": {
            "parts": [{"text": "SKU-8842 has 142 units available at Warehouse B."}]
          }
        },
        {
          "user_content": {
            "parts": [{"text": "Reserve 20 units for order #9910."}]
          },
          "final_response": {
            "parts": [{"text": "Reservation confirmed. 122 units remaining."}]
          }
        }
      ],
      "session_input": {
        "app_name": "logistics_agent",
        "user_id": "eval_operator_01",
        "state": {
          "warehouse_region": "us-east-1",
          "auth_token": "mock_eval_token"
        }
      }
    }
  ]
}
```
**Architecture decision:** We seed `session_input.state` explicitly. Agents often depend on contextual variables (user roles, regional settings, cached credentials). Omitting state initialization forces the agent to reconstruct context from scratch, introducing variance that corrupts evaluation consistency.
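One lightweight way to enforce this is to inject a shared baseline state into every case before the file is consumed. The helper below is a hypothetical pre-processing step, not an ADK feature; the state keys mirror the example above.

```python
import json
from pathlib import Path

# Baseline context every eval case should run with (mirrors the example above).
BASELINE_STATE = {"warehouse_region": "us-east-1", "auth_token": "mock_eval_token"}


def seed_session_state(eval_set_path: Path) -> None:
    """Fill in missing state keys so no eval case runs without context."""
    eval_set = json.loads(eval_set_path.read_text(encoding="utf-8"))
    for case in eval_set["eval_cases"]:
        session_input = case.setdefault("session_input", {})
        # Per-case values win; the baseline only fills gaps.
        session_input["state"] = {**BASELINE_STATE, **session_input.get("state", {})}
    eval_set_path.write_text(json.dumps(eval_set, indent=2), encoding="utf-8")


# Usage: seed_session_state(Path("tests/datasets/logistics.evalset.json"))
```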
### Step 2: Define the Scoring Criteria
ADK provides eleven built-in criteria. Production pipelines typically combine structural checks with semantic judges. Criteria are defined in a separate configuration file to enable parallel iteration between test data and scoring logic.
```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 0.85,
      "match_type": "IN_ORDER"
    },
    "final_response_match_v2": {
      "threshold": 0.90,
      "judge_model": "gemini-2.0-flash"
    },
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.80,
      "rubrics": [
        {
          "rubric_id": "data_precision",
          "rubric_content": {
            "text_property": "The response includes exact numeric values without rounding or estimation."
          }
        },
        {
          "rubric_id": "action_clarity",
          "rubric_content": {
            "text_property": "The response explicitly states the next recommended step or confirms completion."
          }
        }
      ]
    }
  }
}
```
**Architecture decision:** We use `IN_ORDER` for trajectory matching instead of `EXACT`. Strict exact matching fails when models introduce harmless intermediate steps (e.g., logging calls, cache checks). `IN_ORDER` validates the critical path while tolerating implementation variance. We pair this with a semantic judge for response quality and a rubric-based judge for property validation. This triad covers safety, accuracy, and usability.
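The distinction is easy to see with a toy subsequence check. This is a simplification of what `IN_ORDER` tolerates, not ADK's actual scoring algorithm:

```python
# Simplified illustration of EXACT vs. IN_ORDER matching semantics.
def matches_in_order(expected: list[str], actual: list[str]) -> bool:
    """True if every expected step appears in `actual`, in the same order."""
    remaining = iter(actual)  # membership tests consume the iterator,
    return all(step in remaining for step in expected)  # which enforces order


expected = ["check_inventory", "reserve_stock"]
actual = ["check_inventory", "log_request", "reserve_stock"]  # harmless extra step

print(matches_in_order(expected, actual))  # True: the critical path is intact
print(expected == actual)                  # False: an EXACT-style match fails
```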
### Step 3: Orchestrate Execution
ADK exposes `AgentEvaluator` for programmatic execution. While the CLI (`adk eval`) and Web UI (`adk web`) exist, pytest integration is the production standard. It enables async execution, native CI/CD hooks, and explicit error reporting.
```python
import json
from pathlib import Path

import pytest

from google.adk.evaluation.agent_evaluator import AgentEvaluator
from google.adk.evaluation.eval_config import EvalConfig
from google.adk.evaluation.eval_set import EvalSet


@pytest.mark.asyncio
async def test_logistics_agent_quality_gates():
    eval_set_path = Path("tests/datasets/logistics.evalset.json")
    config_path = Path("tests/config/quality_gates.json")

    # Validate the raw JSON against the ADK schemas up front so schema
    # errors surface as load failures, not mid-run surprises.
    with eval_set_path.open("r", encoding="utf-8") as dataset_file:
        eval_set = EvalSet.model_validate(json.load(dataset_file))
    with config_path.open("r", encoding="utf-8") as config_file:
        eval_config = EvalConfig.model_validate(json.load(config_file))

    # Execute every eval case; raise_on_failure halts CI on degradation.
    execution_result = await AgentEvaluator.evaluate_eval_set(
        agent_module="logistics_agent.core",
        eval_set=eval_set,
        eval_config=eval_config,
        raise_on_failure=True,
    )

    assert execution_result.overall_score >= 0.85, (
        f"Agent quality gate failed. Score: {execution_result.overall_score}"
    )
```
**Architecture decision:** We use `evaluate_eval_set` instead of the higher-level `evaluate` wrapper. Explicit loading preserves visibility into the evaluation contract, simplifies debugging, and allows custom pre-processing (e.g., dynamic threshold injection, environment variable resolution). The `raise_on_failure=True` flag ensures CI pipelines halt on quality degradation, preventing regression deployment.
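As an example of the pre-processing that explicit loading enables, the sketch below overrides a judge threshold from a CI environment variable before validation. The `SEMANTIC_THRESHOLD` variable name and the helper itself are our own convention, not part of ADK:

```python
import json
import os
from pathlib import Path


def load_config_with_overrides(config_path: Path) -> dict:
    """Read raw criteria and apply CI-driven overrides before validation."""
    raw = json.loads(config_path.read_text(encoding="utf-8"))
    override = os.environ.get("SEMANTIC_THRESHOLD")  # e.g. set per-branch in CI
    if override is not None:
        raw["criteria"]["final_response_match_v2"]["threshold"] = float(override)
    return raw  # pass the result to EvalConfig.model_validate(...) as above
```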
## Pitfall Guide
### 1. Over-Indexing on Exact Trajectory Matching

**Explanation:** Using the `EXACT` match type forces the agent to replicate every intermediate step. Modern LLMs frequently insert logging, cache validation, or fallback routing that breaks exact matching despite correct behavior.

**Fix:** Switch to `IN_ORDER` for critical tool sequences. Reserve `EXACT` only for security-sensitive or compliance-critical workflows where step count and order are strictly mandated.
### 2. Ignoring Session State Initialization

**Explanation:** Agents often rely on contextual variables (user preferences, regional settings, cached tokens). Evaluating without seeding state forces the model to infer context, introducing non-determinism that corrupts scoring.

**Fix:** Always populate `session_input.state` in eval sets. Mock external dependencies where possible to isolate agent reasoning from infrastructure variance.
### 3. Treating ROUGE-1 as Semantic Truth

**Explanation:** `response_match_score` measures token overlap, not meaning. It fails on paraphrasing, synonym substitution, and structural reordering. A 0.95 ROUGE score can mask completely different user intent.

**Fix:** Use ROUGE-1 only for strict formatting requirements (e.g., JSON schema compliance, exact code generation). Pair it with `final_response_match_v2` for semantic validation.
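For intuition, here is a crude worked example of the overlap failure. The function below is a recall-style unigram approximation, not ADK's scorer, but the effect is the same: a faithful paraphrase lands around 0.62 and would miss a 0.90 gate.

```python
# Crude recall-oriented unigram overlap (illustrative, not ADK's ROUGE-1).
def unigram_recall(reference: str, candidate: str) -> float:
    ref = [t.strip(".,").lower() for t in reference.split()]
    cand = {t.strip(".,").lower() for t in candidate.split()}
    return sum(1 for token in ref if token in cand) / len(ref)


reference = "SKU-8842 has 142 units available at Warehouse B."
paraphrase = "Warehouse B currently stocks 142 units of SKU-8842."

# Same meaning, different surface form: only 5 of 8 reference tokens match.
print(round(unigram_recall(reference, paraphrase), 2))  # 0.62
```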
### 4. Unbounded LLM Judge Costs in CI

**Explanation:** Semantic and rubric-based criteria invoke external judge models. Running these on every commit in a high-velocity repository quickly accumulates API costs and pipeline latency.

**Fix:** Implement tiered evaluation. Run trajectory and ROUGE checks on every PR. Gate semantic/rubric judges behind nightly runs or manual triggers. Cache judge responses for identical prompt-response pairs.
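One way to express the tiering inside pytest itself is a custom marker, assuming a `nightly` marker is registered in `pytest.ini` and `pytest-asyncio` is installed. PR pipelines then run `pytest -m "not nightly"` while the scheduled job runs the full suite:

```python
import pytest


@pytest.mark.asyncio
async def test_trajectory_gates():
    ...  # deterministic trajectory + ROUGE criteria only: runs on every PR


@pytest.mark.nightly  # assumes the marker is registered in pytest.ini
@pytest.mark.asyncio
async def test_semantic_gates():
    ...  # judge-model criteria: incurs API cost, runs on a schedule
```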
### 5. Static Thresholds for Dynamic Models

**Explanation:** Hardcoding thresholds (e.g., 0.90) assumes model behavior is static. Upgrading base models, adjusting temperature, or modifying system prompts shifts score distributions.

**Fix:** Implement dynamic thresholding based on historical baselines. Track score drift over time and adjust gates proportionally. Use rolling averages rather than absolute cutoffs for semantic judges.
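A minimal sketch of such a rolling-baseline gate; the window size and tolerance below are illustrative starting points, not ADK recommendations:

```python
from collections import deque


class DriftGate:
    """Fail only when a score drops meaningfully below its recent average."""

    def __init__(self, window: int = 20, tolerance: float = 0.05):
        self.history: deque[float] = deque(maxlen=window)
        self.tolerance = tolerance

    def check(self, score: float) -> bool:
        # The first observation becomes its own baseline.
        baseline = sum(self.history) / len(self.history) if self.history else score
        self.history.append(score)
        return score >= baseline - self.tolerance
```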
### 6. Evaluating Tools Without Validating Arguments

**Explanation:** Trajectory matching checks tool names and order but may overlook argument correctness. An agent can call the right tool with malformed parameters, causing downstream failures.

**Fix:** Extend trajectory validation with argument schema checks. Use `match_type: "IN_ORDER"` combined with custom post-processing that validates parameter types, ranges, and required fields before scoring.
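Because ADK's eval artifacts are already Pydantic models, Pydantic is a natural fit for this post-processing as well. The tool name and argument schema below are invented for illustration:

```python
from pydantic import BaseModel, Field, ValidationError


class ReserveStockArgs(BaseModel):
    """Hypothetical argument contract for a reserve_stock tool."""
    sku: str = Field(pattern=r"^SKU-\d+$")
    quantity: int = Field(gt=0, le=1000)
    order_id: str


def validate_tool_args(tool_calls: list[dict]) -> list[str]:
    """Return human-readable argument errors; an empty list means clean."""
    errors = []
    for call in tool_calls:
        if call["name"] == "reserve_stock":
            try:
                ReserveStockArgs(**call["args"])
            except ValidationError as exc:
                errors.append(f"{call['name']}: {exc.errors()}")
    return errors
```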
### 7. Missing Conversational Context in Eval Sets

**Explanation:** Single-turn eval sets ignore state carryover, memory retrieval, and multi-turn reasoning. Agents that excel in isolation often fail when context accumulates across turns.

**Fix:** Prioritize `.evalset.json` files with multi-turn conversations. Include at least 30% of eval cases that require context retention, correction handling, and state mutation.
## Production Bundle
### Action Checklist
- Seed session state in every eval case to eliminate context variance
- Use `IN_ORDER` trajectory matching unless strict compliance is mandated
- Implement tiered evaluation: deterministic checks on PR, semantic judges on nightly
- Cache judge model responses for identical prompt-response pairs to control costs (see the sketch after this list)
- Track score drift over time and adjust thresholds dynamically
- Validate tool arguments alongside trajectory sequencing
- Include multi-turn cases that test memory retention and state mutation
- Gate CI deployments on overall score thresholds, not individual case pass rates
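A minimal sketch of the judge-response cache mentioned in the checklist, keyed on a hash of the prompt-response pair. The `call_judge_model` callable stands in for whatever judge invocation your pipeline uses:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".judge_cache")  # simple on-disk cache shared across runs


def judge_with_cache(prompt: str, response: str, call_judge_model) -> float:
    """Skip the judge call entirely when this exact pair was scored before."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(f"{prompt}\x00{response}".encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["score"]
    score = call_judge_model(prompt, response)  # the expensive API call
    cache_file.write_text(json.dumps({"score": score}))
    return score
```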
### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Security-critical tool routing | `tool_trajectory_avg_score` with `EXACT` match | Guarantees step order and prevents unauthorized API calls | Near-zero |
| Open-ended summarization | `rubric_based_final_response_quality_v1` | Validates qualitative properties without requiring ground truth | Moderate (per-call judge) |
| Strict formatting/code generation | `response_match_score` (ROUGE-1) | Token overlap correlates strongly with schema compliance | Near-zero |
| User intent alignment | `final_response_match_v2` | LLM judge tolerates paraphrasing while preserving semantic equivalence | Moderate-High (per-call judge) |
| High-velocity CI pipeline | Deterministic criteria only + nightly semantic gates | Balances feedback speed with cost control | Low (PR) / Moderate (nightly) |
### Configuration Template
```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 0.85,
      "match_type": "IN_ORDER"
    },
    "final_response_match_v2": {
      "threshold": 0.90,
      "judge_model": "gemini-2.0-flash"
    },
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.80,
      "rubrics": [
        {
          "rubric_id": "factual_accuracy",
          "rubric_content": {
            "text_property": "All claims are verifiable and free of hallucination."
          }
        },
        {
          "rubric_id": "tone_consistency",
          "rubric_content": {
            "text_property": "Response maintains professional tone without unnecessary filler."
          }
        }
      ]
    }
  }
}
```
## Quick Start Guide
1. **Initialize project structure:** Create `tests/datasets/` for eval sets and `tests/config/` for scoring criteria. Place your agent module in the project root.
2. **Define your first eval case:** Write a `.evalset.json` file with one multi-turn conversation. Seed `session_input.state` with required context variables.
3. **Configure scoring criteria:** Create `quality_gates.json` with `tool_trajectory_avg_score` (threshold `0.85`, `IN_ORDER`) and `final_response_match_v2` (threshold `0.90`).
4. **Write the pytest runner:** Implement an async test function that loads the eval set and config, then calls `AgentEvaluator.evaluate_eval_set` with `raise_on_failure=True`.
5. **Execute and iterate:** Run `pytest`. Review trajectory mismatches and semantic scores. Adjust thresholds or rubrics based on observed drift. Scale to additional eval cases as coverage improves.
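If it helps, here is a one-shot scaffolding snippet for step 1, safe to re-run from the project root:

```python
from pathlib import Path

# Create the directory layout from step 1 (idempotent).
for directory in ("tests/datasets", "tests/config"):
    Path(directory).mkdir(parents=True, exist_ok=True)

# Empty placeholders for the eval set and scoring criteria.
Path("tests/datasets/logistics.evalset.json").touch()
Path("tests/config/quality_gates.json").touch()

print("Scaffold ready. Next: fill in the files, then run `pytest tests/`.")
```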
