# Building an Agent Evaluation Pipeline with Google ADK

*Beyond Pass/Fail: Structuring Agent Quality Gates with Google ADK*
## Current Situation Analysis
Traditional software engineering relies on deterministic contracts: given input X, the system must produce output Y. Unit tests, integration suites, and regression pipelines thrive on this predictability. Generative agents shatter that contract. Because large language models operate probabilistically, identical prompts can yield divergent reasoning paths, tool selections, and final outputs. Forcing deterministic test frameworks onto agent architectures creates false confidence. Teams ship features that pass local validation but fail unpredictably in production, masking behavioral drift behind surface-level correctness.
This gap is frequently overlooked because engineering teams conflate output accuracy with system reliability. An agent might return the correct weather forecast while internally invoking three irrelevant APIs, hallucinating intermediate states, or violating rate limits. Without process-aware validation, quality plateaus. Development cycles devolve into reactive patching rather than systematic improvement.
Industry analysis consistently points to a single root cause: the absence of structured evaluation pipelines. When teams skip agent-specific evals, three predictable failure modes emerge:
- **Regression cascades:** Fixing one behavioral flaw inadvertently triggers new failure modes elsewhere in the reasoning chain.
- **Blind spots in effectiveness:** Teams rely on subjective validation ("it feels right") rather than quantifiable performance metrics across task distributions.
- **Prompt inflation:** Engineers attempt to hardcode edge-case handling directly into system prompts, resulting in brittle, unmaintainable instruction sets that degrade model performance.
Robust evaluation is not a testing afterthought. It is the foundational feedback loop that transforms agent development from experimental prototyping into production-grade engineering. Google's Agent Development Kit (ADK) addresses this by decoupling evaluation into two distinct axes: trajectory validation (how the agent reaches a conclusion) and response validation (what the agent ultimately delivers). Mastering both axes is the difference between shipping a demo and shipping a reliable system.
## WOW Moment: Key Findings
Agent evaluation is not a single metric. It is a multi-dimensional scoring matrix. The critical insight is that trajectory analysis and response analysis measure fundamentally different properties, and combining them reveals failure modes that either axis misses in isolation.
| Evaluation Axis | Cost Profile | Latency Impact | Precision Level | Primary Use Case |
|---|---|---|---|---|
| Trajectory Matching | Near-zero (deterministic) | Milliseconds | High (structural) | Tool sequencing, API compliance, reasoning path validation |
| Token Overlap (ROUGE-1) | Near-zero (deterministic) | Milliseconds | Medium (lexical) | Strict formatting requirements, exact phrase matching |
| Semantic Judge (LLM-as-Judge) | API-dependent (per-call) | Seconds | High (contextual) | Open-ended responses, paraphrase tolerance, intent alignment |
| Rubric-Based Judge | API-dependent (per-call) | Seconds | High (qualitative) | Summarization, creative generation, multi-property validation |
**Why this matters:** Trajectory evaluation catches behavioral drift that output metrics completely miss. An agent can score 1.0 on semantic similarity while violating security policies, calling deprecated endpoints, or entering infinite tool loops. Conversely, perfect trajectory matching guarantees nothing about the final answer's usefulness. Production systems require a hybrid scoring strategy: trajectory gates for safety and compliance, combined with semantic or rubric-based judges for quality and user intent. This dual-axis approach enables automated quality gates that scale with model updates and prompt iterations.
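To make the dual-axis strategy concrete, here is a minimal sketch of a hybrid gate in plain Python. The floor values mirror the thresholds used later in this guide; the function and parameter names are illustrative, not part of the ADK API.

```python
# Minimal sketch of a dual-axis quality gate (illustrative, not an ADK API).
def passes_quality_gate(trajectory_score: float, semantic_score: float) -> bool:
    """Trajectory acts as a hard safety gate; semantics measure usefulness."""
    TRAJECTORY_FLOOR = 0.85  # structural/compliance gate (cheap, deterministic)
    SEMANTIC_FLOOR = 0.90    # intent-alignment gate (judge-model based)
    # Both axes must clear their floor: a perfect answer reached through a
    # forbidden tool path fails, and a compliant path that produces a
    # useless answer fails too.
    return trajectory_score >= TRAJECTORY_FLOOR and semantic_score >= SEMANTIC_FLOOR


print(passes_quality_gate(0.92, 0.95))  # True: both axes clear their floors
print(passes_quality_gate(0.40, 1.00))  # False: output is fine, path is not
```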
## Core Solution
Building an evaluation pipeline with ADK requires three distinct phases: constructing the test dataset, defining the scoring criteria, and orchestrating execution. The architecture prioritizes decoupling, async execution, and CI-native integration.
### Step 1: Construct the Evaluation Dataset
ADK supports two input formats: `.test.json` for single-turn sessions and `.evalset.json` for multi-turn conversations. For production agents, `.evalset.json` is the standard. It captures conversation history, expected responses, and initial session state.
```json
{
  "eval_set_id": "logistics_qa_v1",
  "name": "Supply Chain Assistant Evaluation",
  "description": "Multi-turn validation for inventory lookup and routing logic",
  "eval_cases": [
    {
      "eval_id": "inventory_check_with_context",
      "conversation": [
        {
          "user_content": {
            "parts": [{"text": "Check stock for SKU-8842 at warehouse B."}]
          },
          "final_response": {
            "parts": [{"text": "SKU-8842 has 142 units available at Warehouse B."}]
          }
        },
        {
          "user_content": {
            "parts": [{"text": "Reserve 20 units for order #9910."}]
          },
          "final_response": {
            "parts": [{"text": "Reservation confirmed. 122 units remaining."}]
          }
        }
      ],
      "session_input": {
        "app_name": "logistics_agent",
        "user_id": "eval_operator_01",
        "state": {
          "warehouse_region": "us-east-1",
          "auth_token": "mock_eval_token"
        }
      }
    }
  ]
}
```
**Architecture decision:** We seed `session_input.state` explicitly. Agents often depend on contextual variables (user roles, regional settings, cached credentials). Omitting state initialization forces the agent to reconstruct context from scratch, introducing variance that corrupts evaluation consistency.
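One lightweight way to enforce this is to inject a shared baseline state into every case before the file is consumed. The helper below is a hypothetical pre-processing step, not an ADK feature; the state keys mirror the example above.

```python
import json
from pathlib import Path

# Baseline context every eval case should run with (mirrors the example above).
BASELINE_STATE = {"warehouse_region": "us-east-1", "auth_token": "mock_eval_token"}


def seed_session_state(eval_set_path: Path) -> None:
    """Fill in missing state keys so no eval case runs without context."""
    eval_set = json.loads(eval_set_path.read_text(encoding="utf-8"))
    for case in eval_set["eval_cases"]:
        session_input = case.setdefault("session_input", {})
        # Per-case values win; the baseline only fills gaps.
        session_input["state"] = {**BASELINE_STATE, **session_input.get("state", {})}
    eval_set_path.write_text(json.dumps(eval_set, indent=2), encoding="utf-8")


# Usage: seed_session_state(Path("tests/datasets/logistics.evalset.json"))
```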
### Step 2: Define the Scoring Criteria
ADK provides eleven built-in criteria. Production pipelines typically combine structural checks with semantic judges. Criteria are defined in a separate configuration file to enable parallel iteration between test data and scoring logic.
```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 0.85,
      "match_type": "IN_ORDER"
    },
    "final_response_match_v2": {
      "threshold": 0.90,
      "judge_model": "gemini-2.0-flash"
    },
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.80,
      "rubrics": [
        {
          "rubric_id": "data_precision",
          "rubric_content": {
            "text_property": "The response includes exact numeric values without rounding or estimation."
          }
        },
        {
          "rubric_id": "action_clarity",
          "rubric_content": {
            "text_property": "The response explicitly states the next recommended step or confirms completion."
          }
        }
      ]
    }
  }
}
```
**Architecture decision:** We use `IN_ORDER` for trajectory matching instead of `EXACT`. Strict exact matching fails when models introduce harmless intermediate steps (e.g., logging calls, cache checks). `IN_ORDER` validates the critical path while tolerating implementation variance. We pair this with a semantic judge for response quality and a rubric-based judge for property validation. This triad covers safety, accuracy, and usability.
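The distinction is easy to see with a toy subsequence check. This is a simplification of what `IN_ORDER` tolerates, not ADK's actual scoring algorithm:

```python
# Simplified illustration of EXACT vs. IN_ORDER matching semantics.
def matches_in_order(expected: list[str], actual: list[str]) -> bool:
    """True if every expected step appears in `actual`, in the same order."""
    remaining = iter(actual)  # membership tests consume the iterator,
    return all(step in remaining for step in expected)  # which enforces order


expected = ["check_inventory", "reserve_stock"]
actual = ["check_inventory", "log_request", "reserve_stock"]  # harmless extra step

print(matches_in_order(expected, actual))  # True: the critical path is intact
print(expected == actual)                  # False: an EXACT-style match fails
```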
### Step 3: Orchestrate Execution
ADK exposes `AgentEvaluator` for programmatic execution. While the CLI (`adk eval`) and Web UI (`adk web`) exist, pytest integration is the production standard. It enables async execution, native CI/CD hooks, and explicit error reporting.
```python
import json
from pathlib import Path

import pytest

from google.adk.evaluation.agent_evaluator import AgentEvaluator
from google.adk.evaluation.eval_config import EvalConfig
from google.adk.evaluation.eval_set import EvalSet


@pytest.mark.asyncio
async def test_logistics_agent_quality_gates():
    eval_set_path = Path("tests/datasets/logistics.evalset.json")
    config_path = Path("tests/config/quality_gates.json")

    # Validate the raw JSON against the ADK schemas up front so schema
    # errors surface as load failures, not mid-run surprises.
    with eval_set_path.open("r", encoding="utf-8") as dataset_file:
        eval_set = EvalSet.model_validate(json.load(dataset_file))
    with config_path.open("r", encoding="utf-8") as config_file:
        eval_config = EvalConfig.model_validate(json.load(config_file))

    # Execute every eval case; raise_on_failure halts CI on degradation.
    execution_result = await AgentEvaluator.evaluate_eval_set(
        agent_module="logistics_agent.core",
        eval_set=eval_set,
        eval_config=eval_config,
        raise_on_failure=True,
    )

    assert execution_result.overall_score >= 0.85, (
        f"Agent quality gate failed. Score: {execution_result.overall_score}"
    )
```
**Architecture decision:** We use `evaluate_eval_set` instead of the higher-level `evaluate` wrapper. Explicit loading preserves visibility into the evaluation contract, simplifies debugging, and allows custom pre-processing (e.g., dynamic threshold injection, environment variable resolution). The `raise_on_failure=True` flag ensures CI pipelines halt on quality degradation, preventing regression deployment.
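As an example of the pre-processing that explicit loading enables, the sketch below overrides a judge threshold from a CI environment variable before validation. The `SEMANTIC_THRESHOLD` variable name and the helper itself are our own convention, not part of ADK:

```python
import json
import os
from pathlib import Path


def load_config_with_overrides(config_path: Path) -> dict:
    """Read raw criteria and apply CI-driven overrides before validation."""
    raw = json.loads(config_path.read_text(encoding="utf-8"))
    override = os.environ.get("SEMANTIC_THRESHOLD")  # e.g. set per-branch in CI
    if override is not None:
        raw["criteria"]["final_response_match_v2"]["threshold"] = float(override)
    return raw  # pass the result to EvalConfig.model_validate(...) as above
```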
## Pitfall Guide
### 1. Over-Indexing on Exact Trajectory Matching

**Explanation:** Using the `EXACT` match type forces the agent to replicate every intermediate step. Modern LLMs frequently insert logging, cache validation, or fallback routing that breaks exact matching despite correct behavior.

**Fix:** Switch to `IN_ORDER` for critical tool sequences. Reserve `EXACT` only for security-sensitive or compliance-critical workflows where step count and order are strictly mandated.
### 2. Ignoring Session State Initialization

**Explanation:** Agents often rely on contextual variables (user preferences, regional settings, cached tokens). Evaluating without seeding state forces the model to infer context, introducing non-determinism that corrupts scoring.

**Fix:** Always populate `session_input.state` in eval sets. Mock external dependencies where possible to isolate agent reasoning from infrastructure variance.
### 3. Treating ROUGE-1 as Semantic Truth

**Explanation:** `response_match_score` measures token overlap, not meaning. It fails on paraphrasing, synonym substitution, and structural reordering. A 0.95 ROUGE score can mask completely different user intent.

**Fix:** Use ROUGE-1 only for strict formatting requirements (e.g., JSON schema compliance, exact code generation). Pair it with `final_response_match_v2` for semantic validation.
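For intuition, here is a crude worked example of the overlap failure. The function below is a recall-style unigram approximation, not ADK's scorer, but the effect is the same: a faithful paraphrase lands around 0.62 and would miss a 0.90 gate.

```python
# Crude recall-oriented unigram overlap (illustrative, not ADK's ROUGE-1).
def unigram_recall(reference: str, candidate: str) -> float:
    ref = [t.strip(".,").lower() for t in reference.split()]
    cand = {t.strip(".,").lower() for t in candidate.split()}
    return sum(1 for token in ref if token in cand) / len(ref)


reference = "SKU-8842 has 142 units available at Warehouse B."
paraphrase = "Warehouse B currently stocks 142 units of SKU-8842."

# Same meaning, different surface form: only 5 of 8 reference tokens match.
print(round(unigram_recall(reference, paraphrase), 2))  # 0.62
```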
### 4. Unbounded LLM Judge Costs in CI

**Explanation:** Semantic and rubric-based criteria invoke external judge models. Running these on every commit in a high-velocity repository quickly accumulates API costs and pipeline latency.

**Fix:** Implement tiered evaluation. Run trajectory and ROUGE checks on every PR. Gate semantic/rubric judges behind nightly runs or manual triggers. Cache judge responses for identical prompt-response pairs.
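One way to express the tiering inside pytest itself is a custom marker, assuming a `nightly` marker is registered in `pytest.ini` and `pytest-asyncio` is installed. PR pipelines then run `pytest -m "not nightly"` while the scheduled job runs the full suite:

```python
import pytest


@pytest.mark.asyncio
async def test_trajectory_gates():
    ...  # deterministic trajectory + ROUGE criteria only: runs on every PR


@pytest.mark.nightly  # assumes the marker is registered in pytest.ini
@pytest.mark.asyncio
async def test_semantic_gates():
    ...  # judge-model criteria: incurs API cost, runs on a schedule
```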
### 5. Static Thresholds for Dynamic Models

**Explanation:** Hardcoding thresholds (e.g., 0.90) assumes model behavior is static. Upgrading base models, adjusting temperature, or modifying system prompts shifts score distributions.

**Fix:** Implement dynamic thresholding based on historical baselines. Track score drift over time and adjust gates proportionally. Use rolling averages rather than absolute cutoffs for semantic judges.
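A minimal sketch of such a rolling-baseline gate; the window size and tolerance below are illustrative starting points, not ADK recommendations:

```python
from collections import deque


class DriftGate:
    """Fail only when a score drops meaningfully below its recent average."""

    def __init__(self, window: int = 20, tolerance: float = 0.05):
        self.history: deque[float] = deque(maxlen=window)
        self.tolerance = tolerance

    def check(self, score: float) -> bool:
        # The first observation becomes its own baseline.
        baseline = sum(self.history) / len(self.history) if self.history else score
        self.history.append(score)
        return score >= baseline - self.tolerance
```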
### 6. Evaluating Tools Without Validating Arguments

**Explanation:** Trajectory matching checks tool names and order but may overlook argument correctness. An agent can call the right tool with malformed parameters, causing downstream failures.

**Fix:** Extend trajectory validation with argument schema checks. Use `match_type: "IN_ORDER"` combined with custom post-processing that validates parameter types, ranges, and required fields before scoring.
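Because ADK's eval artifacts are already Pydantic models, Pydantic is a natural fit for this post-processing as well. The tool name and argument schema below are invented for illustration:

```python
from pydantic import BaseModel, Field, ValidationError


class ReserveStockArgs(BaseModel):
    """Hypothetical argument contract for a reserve_stock tool."""
    sku: str = Field(pattern=r"^SKU-\d+$")
    quantity: int = Field(gt=0, le=1000)
    order_id: str


def validate_tool_args(tool_calls: list[dict]) -> list[str]:
    """Return human-readable argument errors; an empty list means clean."""
    errors = []
    for call in tool_calls:
        if call["name"] == "reserve_stock":
            try:
                ReserveStockArgs(**call["args"])
            except ValidationError as exc:
                errors.append(f"{call['name']}: {exc.errors()}")
    return errors
```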
### 7. Missing Conversational Context in Eval Sets

**Explanation:** Single-turn eval sets ignore state carryover, memory retrieval, and multi-turn reasoning. Agents that excel in isolation often fail when context accumulates across turns.

**Fix:** Prioritize `.evalset.json` files with multi-turn conversations. Include at least 30% of eval cases that require context retention, correction handling, and state mutation.
## Production Bundle
### Action Checklist
- Seed session state in every eval case to eliminate context variance
- Use `IN_ORDER` trajectory matching unless strict compliance is mandated
- Implement tiered evaluation: deterministic checks on PR, semantic judges on nightly
- Cache judge model responses for identical prompt-response pairs to control costs (see the sketch after this list)
- Track score drift over time and adjust thresholds dynamically
- Validate tool arguments alongside trajectory sequencing
- Include multi-turn cases that test memory retention and state mutation
- Gate CI deployments on overall score thresholds, not individual case pass rates
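A minimal sketch of the judge-response cache mentioned in the checklist, keyed on a hash of the prompt-response pair. The `call_judge_model` callable stands in for whatever judge invocation your pipeline uses:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".judge_cache")  # simple on-disk cache shared across runs


def judge_with_cache(prompt: str, response: str, call_judge_model) -> float:
    """Skip the judge call entirely when this exact pair was scored before."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(f"{prompt}\x00{response}".encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["score"]
    score = call_judge_model(prompt, response)  # the expensive API call
    cache_file.write_text(json.dumps({"score": score}))
    return score
```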
### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Security-critical tool routing | `tool_trajectory_avg_score` with `EXACT` match | Guarantees step order and prevents unauthorized API calls | Near-zero |
| Open-ended summarization | `rubric_based_final_response_quality_v1` | Validates qualitative properties without requiring ground truth | Moderate (per-call judge) |
| Strict formatting/code generation | `response_match_score` (ROUGE-1) | Token overlap correlates strongly with schema compliance | Near-zero |
| User intent alignment | `final_response_match_v2` | LLM judge tolerates paraphrasing while preserving semantic equivalence | Moderate-High (per-call judge) |
| High-velocity CI pipeline | Deterministic criteria only + nightly semantic gates | Balances feedback speed with cost control | Low (PR) / Moderate (nightly) |
### Configuration Template
```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 0.85,
      "match_type": "IN_ORDER"
    },
    "final_response_match_v2": {
      "threshold": 0.90,
      "judge_model": "gemini-2.0-flash"
    },
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.80,
      "rubrics": [
        {
          "rubric_id": "factual_accuracy",
          "rubric_content": {
            "text_property": "All claims are verifiable and free of hallucination."
          }
        },
        {
          "rubric_id": "tone_consistency",
          "rubric_content": {
            "text_property": "Response maintains professional tone without unnecessary filler."
          }
        }
      ]
    }
  }
}
```
## Quick Start Guide
1. **Initialize project structure:** Create `tests/datasets/` for eval sets and `tests/config/` for scoring criteria. Place your agent module in the project root.
2. **Define your first eval case:** Write a `.evalset.json` file with one multi-turn conversation. Seed `session_input.state` with required context variables.
3. **Configure scoring criteria:** Create `quality_gates.json` with `tool_trajectory_avg_score` (threshold `0.85`, `IN_ORDER`) and `final_response_match_v2` (threshold `0.90`).
4. **Write the pytest runner:** Implement an async test function that loads the eval set and config, then calls `AgentEvaluator.evaluate_eval_set` with `raise_on_failure=True`.
5. **Execute and iterate:** Run `pytest`. Review trajectory mismatches and semantic scores. Adjust thresholds or rubrics based on observed drift. Scale to additional eval cases as coverage improves.
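If it helps, here is a one-shot scaffolding snippet for step 1, safe to re-run from the project root:

```python
from pathlib import Path

# Create the directory layout from step 1 (idempotent).
for directory in ("tests/datasets", "tests/config"):
    Path(directory).mkdir(parents=True, exist_ok=True)

# Empty placeholders for the eval set and scoring criteria.
Path("tests/datasets/logistics.evalset.json").touch()
Path("tests/config/quality_gates.json").touch()

print("Scaffold ready. Next: fill in the files, then run `pytest tests/`.")
```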
