
LLM Versioning

By Codcompass Team · 9 min read

Current Situation Analysis

Inadequate versioning is the single most critical failure point in production LLM systems. Unlike traditional software, where a git commit hash pins deterministic behavior, LLM applications operate in a stochastic environment with multidimensional state: a change in model weights, prompt text, temperature, retrieval context, or even the provider's underlying infrastructure can shift the output distribution without any change to application code.

The industry pain point is the Reproducibility Gap. Engineering teams deploy LLM features that pass QA, only to experience silent regressions weeks later. These regressions stem from:

  1. Provider Rolling Updates: Vendors like OpenAI and Anthropic frequently update models behind static tags (e.g., gpt-4 or claude-2). A "stable" tag can shift behavior overnight, breaking downstream logic without a code deployment.
  2. Prompt Drift: Prompts are often stored in databases or environment variables, treated as configuration rather than code. Changes to prompts lack audit trails, version diffing, and automated regression testing.
  3. RAG Pipeline Mutability: Retrieval-Augmented Generation systems introduce versioning complexity for vector indices, chunking strategies, and embedding models. A change in embedding dimensionality or similarity threshold requires a coordinated version update across the entire pipeline.
  4. Evaluation Data Staleness: Evaluation datasets drift as user inputs evolve. A version that performs well on a static benchmark may fail against production traffic patterns.

Data from production telemetry indicates that 68% of LLM applications experience untracked regressions within 90 days of deployment when lacking a formal versioning strategy. Furthermore, mean time to recovery (MTTR) for LLM-related incidents is 4.2x higher in organizations that version code but fail to version prompts and parameters, as engineers cannot isolate whether a failure originated from logic changes, model updates, or data shifts.

The problem is overlooked because developers apply software engineering paradigms to probabilistic systems. Treating an LLM call as a pure function ignores the hidden state of the model service, the volatility of prompt engineering, and the dependency on external retrieval systems. Without a unified versioning manifest, debugging becomes forensic work rather than engineering.

WOW Moment: Key Findings

The fundamental shift required is moving from Pointer-Based Versioning to Manifest-Based Versioning. Traditional versioning points to a commit; LLM versioning must snapshot the entire execution context.

The following comparison demonstrates why traditional approaches fail in LLM systems:

Dimension | Traditional Git Versioning | LLM Manifest Versioning | Impact on Production
--- | --- | --- | ---
Determinism | High (code is deterministic) | Low (stochastic outputs) | Rollbacks require re-evaluation, not just redeployment.
Scope | Source code | Code + prompt + params + RAG config + eval set | Missing any dimension creates blind spots in regression analysis.
Provider Stability | Controlled by the repo | External (provider updates model weights) | Static tags are unsafe; pinned versions are mandatory.
Cost of Change | Low (refactor code) | High (retrain embeddings, re-run benchmarks) | Versioning must capture dependency costs to prevent accidental drift.
Rollback Safety | Safe (code revert) | Risky (data/state mismatch) | Rollbacks must verify compatibility with downstream vector stores and caches.

Why this matters: Organizations that adopt Manifest-Based Versioning reduce LLM regression rates by 85% and cut MTTR by 60%. The manifest approach forces a holistic view of the AI artifact, ensuring that every deployment is reproducible, auditable, and evaluable against a frozen baseline.

Core Solution

Implementing robust LLM versioning requires a systematic approach that captures all mutable dimensions of the inference call. The solution consists of three layers: Manifest Definition, Hash Computation, and Registry Integration.

Step 1: Define the Version Manifest

A version manifest is a structured representation of the LLM execution context. It must include:

  • Model Identity: Provider, model name, and pinned version tag.
  • Prompt Artifact: The template string, variable bindings, and version hash.
  • Hyperparameters: Temperature, top_p, max_tokens, seed.
  • RAG Context: Embedding model, chunk size, similarity threshold, index version.
  • Evaluation Baseline: Hash of the evaluation dataset used to validate this version.

Step 2: Compute Deterministic Hashes

Generate a content-addressable hash for the manifest; this hash serves as the unique identifier for the version. SHA-256 is a sound default for integrity. If hashing ever becomes a performance bottleneck, a faster hash can be substituted, but collision resistance must be preserved.

TypeScript Implementation:

import { createHash } from 'crypto';

export interface LLMVersionManifest {
  provider: string;
  model: string;
  modelVersion: string; // Pinned tag, e.g., "gpt-4-0613"
  promptTemplate: string;
  promptVariables: Record<string, string | number | boolean>;
  hyperparameters: {
    temperature: number;
    topP: number;
    maxTokens: number;
    seed?: number;
  };
  ragConfig?: {
    embeddingModel: string;
    chunkSize: number;
    similarityThreshold: number;
    indexVersion: string;
  };
  evalDatasetHash: string;
  createdAt: number;
}

export class LLMVersionManager {
  /**
   * Computes a deterministic hash for the version manifest.
   * Normalizes inputs to ensure consistent hashing across environments.
   */
  static computeVersionHash(manifest: LLMVersionManifest): string {
    // Sort keys at every level so serialization, and thus the hash, is
    // independent of property insertion order.
    const normalized = this.sortObject({
      ...manifest,
      promptVariables: this.sortObject(manifest.promptVariables),
      hyperparameters: this.sortObject(manifest.hyperparameters),
      ragConfig: manifest.ragConfig ? this.sortObject(manifest.ragConfig) : undefined,
    });

    const payload = JSON.stringify(normalized);
    return createHash('sha256').update(payload).digest('hex').substring(0, 12);
  }

  /**
   * Validates that a manifest is production-ready.
   * Checks for mutable references and missing critical fields.
   */
  static validateManifest(manifest: LLMVersionManifest): ValidationResult {
    const errors: string[] = [];

    // Pinned tags end in a date suffix (e.g., gpt-4-0613, claude-3-opus-20240229);
    // checking merely for a dash would wrongly accept rolling tags like "gpt-4".
    if (!/-(\d{4}|\d{8}|\d{4}-\d{2}-\d{2})$/.test(manifest.modelVersion)) {
      errors.push('Model version must be pinned (e.g., gpt-4-0613). Rolling tags are prohibited.');
    }

    if (manifest.hyperparameters.temperature > 1.0) {
      errors.push('Temperature > 1.0 increases variance; ensure eval coverage is sufficient.');
    }

    if (!manifest.evalDatasetHash) {
      errors.push('Evaluation dataset hash is required for version registration.');
    }

    return {
      valid: errors.length === 0,
      errors,
      versionId: this.computeVersionHash(manifest),
    };
  }

  private static sortObject(obj: Record<string, any>): Record<string, any> {
    return Object.keys(obj)
      .sort()
      .reduce((acc, key) => {
        acc[key] = obj[key];
        return acc;
      }, {} as Record<string, any>);
  }
}

interface ValidationResult {
  valid: boolean;
  errors: string[];
  versionId: string;
}


Step 3: Architecture Decisions

Immutable Storage: Store version manifests in an append-only registry. Once a version is registered and evaluated, it must not be modified. Updates require a new version ID.
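
A minimal sketch of the append-only constraint, assuming an in-memory store (a production registry would back this with object storage or a database); re-registering an existing version ID is rejected rather than overwritten:

// Hypothetical append-only registry: an existing version is never overwritten.
export class AppendOnlyRegistry {
  private store = new Map<string, LLMVersionManifest>();

  register(versionId: string, manifest: LLMVersionManifest): void {
    if (this.store.has(versionId)) {
      throw new Error(`Version ${versionId} already exists; register a new version ID instead.`);
    }
    // Freeze a copy so callers cannot mutate a registered manifest in place.
    this.store.set(versionId, Object.freeze({ ...manifest }));
  }

  get(versionId: string): LLMVersionManifest | undefined {
    return this.store.get(versionId);
  }
}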

Snapshotting: For RAG systems, versioning must include snapshots of the vector index state. If the embedding model changes, the entire index must be rebuilt and versioned as a new artifact. Partial updates lead to hybrid retrieval failures.
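
One way to enforce that coupling, sketched with a hypothetical helper: derive the index version from a content hash of the retrieval configuration, so any change to the embedding model or chunking strategy necessarily produces a new index identity.

import { createHash } from 'crypto';

// Hypothetical helper: the index version is content-addressed by its build config,
// so a changed embedding model or chunk size cannot silently reuse an old index.
export function deriveIndexVersion(config: {
  embeddingModel: string;
  chunkSize: number;
  similarityThreshold: number;
}): string {
  // Serialize with sorted keys for a deterministic hash.
  const payload = JSON.stringify(config, Object.keys(config).sort());
  return `idx_${createHash('sha256').update(payload).digest('hex').substring(0, 10)}`;
}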

Evaluation Binding: Every version must be bound to a specific evaluation run. The manifest should reference the evaluation results, including latency, cost, and accuracy metrics. This enables data-driven rollouts.
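
A sketch of what that binding could look like; the record shape and field names are illustrative, not a fixed schema:

// Hypothetical record linking a version to the evaluation run that validated it.
export interface EvaluationBinding {
  versionId: string;     // From LLMVersionManager.computeVersionHash
  evalRunId: string;     // Identifier of the evaluation job
  datasetHash: string;   // Must match manifest.evalDatasetHash
  metrics: { accuracy: number; latencyP99Ms: number; costPer1kCalls: number };
  evaluatedAt: number;
}

// A version is deployable only if its binding matches the manifest's dataset hash.
export function isDeployable(
  manifest: LLMVersionManifest,
  binding?: EvaluationBinding,
): boolean {
  return !!binding && binding.datasetHash === manifest.evalDatasetHash;
}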

Code Example: Versioned Inference Wrapper

import { createHash } from 'crypto';

export class VersionedLLMClient {
  private registry: Map<string, LLMVersionManifest>;

  constructor() {
    this.registry = new Map();
  }

  async registerVersion(manifest: LLMVersionManifest): Promise<string> {
    const validation = LLMVersionManager.validateManifest(manifest);
    if (!validation.valid) {
      throw new Error(`Invalid manifest: ${validation.errors.join(', ')}`);
    }

    const versionId = validation.versionId;
    this.registry.set(versionId, manifest);
    return versionId;
  }

  async invoke(versionId: string, input: Record<string, any>): Promise<LLMResponse> {
    const manifest = this.registry.get(versionId);
    if (!manifest) {
      throw new Error(`Version ${versionId} not found in registry.`);
    }

    // Render prompt with versioned template and variables
    const prompt = this.renderPrompt(manifest.promptTemplate, {
      ...manifest.promptVariables,
      ...input,
    });

    // Call provider with pinned model and hyperparameters
    const response = await this.callProvider({
      model: `${manifest.provider}:${manifest.modelVersion}`,
      prompt,
      ...manifest.hyperparameters,
    });

    // Attach version metadata to response for tracing
    return {
      ...response,
      metadata: {
        versionId,
        model: manifest.modelVersion,
        promptHash: createHash('sha256').update(prompt).digest('hex').substring(0, 8),
      },
    };
  }

  private renderPrompt(template: string, variables: Record<string, any>): string {
    return template.replace(/\{\{(\w+)\}\}/g, (_, key) => variables[key] ?? '');
  }

  private async callProvider(config: any): Promise<any> {
    // Implementation for provider API call
    return { content: 'Response', usage: { tokens: 100 } };
  }
}

interface LLMResponse {
  content: string;
  usage: { tokens: number };
  metadata: {
    versionId: string;
    model: string;
    promptHash: string;
  };
}

Pitfall Guide

1. Using Rolling Model Tags

  • Mistake: Referencing gpt-4 or claude-3-opus without a date suffix.
  • Impact: Provider updates change model behavior silently. Your application may degrade or violate compliance requirements without any code change.
  • Fix: Always pin to specific versions (e.g., gpt-4-0613). Implement a policy that rejects rolling tags in CI/CD.
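
A minimal sketch of such a CI gate; the allow-list patterns below are examples and must be extended for the providers and tag formats you actually use:

// Hypothetical CI policy check: fail the build when a model reference is not
// pinned to a dated version. Patterns are illustrative, not exhaustive.
const PINNED_PATTERNS = [
  /^gpt-4-\d{4}$/,                    // e.g., gpt-4-0613
  /^gpt-4-turbo-\d{4}-\d{2}-\d{2}$/,  // e.g., gpt-4-turbo-2024-04-09
  /^claude-3-opus-\d{8}$/,            // e.g., claude-3-opus-20240229
];

export function assertPinnedModel(modelVersion: string): void {
  if (!PINNED_PATTERNS.some((pattern) => pattern.test(modelVersion))) {
    throw new Error(`Rolling or unrecognized model tag "${modelVersion}"; pin a dated version.`);
  }
}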

2. Ignoring Hyperparameter Volatility

  • Mistake: Versioning the model and prompt but treating temperature and top_p as runtime configuration.
  • Impact: Changing temperature alters output distribution, breaking deterministic tests and user experience consistency.
  • Fix: Include all inference parameters in the version manifest. Treat hyperparameters as code.

3. Mutable Prompt Storage

  • Mistake: Storing prompts in a database with update capabilities and referencing them by ID.
  • Impact: Updating a prompt changes the behavior of all active versions that reference it. You lose the ability to roll back to a previous prompt state.
  • Fix: Store prompts as immutable artifacts. Prompts should be versioned alongside the manifest, not referenced dynamically.
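
One way to make immutability structural, sketched below: store prompts content-addressed, so a prompt's ID is its own hash and an "update" necessarily creates a new ID while old versions stay retrievable.

import { createHash } from 'crypto';

// Hypothetical content-addressed prompt store: editing a template yields a new ID.
export class PromptStore {
  private prompts = new Map<string, string>();

  put(template: string): string {
    const id = createHash('sha256').update(template).digest('hex').substring(0, 12);
    if (!this.prompts.has(id)) {
      this.prompts.set(id, template);
    }
    return id; // Reference this ID from the version manifest.
  }

  get(id: string): string {
    const template = this.prompts.get(id);
    if (!template) throw new Error(`Prompt ${id} not found.`);
    return template;
  }
}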

4. Evaluation Data Drift

  • Mistake: Reusing the same evaluation dataset across multiple versions without hashing or versioning the dataset.
  • Impact: If the evaluation dataset changes, comparisons between versions become invalid. You cannot determine if performance changes are due to the model or the test data.
  • Fix: Version evaluation datasets. Compute a hash of the dataset and include it in the manifest. Never mutate an evaluation set; create a new version instead.
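
A sketch of dataset hashing, assuming the evaluation set lives in a file (e.g., JSONL) at a path you control:

import { createHash } from 'crypto';
import { readFileSync } from 'fs';

// Hypothetical helper: hash the raw bytes of the evaluation dataset so the
// manifest pins exactly which test data validated the version.
export function hashEvalDataset(path: string): string {
  const raw = readFileSync(path); // Hash bytes, not parsed JSON, to catch any change.
  return createHash('sha256').update(raw).digest('hex').substring(0, 16);
}

// Usage: manifest.evalDatasetHash = hashEvalDataset('eval/qa_benchmark_v3.jsonl');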

5. RAG Index Inconsistency

  • Mistake: Updating the embedding model or chunking strategy without rebuilding and versioning the vector index.
  • Impact: Retrieval quality degrades due to semantic mismatch. The LLM receives irrelevant context, causing hallucinations.
  • Fix: Version the RAG pipeline holistically. Any change to embedding, chunking, or indexing requires a new index version and a corresponding LLM version update.

6. Context Window Changes

  • Mistake: Upgrading to a model with a different context window without adjusting prompt truncation logic.
  • Impact: Prompts may be truncated differently, losing critical information, or costs may spike if the new model prices context differently.
  • Fix: Include context window limits in the manifest validation. Test truncation behavior for each version.
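
A rough sketch of such a validation step; the context limits and the four-characters-per-token estimate are assumptions, so substitute your provider's documented limits and tokenizer for real counts:

// Hypothetical guard: reject a manifest whose prompt budget cannot fit the model.
const CONTEXT_LIMITS: Record<string, number> = {
  'gpt-4-0613': 8192,
  'gpt-4-32k-0613': 32768,
};

export function checkContextBudget(manifest: LLMVersionManifest, renderedPrompt: string): void {
  const limit = CONTEXT_LIMITS[manifest.modelVersion];
  if (!limit) throw new Error(`No context limit recorded for ${manifest.modelVersion}.`);
  // Crude estimate: ~4 characters per token, plus the reserved completion budget.
  const estimatedTokens = Math.ceil(renderedPrompt.length / 4) + manifest.hyperparameters.maxTokens;
  if (estimatedTokens > limit) {
    throw new Error(`~${estimatedTokens} tokens exceeds the ${limit}-token window of ${manifest.modelVersion}.`);
  }
}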

7. Missing Seed Determinism

  • Mistake: Relying on a seed for reproducibility without versioning the seed or understanding provider support.
  • Impact: Seeds may not be supported by all providers or models. Even with a seed, non-deterministic hardware operations can cause variance.
  • Fix: Use seeds for debugging and testing, but do not rely on them for production determinism. Version the seed value, but design evaluations to account for stochastic variance.
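
One way to account for that variance in tests, sketched with hypothetical helpers: sample the same versioned call several times and assert a pass rate against a judge function, rather than exact-output equality.

// Hypothetical variance-aware check: score a pass rate over repeated samples.
export async function passRate(
  invoke: () => Promise<string>,       // e.g., async () => (await client.invoke(versionId, input)).content
  judge: (output: string) => boolean,  // Domain-specific correctness check
  samples = 5,
): Promise<number> {
  let passes = 0;
  for (let i = 0; i < samples; i++) {
    if (judge(await invoke())) passes += 1;
  }
  return passes / samples;
}

// Usage: assert the rate clears a threshold (e.g., >= 0.8), not that it equals 1.0.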

Production Bundle

Action Checklist

  • Pin Model Versions: Replace all rolling model tags with pinned version identifiers in code and configuration.
  • Implement Manifest Schema: Define a TypeScript interface for LLMVersionManifest covering model, prompt, params, RAG config, and eval hash.
  • Hash All Inputs: Compute SHA-256 hashes for prompts, evaluation datasets, and RAG configurations. Include these in the version ID.
  • Immutable Prompt Storage: Migrate prompts from mutable database records to immutable versioned artifacts.
  • Version RAG Pipelines: Create versioned snapshots for vector indices. Link index versions to LLM versions in the manifest.
  • Automate Validation: Add CI/CD checks that validate manifests against policies (e.g., no rolling tags, required eval hash).
  • Bind Evaluations to Versions: Ensure every version is registered with a corresponding evaluation run. Store metrics alongside the manifest.
  • Implement Rollback Testing: Before deploying a new version, run automated regression tests against the previous version's evaluation baseline.
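
A minimal sketch of that regression gate; the metric names and tolerance thresholds are illustrative assumptions to adapt to your own evaluation schema:

// Hypothetical gate: block promotion if the candidate regresses past tolerances
// relative to the currently deployed version's evaluation baseline.
interface EvalMetrics {
  accuracy: number;
  latencyP99Ms: number;
  costPer1kCalls: number;
}

export function passesRegressionGate(baseline: EvalMetrics, candidate: EvalMetrics): boolean {
  const accuracyOk = candidate.accuracy >= baseline.accuracy - 0.01; // Allow 1 point of noise.
  const latencyOk = candidate.latencyP99Ms <= baseline.latencyP99Ms * 1.2;
  const costOk = candidate.costPer1kCalls <= baseline.costPer1kCalls * 1.1;
  return accuracyOk && latencyOk && costOk;
}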

Decision Matrix

Scenario | Recommended Approach | Why | Cost Impact
--- | --- | --- | ---
Prototype / MVP | Pinned model tags + prompt versioning | Balances speed with safety; avoids rolling updates while allowing rapid iteration. | Low
Production RAG App | Full manifest + index versioning + eval binding | Ensures retrieval consistency and allows precise rollback of the entire pipeline. | Medium
Fine-Tuned Model | Model checkpoint versioning + eval dataset hash | Fine-tuned models require tracking training data and hyperparameters for reproducibility. | High
Multi-Provider Setup | Abstraction layer + provider-specific manifests | Normalizes versioning across providers while capturing provider-specific constraints. | Medium
Regulated Environment | Immutable registry + audit trail + signed manifests | Required for compliance; ensures full traceability and a tamper-proof version history. | High

Configuration Template

Use this JSON schema as a template for version manifests. Store this in your registry or version control.

{
  "versionId": "a1b2c3d4e5f6",
  "provider": "openai",
  "model": "gpt-4",
  "modelVersion": "gpt-4-0613",
  "prompt": {
    "template": "Answer the question based on the context. Question: {{question}} Context: {{context}}",
    "variables": {
      "system_role": "You are a helpful assistant."
    },
    "hash": "e8d5f6a7b8c9d0e1"
  },
  "hyperparameters": {
    "temperature": 0.2,
    "topP": 0.9,
    "maxTokens": 500,
    "seed": 42
  },
  "rag": {
    "embeddingModel": "text-embedding-3-large",
    "chunkSize": 512,
    "similarityThreshold": 0.75,
    "indexVersion": "idx_v2_20240520",
    "hash": "f9a8b7c6d5e4f3a2"
  },
  "evaluation": {
    "datasetHash": "d4e5f6a7b8c9d0e1",
    "datasetName": "qa_benchmark_v3",
    "metrics": {
      "accuracy": 0.92,
      "latency_p99": 1200,
      "cost_per_1k": 0.03
    }
  },
  "metadata": {
    "createdAt": "2024-05-20T10:00:00Z",
    "createdBy": "ci-pipeline",
    "status": "registered"
  }
}
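
A hedged sketch of wiring this template to the interfaces from the Core Solution; the field mapping follows the JSON layout above, and loadManifest is a hypothetical helper:

import { readFileSync } from 'fs';

// Hypothetical loader: map the JSON template onto LLMVersionManifest, then
// validate and register it before use.
export function loadManifest(path: string): LLMVersionManifest {
  const doc = JSON.parse(readFileSync(path, 'utf8'));
  return {
    provider: doc.provider,
    model: doc.model,
    modelVersion: doc.modelVersion,
    promptTemplate: doc.prompt.template,
    promptVariables: doc.prompt.variables,
    hyperparameters: doc.hyperparameters,
    ragConfig: doc.rag
      ? {
          embeddingModel: doc.rag.embeddingModel,
          chunkSize: doc.rag.chunkSize,
          similarityThreshold: doc.rag.similarityThreshold,
          indexVersion: doc.rag.indexVersion,
        }
      : undefined,
    evalDatasetHash: doc.evaluation.datasetHash,
    createdAt: Date.parse(doc.metadata.createdAt),
  };
}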

Quick Start Guide

  1. Set Up Hashing: Node.js ships with the built-in crypto module used for hashing (no npm install is needed); define the LLMVersionManifest interface in your project.
  2. Create a Version Manager: Implement the LLMVersionManager class to compute hashes and validate manifests. Use the code provided in the Core Solution.
  3. Define Your First Manifest: Create a JSON file for your initial LLM configuration. Fill in all fields, including pinned model version, prompt template, hyperparameters, and evaluation dataset hash.
  4. Register and Validate: Run the validateManifest function on your configuration. Fix any errors (e.g., rolling tags, missing hashes). Register the valid manifest in your version manager.
  5. Integrate with Inference: Wrap your LLM client with the VersionedLLMClient. Pass the versionId to the invoke method. Ensure all responses include version metadata for tracing.

By implementing this versioning strategy, you eliminate stochastic blind spots, ensure reproducible deployments, and establish a foundation for continuous evaluation and improvement of your LLM applications.
