Standardizing LLM Integration in Ruby: A Provider-Agnostic Architecture

Current Situation Analysis

The modern AI ecosystem is characterized by rapid model iteration and fragmented API standards. While the industry celebrates breakthroughs in reasoning, vision, and code generation, the underlying integration layer remains notoriously inconsistent. Development teams frequently assume that "OpenAI-compatible" endpoints guarantee drop-in interoperability. In practice, this assumption introduces severe architectural debt.

The core friction lies in response normalization. Providers diverge across multiple dimensions:

Streaming protocols: Server-Sent Events (SSE), newline-delimited JSON (NDJSON), raw text deltas over chunked HTTP, or partial JSON fragments
Lifecycle signals: Completion markers vary between finish_reason: "stop", done: true, is_last_chunk: true, or explicit empty payloads
Error schemas: Rate limits, context window overflows, and model unavailability return different HTTP status codes and payload structures
Retry semantics: Some endpoints expect idempotency keys, others rely on exponential backoff, and many silently drop connections without proper closure
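
The lifecycle markers listed above can be collapsed into a single predicate. The following is a minimal sketch, assuming each chunk has already been parsed into a Hash with string keys; the method name stream_finished? and the exact payload shapes are illustrative and not taken from any provider's documentation.

```ruby
# Hypothetical helper: returns true when a parsed chunk signals end of stream.
# Field names follow the markers listed above; real payloads may nest them differently.
def stream_finished?(chunk)
  return true if chunk.nil? || chunk.empty?       # explicit empty payload
  return true if chunk["done"] == true            # done: true
  return true if chunk["is_last_chunk"] == true   # is_last_chunk: true

  # OpenAI-style payloads usually carry finish_reason inside the first choice.
  finish_reason = chunk["finish_reason"] || chunk.dig("choices", 0, "finish_reason")
  !finish_reason.nil?
end
```

Keeping this check in one place means stream consumers never inspect provider-specific fields themselves.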

When developers treat LLM integration as a simple HTTP wrapper, application logic becomes tightly coupled to provider-specific parsing, error handling, and streaming mechanics. This coupling creates three compounding problems:

Vendor lock-in: Swapping models requires rewriting parsing logic, retry policies, and stream consumers
Fragile observability: Metrics, tracing, and logging must be duplicated across every provider implementation
Development friction: Local runtimes like Ollama and LM Studio introduce offline workflows that clash with hosted API assumptions

The industry overlooks this because prompt engineering and model selection dominate technical discussions. Integration infrastructure is treated as a transient concern rather than a foundational layer. Yet, production systems that handle thousands of daily inferences quickly expose the cost of unnormalized AI calls. The solution requires treating LLM integration with the same architectural rigor applied to database drivers or message queue clients.

WOW Moment: Key Findings

The architectural shift from direct API consumption to a unified gateway layer fundamentally changes where complexity lives. Instead of scattering provider-specific logic across controllers, services, and background jobs, complexity concentrates in a single, testable infrastructure module.

Approach	Implementation Complexity	Streaming Overhead	Vendor Lock-in Risk	Error Coverage	Maintenance Burden
Direct HTTP Integration	High (per-provider)	High (manual parsing)	Critical	Fragmented	Linear growth
Unified Gateway Architecture	Medium (initial)	Low (normalized)	Minimal	Centralized	Constant

This finding matters because it enables three critical production capabilities:

Seamless provider rotation: Switch between Ollama for local development, DeepSeek for cost optimization, and hosted APIs for peak demand without touching business logic
Consistent observability: Single point for latency tracking, token accounting, and error classification
Predictable streaming: Application code consumes a uniform ResponseEnvelope regardless of whether the underlying provider uses SSE, NDJSON, or raw chunked transfer encoding

The abstraction layer transforms AI integration from a recurring integration task into a stable platform capability.

Core Solution

Building a provider-agnostic LLM layer requires three architectural pillars: a registry-based adapter system, an enumerator-driven streaming processor, and a middleware stack for retries and observability. The implementation follows Ruby idioms while maintaining strict separation between transport logic and application concerns.

Step 1: Define the Unified Interface

The gateway exposes two primary entry points: synchronous completion and streaming completion. Both return a normalized response envelope that abstracts provider differences.

module LlmGateway
  class << self
    def complete(params)
      adapter = ProviderRegistry.resolve(params[:provider])
      request = RequestBuilder.new(adapter, params)
      ResponseParser.parse(adapter.execute(request))
    end

    def stream(params, &block)
      adapter = ProviderRegistry.resolve(params[:provider])
      request = RequestBuilder.new(adapter, params)
      StreamProcessor.new(adapter, request, &block).run
    end
  end
end
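
Both entry points return a ResponseEnvelope, which the article does not define. One possible shape is sketched below, assuming the fields named later in the rationale (content, provider name, finish reason, tokens used); the field names and the finished? helper are assumptions.

```ruby
# Hypothetical envelope: the field set is an assumption based on the metadata
# the article mentions (content, provider, finish reason, token usage).
ResponseEnvelope = Struct.new(:content, :provider, :finish_reason, :tokens_used, keyword_init: true) do
  # True once the provider has reported why generation stopped.
  def finished?
    !finish_reason.nil?
  end
end

# Intermediate chunks carry only content; omitted fields default to nil.
chunk = ResponseEnvelope.new(content: "Hel", provider: :ollama)
final = ResponseEnvelope.new(content: "lo", provider: :ollama, finish_reason: "stop", tokens_used: 12)
```

A Struct with keyword_init keeps the envelope cheap to build per chunk while still accepting the ResponseEnvelope.new(content: ..., provider: ...) call used by the stream processor.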

Step 2: Implement the Provider Registry

The registry decouples application code from concrete adapter implementations. It uses a strategy pattern to route requests to the correct HTTP client and parser.

class ProviderRegistry
  ADAPTERS = {
    ollama: OllamaAdapter,
    lm_studio: LmStudioAdapter,
    deepseek: DeepSeekAdapter,
    openai: OpenAiCompatibleAdapter
  }.freeze

  def self.resolve(provider_key)
    adapter_class = ADAPTERS.fetch(provider_key) do
      raise UnknownProviderError, "No adapter registered for #{provider_key}"
    end
    adapter_class.new
  end
end

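Each adapter implements a minimal contract: build_url and build_headers for request construction, execute for synchronous calls, and stream, parse_line, and name for the streaming path used in Step 3. This contract ensures transport consistency while allowing provider-specific optimizations.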
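
To make the streaming side of the contract concrete, here is a sketch of an in-memory adapter that provides the three methods the stream processor calls (stream, parse_line, name). FakeNdjsonAdapter is a hypothetical class, and the {"message": {"content": ...}} payload shape is an assumption rather than a documented provider format.

```ruby
require "json"

# Hypothetical in-memory adapter. It replays canned chunks instead of opening
# an HTTP connection, which also makes it usable as a test double.
class FakeNdjsonAdapter
  def initialize(chunks)
    @chunks = chunks
  end

  def name
    :fake_ndjson
  end

  # Yields raw chunks as a transport would, split at arbitrary byte positions.
  def stream(_request)
    @chunks.each { |chunk| yield chunk }
  end

  # Maps one complete line to a content string, or nil when there is nothing to emit.
  def parse_line(line)
    return nil if line.strip.empty?

    payload = JSON.parse(line)
    payload.is_a?(Hash) ? payload.dig("message", "content") : nil
  rescue JSON::ParserError
    nil
  end
end

# The second JSON object is deliberately split across two chunks.
adapter = FakeNdjsonAdapter.new(["{\"message\":{\"content\":\"Hel\"}}\n{\"mess", "age\":{\"content\":\"lo\"}}\n"])
```

An adapter like this can stand in for a real provider in tests, since it exercises line buffering without any network access.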

Step 3: Normalize Streaming with Enumerators

Streaming is the most complex integration surface. Ruby's Enumerator offers lazy, pull-based iteration: the consumer decides when the next chunk is produced, which makes the flow easier to reason about than nested callbacks in production workloads.

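class StreamProcessor
  def initialize(adapter, request, &block)
    @adapter = adapter
    @request = request
    @consumer = block
    @buffer = String.new
  end

  # With a block, drives the stream to completion and calls the block once per
  # envelope. Without a block, returns a lazy Enumerator for the caller to iterate.
  def run
    stream = Enumerator.new do |yielder|
      @adapter.stream(@request) do |raw_chunk|
        @buffer << raw_chunk
        drain_lines { |envelope| yielder << envelope }
      end
      # Flush a final line that arrived without a trailing newline.
      @buffer << "\n" unless @buffer.empty?
      drain_lines { |envelope| yielder << envelope }
    end
    return stream unless @consumer

    stream.each { |envelope| @consumer.call(envelope) }
  end

  private

  # Consumes every complete line in the buffer. Lines the adapter maps to nil
  # (blank SSE separators, keep-alives) are skipped instead of ending the loop.
  def drain_lines
    while (newline_index = @buffer.index("\n"))
      line = @buffer.slice!(0..newline_index).chomp
      token = @adapter.parse_line(line)
      next if token.nil?

      yield ResponseEnvelope.new(content: token, provider: @adapter.name)
    end
  end
end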

The processor accumulates raw bytes, splits them on newlines, and hands each complete line to the adapter's parse_line method, which returns the text to wrap in a normalized ResponseEnvelope. Because a line is only parsed once it is complete, the same loop handles SSE data prefixes, NDJSON formatting, and JSON objects that arrive split across chunks.

Step 4: Add Retry Middleware with Idempotency

Network instability and rate limits require robust retry logic. The middleware layer intercepts retryable failures, applies exponential backoff, and gives up after a fixed number of attempts.

class RetryMiddleware
  MAX_ATTEMPTS = 3
  BASE_DELAY = 0.5

  def initialize(adapter)
    @adapter = adapter
  end

  def execute(request)
    attempt = 0
    begin
      attempt += 1
      @adapter.execute(request)
    rescue RateLimitError, ServiceUnavailableError => e
      raise e if attempt >= MAX_ATTEMPTS
      delay = BASE_DELAY * (2 ** (attempt - 1))
      sleep(delay)
      retry
    end
  end
end

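The middleware retries only the normalized RateLimitError and ServiceUnavailableError classes, so each adapter is responsible for translating provider-specific error codes into them; the retry policy itself stays uniform. Idempotency keys should be injected at the request builder level to prevent duplicate completions during retries.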
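
Generating the key once, when the request is built, is what lets every retry attempt reuse it. The sketch below assumes a hypothetical IdempotentRequest class and an "Idempotency-Key" header; whether a given provider honors that header is an assumption to verify against its documentation.

```ruby
require "securerandom"

# Hypothetical request object. The "Idempotency-Key" header name is an
# assumption; confirm that the target provider honors it before relying on it.
class IdempotentRequest
  attr_reader :payload, :headers

  # The key is generated once here, not per attempt, so retries share it.
  def initialize(payload:, idempotency_key: SecureRandom.uuid)
    @payload = payload
    @headers = { "Idempotency-Key" => idempotency_key }.freeze
  end
end

request = IdempotentRequest.new(payload: { model: "llama3.2", messages: [] })
first_attempt_key  = request.headers["Idempotency-Key"]
second_attempt_key = request.headers["Idempotency-Key"] # same object, same key
```

Because RetryMiddleware#execute passes the same request object on every attempt, no extra bookkeeping is needed inside the retry loop.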

Architecture Rationale

Registry over inheritance: Composition allows runtime provider switching without class hierarchy constraints
Enumerator over callbacks: Lazy evaluation prevents memory bloat during long streams and enables backpressure
Envelope pattern: Normalizes metadata (tokens used, finish reason, provider name) alongside content
Middleware stack: Separates cross-cutting concerns (retries, logging, tracing) from transport logic

This architecture mirrors the adapter pattern used by database drivers and ORMs, treating LLM providers as pluggable data sources rather than external dependencies.

Pitfall Guide

1. Assuming Uniform Streaming Formats

Explanation: Developers often write parsers that expect a single chunk structure. Providers mix SSE prefixes (data: ), NDJSON objects, and raw text deltas. Fix: Implement a line-buffered parser that strips protocol prefixes before JSON deserialization. Validate chunk structure before processing.
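
As an illustration of the prefix-stripping step, the sketch below parses a single SSE line. It assumes OpenAI-style chat completion chunks, where the text delta lives under choices[0].delta.content and the stream ends with a [DONE] sentinel; parse_sse_line is a hypothetical helper name.

```ruby
require "json"

# Returns the text delta carried by one SSE line, or nil when the line has none.
# Assumes chunks shaped like: data: {"choices":[{"delta":{"content":"..."}}]}
def parse_sse_line(line)
  line = line.strip
  return nil unless line.start_with?("data:")     # skips blank separators and ":" comments

  data = line.delete_prefix("data:").strip
  return nil if data.empty? || data == "[DONE]"   # end-of-stream sentinel

  payload = JSON.parse(data)
  payload.is_a?(Hash) ? payload.dig("choices", 0, "delta", "content") : nil
rescue JSON::ParserError
  nil # malformed or truncated line: drop it rather than crash the stream
end
```

An adapter's parse_line could delegate to a helper like this, leaving buffering to the stream processor.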

2. Blocking the Main Thread During Stream Consumption

Explanation: Synchronous stream consumers block the calling thread, causing request timeouts in web frameworks. Fix: Use Enumerator with lazy evaluation. Offload heavy processing to background workers. Implement backpressure by pausing the enumerator when downstream systems are saturated.
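
The pull-based behavior behind this fix can be seen in isolation. In the sketch below the producer block advances only when the consumer calls next; whether that also pauses the upstream HTTP read depends on how the adapter's transport is written, which is outside what this example shows.

```ruby
produced = []

# The producer block only advances when the consumer asks for the next value.
tokens = Enumerator.new do |yielder|
  %w[alpha beta gamma].each do |token|
    produced << token # records how far the producer has run
    yielder << token
  end
end

first = tokens.next                    # producer runs to its first yield, then pauses
produced_after_first = produced.dup

second = tokens.next                   # resumes the producer for exactly one more token
produced_after_second = produced.dup
```

A consumer that stops calling next while a downstream system is saturated leaves the producer suspended, which is the pausing behavior described above.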

3. Ignoring Idempotency in Retry Logic

Explanation: Retrying non-idempotent requests duplicates completions, inflating token costs and causing inconsistent state. Fix: Generate UUID-based idempotency keys at the request builder level. Pass keys through headers. Cache successful responses to prevent duplicate execution.

4. Hardcoding Provider-Specific Timeouts

Explanation: Local runtimes like Ollama require longer timeouts than hosted APIs. Global timeout settings cause premature failures or unnecessary delays. Fix: Configure timeouts per provider in the registry. Use connection timeout for handshake and read timeout for streaming. Implement circuit breakers for degraded providers.
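
With Ruby's standard Net::HTTP client, the connect/read/write split maps onto open_timeout, read_timeout, and write_timeout. The sketch below assumes a hypothetical PROVIDER_TIMEOUTS table and build_http_client helper; the values are placeholders.

```ruby
require "net/http"
require "uri"

# Hypothetical per-provider settings mirroring the connect/read/write split.
PROVIDER_TIMEOUTS = {
  ollama: { connect: 5, read: 60, write: 10 },
  openai: { connect: 5, read: 15, write: 10 }
}.freeze

# Builds an unopened Net::HTTP client; no connection is made until a request is sent.
def build_http_client(base_url, timeouts)
  uri = URI.parse(base_url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = uri.scheme == "https"
  http.open_timeout  = timeouts.fetch(:connect) # connection handshake
  http.read_timeout  = timeouts.fetch(:read)    # max wait per read, relevant between stream chunks
  http.write_timeout = timeouts.fetch(:write)   # sending the request body
  http
end

client = build_http_client("http://localhost:11434", PROVIDER_TIMEOUTS.fetch(:ollama))
```

Note that read_timeout applies to each individual read, so for streams it bounds the gap between chunks rather than the total response time.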

5. Neglecting Token Boundary Awareness

Explanation: Streaming chunks can end in the middle of a multi-byte UTF-8 character, causing corrupted output when each chunk is decoded or concatenated naively. Fix: Use a byte-aware buffer that validates UTF-8 completeness before yielding. Implement character boundary detection for non-ASCII content.
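
One way to build such a buffer is to hold back at most the last three bytes until they form a complete character. Utf8Buffer below is a hypothetical class sketched under that assumption; it does not try to recover from input that is not UTF-8 at all.

```ruby
# Hypothetical buffer that only releases complete UTF-8 characters.
class Utf8Buffer
  def initialize
    @pending = String.new(encoding: Encoding::BINARY)
  end

  # Appends raw bytes and returns the longest prefix that is valid UTF-8.
  # An incomplete trailing character (at most 3 bytes) stays buffered.
  def push(bytes)
    @pending << bytes.b
    0.upto(3) do |trim|
      cut = @pending.bytesize - trim
      break if cut.negative?

      candidate = @pending.byteslice(0, cut).force_encoding(Encoding::UTF_8)
      next unless candidate.valid_encoding?

      @pending = @pending.byteslice(cut, trim) # keep the undecoded tail
      return candidate
    end
    String.new # nothing decodable yet
  end
end

buffer = Utf8Buffer.new
head = buffer.push("caf\xC3".b) # the 2-byte character is split after its first byte
tail = buffer.push("\xA9".b)    # the second byte completes it
```

Placing this in front of the line splitter keeps partial characters out of every ResponseEnvelope.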

6. Skipping Structured Error Normalization

Explanation: Provider errors return inconsistent schemas, making monitoring and alerting difficult. Fix: Map all provider errors to a unified LlmError hierarchy. Include provider name, HTTP status, raw payload, and retry count in error metadata.
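
A unified hierarchy might look like the sketch below. RateLimitError and ServiceUnavailableError match the class names rescued by the retry middleware; the status-to-class table and the build_llm_error helper are assumptions, since providers differ in which status codes they return.

```ruby
# Base class carrying the metadata needed for monitoring and alerting.
class LlmError < StandardError
  attr_reader :provider, :status, :payload

  def initialize(message = nil, provider: nil, status: nil, payload: nil)
    super(message)
    @provider = provider
    @status = status
    @payload = payload
  end
end

class RateLimitError < LlmError; end
class ServiceUnavailableError < LlmError; end

# Assumed mapping; extend it per provider as their error schemas require.
STATUS_ERRORS = { 429 => RateLimitError, 503 => ServiceUnavailableError }.freeze

# Hypothetical factory: unmapped statuses fall back to the generic LlmError.
def build_llm_error(provider:, status:, payload:)
  error_class = STATUS_ERRORS.fetch(status, LlmError)
  error_class.new("#{provider} returned HTTP #{status}", provider: provider, status: status, payload: payload)
end

error = build_llm_error(provider: :deepseek, status: 429, payload: '{"error":"slow down"}')
```

Adapters would raise the result of build_llm_error, so callers and the retry middleware only ever rescue LlmError subclasses.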

7. Overlooking Observability Hooks

Explanation: Without centralized logging, tracking latency, token usage, and failure rates across providers becomes impossible. Fix: Instrument the gateway with ActiveSupport::Notifications or OpenTelemetry spans. Log request/response metadata at debug level. Expose metrics for Prometheus/Grafana dashboards.

Production Bundle

Action Checklist

Register all target providers in the registry with explicit timeout and retry configurations
Implement a line-buffered stream parser that handles SSE, NDJSON, and raw deltas
Add idempotency key generation to the request builder for safe retry execution
Configure per-provider circuit breakers to prevent cascade failures during outages
Instrument the gateway with structured logging and token accounting metrics
Write integration tests against mock adapters to verify stream normalization and error handling
Implement fallback routing for critical paths when primary providers degrade

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Local Development	Ollama/LM Studio via unified gateway	Zero API costs, offline capability, fast iteration	None
High-Throughput Production	Hosted API with retry middleware & connection pooling	Scalability, SLA guarantees, global latency optimization	Moderate (token-based)
Cost-Sensitive Workloads	DeepSeek or open-weight models via gateway	Lower per-token pricing, acceptable latency trade-off	Low
Compliance-Heavy Environments	On-premise runtime with air-gapped registry	Data sovereignty, audit trails, no external egress	High (infrastructure)

Configuration Template

# config/llm_gateway.yml
default: &default
  max_retries: 3
  base_retry_delay: 0.5
  stream_buffer_size: 4096
  timeout:
    connect: 5
    read: 30
    write: 10

providers:
  ollama:
    # Ollama's native API (/api/chat) streams NDJSON; its OpenAI-compatible /v1 endpoint streams SSE.
    base_url: "http://localhost:11434"
    default_model: "llama3.2"
    stream_format: :ndjson
    timeout:
      read: 60

  lm_studio:
    base_url: "http://localhost:1234/v1"
    default_model: "tinyllama-1.1b-chat-v1.0"
    stream_format: :sse
    timeout:
      read: 45

  deepseek:
    base_url: "https://api.deepseek.com/v1"
    default_model: "deepseek-chat"
    stream_format: :sse
    auth_header: "Authorization"
    timeout:
      read: 20

  openai:
    base_url: "https://api.openai.com/v1"
    default_model: "gpt-4o-mini"
    stream_format: :sse
    auth_header: "Authorization"
    timeout:
      read: 15
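
The template defines a default anchor, but no provider merges it, and YAML's own merge key is shallow, so a provider-level timeout block would replace the default one entirely. A loader can apply the defaults with a recursive merge instead. The sketch below is one way to do that; load_provider_settings and deep_merge are hypothetical helper names.

```ruby
require "yaml"

# Recursively merges override into base so nested hashes such as timeout are
# combined instead of replaced wholesale.
def deep_merge(base, override)
  base.merge(override) do |_key, base_value, override_value|
    if base_value.is_a?(Hash) && override_value.is_a?(Hash)
      deep_merge(base_value, override_value)
    else
      override_value
    end
  end
end

# Hypothetical loader: returns the effective settings for each provider.
# Symbol is permitted so values like :sse and :ndjson load as Ruby symbols.
def load_provider_settings(yaml_text)
  config = YAML.safe_load(yaml_text, aliases: true, permitted_classes: [Symbol])
  defaults = config.fetch("default", {})
  config.fetch("providers", {}).transform_values { |settings| deep_merge(defaults, settings) }
end
```

With the template above, load_provider_settings(File.read("config/llm_gateway.yml")) would give each provider its own read timeout while inheriting the default connect and write values.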

Quick Start Guide

Initialize the registry: Load provider configurations from YAML and instantiate adapter classes with environment-specific credentials.
Configure streaming behavior: Set buffer sizes, timeout thresholds, and retry policies in the gateway initializer.
Execute a synchronous call: Use LlmGateway.complete(provider: :ollama, model: "llama3.2", messages: [{role: "user", content: "Hello"}]) to receive a normalized ResponseEnvelope.
Consume a stream: Call LlmGateway.stream(provider: :deepseek, model: "deepseek-chat", messages: [...]) { |chunk| process(chunk.content) } to handle real-time deltas as normalized envelopes. The retry middleware from Step 4 wraps only the synchronous execute path, so stream failures need their own error handling.
Verify observability: Check logs for structured metadata including provider name, token counts, latency, and retry attempts. Adjust circuit breaker thresholds based on production traffic patterns.
