Building a Rails-Native AI Abstraction Layer for Local and Hosted LLMs
Standardizing LLM Integration in Ruby: A Provider-Agnostic Architecture
Current Situation Analysis
The modern AI ecosystem is characterized by rapid model iteration and fragmented API standards. While the industry celebrates breakthroughs in reasoning, vision, and code generation, the underlying integration layer remains notoriously inconsistent. Development teams frequently assume that "OpenAI-compatible" endpoints guarantee drop-in interoperability. In practice, this assumption introduces severe architectural debt.
The core friction lies in response normalization. Providers diverge across multiple dimensions:
- Streaming protocols: Server-Sent Events (SSE), newline-delimited JSON (NDJSON), raw text deltas, or partial JSON fragments
- Lifecycle signals: Completion markers vary between finish_reason: "stop", done: true, is_last_chunk: true, or explicit empty payloads
- Error schemas: Rate limits, context window overflows, and model unavailability return different HTTP status codes and payload structures
- Retry semantics: Some endpoints expect idempotency keys, others rely on exponential backoff, and many silently drop connections without proper closure
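To make the lifecycle-signal divergence concrete, here is a minimal sketch that folds the completion markers listed above into one predicate. The helper name stream_finished? and the payload shapes are assumptions for illustration, not any provider's documented schema.

```ruby
require "json"

# Hypothetical helper: true when a decoded chunk carries any of the completion
# markers named in the list above. Payload shapes are illustrative only.
def stream_finished?(chunk)
  return true if chunk.nil? || chunk.empty?       # explicit empty payload
  return true if chunk["done"] == true            # done: true
  return true if chunk["is_last_chunk"] == true   # is_last_chunk: true

  # finish_reason: "stop" nested inside a choices array
  choices = chunk["choices"]
  choices.is_a?(Array) && choices.any? { |choice| choice["finish_reason"] == "stop" }
end

stream_finished?(JSON.parse('{"done": true}'))                              # => true
stream_finished?(JSON.parse('{"choices": [{"finish_reason": "stop"}]}'))    # => true
stream_finished?(JSON.parse('{"choices": [{"delta": {"content": "Hi"}}]}')) # => false
```

Keeping this check in one place means a new provider adds one branch here instead of a new condition in every stream consumer.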
When developers treat LLM integration as a simple HTTP wrapper, application logic becomes tightly coupled to provider-specific parsing, error handling, and streaming mechanics. This coupling creates three compounding problems:
- Vendor lock-in: Swapping models requires rewriting parsing logic, retry policies, and stream consumers
- Fragile observability: Metrics, tracing, and logging must be duplicated across every provider implementation
- Development friction: Local runtimes like Ollama and LM Studio introduce offline workflows that clash with hosted API assumptions
The industry overlooks this because prompt engineering and model selection dominate technical discussions. Integration infrastructure is treated as a transient concern rather than a foundational layer. Yet, production systems that handle thousands of daily inferences quickly expose the cost of unnormalized AI calls. The solution requires treating LLM integration with the same architectural rigor applied to database drivers or message queue clients.
WOW Moment: Key Findings
The architectural shift from direct API consumption to a unified gateway layer fundamentally changes where complexity lives. Instead of scattering provider-specific logic across controllers, services, and background jobs, complexity concentrates in a single, testable infrastructure module.
| Approach | Implementation Complexity | Streaming Overhead | Vendor Lock-in Risk | Error Coverage | Maintenance Burden |
|---|---|---|---|---|---|
| Direct HTTP Integration | High (per-provider) | High (manual parsing) | Critical | Fragmented | Linear growth |
| Unified Gateway Architecture | Medium (initial) | Low (normalized) | Minimal | Centralized | Constant |
This finding matters because it enables three critical production capabilities:
- Seamless provider rotation: Switch between Ollama for local development, DeepSeek for cost optimization, and hosted APIs for peak demand without touching business logic
- Consistent observability: Single point for latency tracking, token accounting, and error classification
- Predictable streaming: Application code consumes a uniform ResponseEnvelope regardless of whether the underlying provider uses SSE, NDJSON, or raw chunked transfer encoding
The abstraction layer transforms AI integration from a recurring integration task into a stable platform capability.
Core Solution
Building a provider-agnostic LLM layer requires three architectural pillars: a registry-based adapter system, an enumerator-driven streaming processor, and a middleware stack for retries and observability. The implementation follows Ruby idioms while maintaining strict separation between transport logic and application concerns.
Step 1: Define the Unified Interface
The gateway exposes two primary entry points: synchronous completion and streaming completion. Both return a normalized response envelope that abstracts provider differences.
module LlmGateway
  class << self
    def complete(params)
      adapter = ProviderRegistry.resolve(params[:provider])
      request = RequestBuilder.new(adapter, params)
      ResponseParser.parse(adapter.execute(request))
    end

    def stream(params, &block)
      adapter = ProviderRegistry.resolve(params[:provider])
      request = RequestBuilder.new(adapter, params)
      StreamProcessor.new(adapter, request, &block).run
    end
  end
end
Step 2: Implement the Provider Registry
The registry decouples application code from concrete adapter implementations. It uses a strategy pattern to route requests to the correct HTTP client and parser.
class ProviderRegistry
  ADAPTERS = {
    ollama: OllamaAdapter,
    lm_studio: LmStudioAdapter,
    deepseek: DeepSeekAdapter,
    openai: OpenAiCompatibleAdapter
  }.freeze

  def self.resolve(provider_key)
    adapter_class = ADAPTERS.fetch(provider_key) do
      raise UnknownProviderError, "No adapter registered for #{provider_key}"
    end
    adapter_class.new
  end
end
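Each adapter implements a minimal contract: build_url, build_headers, execute, and parse_stream. The streaming code in Step 3 additionally calls stream, parse_line, and name on the adapter, so a streaming-capable adapter needs those as well. This contract ensures transport consistency while allowing provider-specific optimizations.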
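One way to enforce the four-method contract is a base class that fails loudly when a method is missing. The sketch below is an assumption about how that could look: the method names come from the contract above, while BaseAdapter, EchoAdapter, and the method signatures are hypothetical.

```ruby
# Hypothetical base class: contract method names are from the article,
# signatures and the NotImplementedError defaults are assumptions.
class BaseAdapter
  CONTRACT = %i[build_url build_headers execute parse_stream].freeze

  CONTRACT.each do |method_name|
    define_method(method_name) do |*_args|
      raise NotImplementedError, "#{self.class.name} must implement ##{method_name}"
    end
  end
end

# Illustrative in-memory adapter for tests; it performs no HTTP and
# deliberately leaves parse_stream unimplemented.
class EchoAdapter < BaseAdapter
  def build_url(_request)
    "http://localhost/echo"
  end

  def build_headers(_request)
    { "Content-Type" => "application/json" }
  end

  def execute(request)
    { "content" => request[:prompt] }
  end
end

EchoAdapter.new.execute({ prompt: "Hello" }) # => {"content"=>"Hello"}
```

Calling EchoAdapter.new.parse_stream("...") raises NotImplementedError, which surfaces an incomplete adapter in the test suite rather than in production traffic.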
Step 3: Normalize Streaming with Enumerators
Streaming is the most complex integration surface. Ruby's Enumerator is pull-based: the consumer decides when the next chunk is produced, which gives lazy evaluation and a simple form of backpressure without the nesting that callback-heavy approaches require.
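class StreamProcessor
  def initialize(adapter, request, &block)
    @adapter = adapter
    @request = request
    @consumer = block
    @buffer = String.new
  end

  # With a block, drains the stream and calls the block once per envelope.
  # Without a block, returns a lazy Enumerator for the caller to iterate.
  def run
    stream = Enumerator.new do |yielder|
      @adapter.stream(@request) do |raw_chunk|
        @buffer << raw_chunk
        each_token do |token|
          yielder << ResponseEnvelope.new(content: token, provider: @adapter.name)
        end
      end
    end
    return stream unless @consumer

    stream.each { |envelope| @consumer.call(envelope) }
  end

  private

  # Consumes every complete line in the buffer; a trailing partial line
  # stays buffered until the next chunk completes it.
  def each_token
    while (newline = @buffer.index("\n"))
      line = @buffer.slice!(0..newline).chomp
      token = @adapter.parse_line(line)
      yield token if token # skip lines with no content, such as blank SSE separators
    end
  end
end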
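The processor accumulates raw bytes, splits them on newline boundaries, and hands each complete line to the adapter's parse_line, which returns the content token or nil. It yields normalized ResponseEnvelope objects, so SSE data prefixes and NDJSON formatting are handled behind a single parsing interface. A JSON fragment split across chunks stays in the buffer until its terminating newline arrives.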
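The parse_line step is where the provider formats actually differ. The following is a minimal sketch of two line parsers, assuming OpenAI-style SSE chunks (content under choices[0].delta.content, terminated by a [DONE] sentinel) and Ollama-style NDJSON chunks (content under message.content). The LineParsers module is hypothetical, and real adapters should be checked against each provider's current documentation.

```ruby
require "json"

# Hypothetical line parsers; the JSON shapes are assumptions modeled on
# OpenAI-style SSE chunks and Ollama-style NDJSON chunks.
module LineParsers
  module_function

  # Returns the content delta for an SSE "data:" line, or nil for lines that
  # carry no content (blank separators, other fields, the [DONE] sentinel).
  def sse(line)
    return nil unless line.start_with?("data:")

    payload = line.delete_prefix("data:").strip
    return nil if payload.empty? || payload == "[DONE]"

    JSON.parse(payload).dig("choices", 0, "delta", "content")
  end

  # Returns the content of one NDJSON object, or nil for a blank line.
  def ndjson(line)
    return nil if line.strip.empty?

    JSON.parse(line).dig("message", "content")
  end
end

LineParsers.sse('data: {"choices":[{"delta":{"content":"Hi"}}]}') # => "Hi"
LineParsers.sse("data: [DONE]")                                   # => nil
LineParsers.ndjson('{"message":{"content":"Hi"},"done":false}')   # => "Hi"
```

Both parsers return nil for lines without content, which matches how the stream processor decides whether a line produces an envelope.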
Step 4: Add Retry Middleware with Idempotency
Network instability and rate limits require robust retry logic. The middleware layer intercepts failures, applies exponential backoff, and tracks attempt metadata.
class RetryMiddleware
  MAX_ATTEMPTS = 3
  BASE_DELAY = 0.5

  def initialize(adapter)
    @adapter = adapter
  end

  def execute(request)
    attempt = 0
    begin
      attempt += 1
      @adapter.execute(request)
    rescue RateLimitError, ServiceUnavailableError => e
      raise e if attempt >= MAX_ATTEMPTS
      delay = BASE_DELAY * (2 ** (attempt - 1))
      sleep(delay)
      retry
    end
  end
end
The middleware respects provider-specific error codes while maintaining a uniform retry policy. Idempotency keys should be injected at the request builder level to prevent duplicate completions during retries.
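A rough sketch of key injection at the request level follows. The IdempotentRequest class and the Idempotency-Key header name are assumptions; whether a given provider honors such a header has to be verified per provider.

```ruby
require "securerandom"

# Hypothetical request object. The "Idempotency-Key" header name is an
# assumption, not a documented requirement of any provider named here.
class IdempotentRequest
  attr_reader :headers, :body

  def initialize(body, idempotency_key: SecureRandom.uuid)
    @body = body
    # Generated once at build time, so every retry of this request object
    # sends the same key.
    @headers = { "Idempotency-Key" => idempotency_key }.freeze
  end
end

request = IdempotentRequest.new({ model: "llama3.2" })
request.headers["Idempotency-Key"] # a UUID string, stable across retries
```

Because the retry middleware re-executes the same request object, the key is generated exactly once per logical call rather than once per attempt.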
Architecture Rationale
- Registry over inheritance: Composition allows runtime provider switching without class hierarchy constraints
- Enumerator over callbacks: Lazy evaluation prevents memory bloat during long streams and enables backpressure
- Envelope pattern: Normalizes metadata (tokens used, finish reason, provider name) alongside content
- Middleware stack: Separates cross-cutting concerns (retries, logging, tracing) from transport logic
This architecture mirrors database connection pooling and ORM design patterns, treating LLM providers as pluggable data sources rather than external dependencies.
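The envelope pattern can be sketched as a small value object. Only content and provider appear in the article's code. The finish_reason and tokens_used fields follow the metadata named in the rationale above, and their names, like the finished? helper, are assumptions.

```ruby
# Assumed shape: content and provider are used by StreamProcessor; the other
# fields and the finished? helper are illustrative additions.
ResponseEnvelope = Struct.new(:content, :provider, :finish_reason, :tokens_used,
                              keyword_init: true) do
  # A chunk is final once the adapter has attached a finish reason.
  def finished?
    !finish_reason.nil?
  end
end

delta = ResponseEnvelope.new(content: "Hel", provider: :ollama)
final = ResponseEnvelope.new(content: "", provider: :ollama,
                             finish_reason: "stop", tokens_used: 12)

delta.finished? # => false
final.finished? # => true
```

A Struct keeps the envelope cheap to allocate per chunk while still giving consumers named accessors instead of provider-specific hash keys.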
Pitfall Guide
1. Assuming Uniform Streaming Formats
Explanation: Developers often write parsers that expect a single chunk structure. Providers mix SSE prefixes (data: ), NDJSON arrays, and raw text deltas.
Fix: Implement a line-buffered parser that strips protocol prefixes before JSON deserialization. Validate chunk structure before processing.
2. Blocking the Main Thread During Stream Consumption
Explanation: Synchronous stream consumers block the calling thread, causing request timeouts in web frameworks.
Fix: Use Enumerator with lazy evaluation. Offload heavy processing to background workers. Implement backpressure by pausing the enumerator when downstream systems are saturated.
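The pull-based behavior behind that advice can be shown without any network code. In this sketch the producer is a stand-in for a provider stream, and consume_with_limit is a hypothetical helper that reports how many chunks were actually generated. Whether a real HTTP client stops reading from the socket when the consumer stops depends on the client.

```ruby
# Returns the chunks consumed plus how many the producer actually generated.
def consume_with_limit(total, limit)
  produced = 0
  stream = Enumerator.new do |yielder|
    total.times do |i|
      produced += 1 # counts work done by the producer side
      yielder << "chunk-#{i}"
    end
  end
  [stream.first(limit), produced]
end

chunks, produced = consume_with_limit(5, 2)
chunks   # => ["chunk-0", "chunk-1"]
produced # => 2
```

The producer generates only the two chunks the consumer asked for, which is the property that keeps a slow or saturated consumer from forcing the whole response into memory.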
3. Ignoring Idempotency in Retry Logic
Explanation: Retrying non-idempotent requests duplicates completions, inflating token costs and causing inconsistent state.
Fix: Generate UUID-based idempotency keys at the request builder level. Pass keys through headers. Cache successful responses to prevent duplicate execution.
4. Applying One Global Timeout to Every Provider
Explanation: Local runtimes like Ollama require longer timeouts than hosted APIs. Global timeout settings cause premature failures or unnecessary delays.
Fix: Configure timeouts per provider in the registry. Use connection timeout for handshake and read timeout for streaming. Implement circuit breakers for degraded providers.
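With Ruby's standard Net::HTTP client, per-provider timeouts might be applied as in the sketch below. The PROVIDER_TIMEOUTS table and the http_client_for helper are hypothetical, and the numbers are placeholders rather than recommendations.

```ruby
require "net/http"
require "uri"

# Timeout values in seconds; illustrative numbers only.
PROVIDER_TIMEOUTS = {
  ollama: { open: 5, read: 60, write: 10 },
  openai: { open: 5, read: 15, write: 10 }
}.freeze

# Builds an unopened Net::HTTP client with the provider's timeouts applied.
# No connection is made until a request is sent.
def http_client_for(provider, base_url)
  uri = URI.parse(base_url)
  timeouts = PROVIDER_TIMEOUTS.fetch(provider)

  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = uri.scheme == "https"
  http.open_timeout = timeouts[:open]   # connection handshake
  http.read_timeout = timeouts[:read]   # wait between reads while streaming
  http.write_timeout = timeouts[:write] # sending the request body
  http
end

client = http_client_for(:ollama, "http://localhost:11434")
client.read_timeout # => 60
```

Note that read_timeout applies to each read rather than to the whole response, so a long stream that keeps emitting chunks does not trip it.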
5. Neglecting Token Boundary Awareness
Explanation: Network chunks are cut at arbitrary byte offsets, so a chunk boundary can fall in the middle of a multi-byte UTF-8 character. Decoding each chunk on its own then produces corrupted output.
Fix: Use a byte-aware buffer that validates UTF-8 completeness before yielding. Implement character boundary detection for non-ASCII content.
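A byte-aware buffer of that kind can be sketched as follows. Utf8Buffer is a hypothetical class: it holds back up to three trailing bytes (a UTF-8 character is at most four bytes long) until they form a complete character, and it passes through input that is invalid regardless of the tail.

```ruby
# Hypothetical buffer that only releases complete UTF-8 characters.
class Utf8Buffer
  def initialize
    @pending = String.new(encoding: Encoding::BINARY)
  end

  # Appends raw bytes and returns the longest valid UTF-8 prefix.
  def push(bytes)
    @pending << bytes.b
    size = @pending.bytesize

    # Try holding back 0, 1, 2, then 3 trailing bytes.
    hold = (0..[3, size].min).find do |n|
      @pending.byteslice(0, size - n).force_encoding(Encoding::UTF_8).valid_encoding?
    end
    hold ||= 0 # invalid no matter where we cut: pass the bytes through

    ready = @pending.byteslice(0, size - hold).force_encoding(Encoding::UTF_8)
    @pending = @pending.byteslice(size - hold, hold)
    ready
  end
end

buffer = Utf8Buffer.new
buffer.push("caf\xC3".b) # => "caf"  (first byte of "é" is held back)
buffer.push("\xA9!".b)   # => "é!"
```

Placing this in front of the line splitter means downstream code always sees valid strings, at the cost of delaying at most three bytes.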
6. Skipping Structured Error Normalization
Explanation: Provider errors return inconsistent schemas, making monitoring and alerting difficult.
Fix: Map all provider errors to a unified LlmError hierarchy. Include provider name, HTTP status, raw payload, and retry count in error metadata.
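A minimal sketch of such a hierarchy is shown below. RateLimitError and ServiceUnavailableError are the names rescued by RetryMiddleware. The STATUS_ERRORS mapping and the normalize_error helper are assumptions, and real providers may signal these conditions through payload fields rather than status codes alone.

```ruby
# Base error carrying the metadata listed in the fix above.
class LlmError < StandardError
  attr_reader :provider, :status, :payload, :retry_count

  def initialize(message = nil, provider:, status: nil, payload: nil, retry_count: 0)
    super(message || "#{provider} request failed (HTTP #{status})")
    @provider = provider
    @status = status
    @payload = payload
    @retry_count = retry_count
  end
end

class RateLimitError < LlmError; end
class ServiceUnavailableError < LlmError; end

# Assumed mapping from HTTP status to error class; anything else falls
# back to the generic LlmError.
STATUS_ERRORS = { 429 => RateLimitError, 503 => ServiceUnavailableError }.freeze

def normalize_error(provider, status, payload)
  STATUS_ERRORS.fetch(status, LlmError).new(provider: provider, status: status, payload: payload)
end

error = normalize_error(:deepseek, 429, '{"error":"slow down"}')
error.class   # => RateLimitError
error.message # => "deepseek request failed (HTTP 429)"
```

Alerting rules can then key on the error class and provider attribute instead of parsing each provider's payload.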
7. Overlooking Observability Hooks
Explanation: Without centralized logging, tracking latency, token usage, and failure rates across providers becomes impossible.
Fix: Instrument the gateway with ActiveSupport::Notifications or OpenTelemetry spans. Log request/response metadata at debug level. Expose metrics for Prometheus/Grafana dashboards.
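The shape of such a hook can be illustrated with the standard library alone. GatewayInstrumentation below is a hypothetical stand-in, not the ActiveSupport API. In a Rails application the same pattern would normally be expressed with ActiveSupport::Notifications.instrument and a subscriber.

```ruby
# Stdlib-only stand-in for an instrumentation hook (hypothetical module).
module GatewayInstrumentation
  SUBSCRIBERS = []

  def self.subscribe(&block)
    SUBSCRIBERS << block
  end

  # Times the block, records the error class on failure, and notifies every
  # subscriber whether the call succeeded or raised.
  def self.instrument(event, payload = {})
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  rescue StandardError => e
    payload = payload.merge(error: e.class.name)
    raise
  ensure
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0
    SUBSCRIBERS.each { |subscriber| subscriber.call(event, payload.merge(duration_ms: elapsed_ms)) }
  end
end

GatewayInstrumentation.subscribe do |event, payload|
  puts "#{event} provider=#{payload[:provider]} duration_ms=#{payload[:duration_ms].round(2)}"
end

GatewayInstrumentation.instrument("llm.complete", { provider: :ollama }) { :response }
```

Wrapping the gateway's complete and stream entry points in one instrument call gives every provider the same latency and failure events.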
Production Bundle
Action Checklist
- Register all target providers in the registry with explicit timeout and retry configurations
- Implement a line-buffered stream parser that handles SSE, NDJSON, and raw deltas
- Add idempotency key generation to the request builder for safe retry execution
- Configure per-provider circuit breakers to prevent cascade failures during outages
- Instrument the gateway with structured logging and token accounting metrics
- Write integration tests against mock adapters to verify stream normalization and error handling
- Implement fallback routing for critical paths when primary providers degrade
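The circuit-breaker item in the checklist can be sketched in a few lines. CircuitBreaker here is a hypothetical, single-threaded illustration with an injectable clock for testing. A production version would need thread safety and per-provider instances.

```ruby
# Hypothetical circuit breaker: opens after `threshold` consecutive failures
# and rejects calls until `cooldown` seconds have passed.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 3, cooldown: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @threshold = threshold
    @cooldown = cooldown
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def open?
    !@opened_at.nil? && (@clock.call - @opened_at) < @cooldown
  end

  def call
    raise OpenError, "circuit open" if open?

    result = yield
    @failures = 0 # any success closes the circuit again
    @opened_at = nil
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = @clock.call if @failures >= @threshold
    raise
  end
end

breaker = CircuitBreaker.new(threshold: 2, cooldown: 10)
breaker.call { :ok } # => :ok
```

After the cooldown the next call is allowed through as a probe. If it fails, the circuit reopens immediately because the failure count is still at the threshold.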
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Local Development | Ollama/LM Studio via unified gateway | Zero API costs, offline capability, fast iteration | None |
| High-Throughput Production | Hosted API with retry middleware & connection pooling | Scalability, SLA guarantees, global latency optimization | Moderate (token-based) |
| Cost-Sensitive Workloads | DeepSeek or open-weight models via gateway | Lower per-token pricing, acceptable latency trade-off | Low |
| Compliance-Heavy Environments | On-premise runtime with air-gapped registry | Data sovereignty, audit trails, no external egress | High (infrastructure) |
Configuration Template
# config/llm_gateway.yml
default: &default
  max_retries: 3
  base_retry_delay: 0.5
  stream_buffer_size: 4096
  timeout:
    connect: 5
    read: 30
    write: 10

providers:
  ollama:
    # Ollama's native /api endpoints stream NDJSON; its OpenAI-compatible /v1 endpoints stream SSE.
    base_url: "http://localhost:11434/api"
    default_model: "llama3.2"
    stream_format: :ndjson
    timeout:
      read: 60
  lm_studio:
    base_url: "http://localhost:1234/v1"
    default_model: "tinyllama-1.1b-chat-v1.0"
    stream_format: :sse
    timeout:
      read: 45
  deepseek:
    base_url: "https://api.deepseek.com/v1"
    default_model: "deepseek-chat"
    stream_format: :sse
    auth_header: "Authorization"
    timeout:
      read: 20
  openai:
    base_url: "https://api.openai.com/v1"
    default_model: "gpt-4o-mini"
    stream_format: :sse
    auth_header: "Authorization"
    timeout:
      read: 15
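Loading a file like this takes a little care, because a provider's timeout block overrides only one key while the defaults define three. The sketch below uses a shortened inline copy of the template and a hypothetical provider_settings helper. It assumes providers sit beside the default block and that symbol values such as :ndjson should load as Ruby symbols.

```ruby
require "yaml"

# Shortened copy of the template, inlined so the example is self-contained.
SAMPLE_CONFIG = <<~YAML
  default:
    max_retries: 3
    timeout:
      connect: 5
      read: 30
      write: 10
  providers:
    ollama:
      stream_format: :ndjson
      timeout:
        read: 60
YAML

# Merges a provider's settings over the defaults. Hash#merge is shallow, so
# the nested timeout hash is merged separately to keep connect and write.
def provider_settings(config, name)
  defaults = config.fetch("default")
  overrides = config.fetch("providers").fetch(name.to_s)

  merged = defaults.merge(overrides)
  merged["timeout"] = defaults.fetch("timeout", {}).merge(overrides.fetch("timeout", {}))
  merged
end

# Symbol must be permitted explicitly, or safe_load rejects ":ndjson".
config = YAML.safe_load(SAMPLE_CONFIG, permitted_classes: [Symbol])
settings = provider_settings(config, :ollama)
settings["timeout"]       # => {"connect"=>5, "read"=>60, "write"=>10}
settings["stream_format"] # => :ndjson
```

Without the separate timeout merge, the provider's single read key would replace the whole default timeout hash and silently drop the connect and write values.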
Quick Start Guide
- Initialize the registry: Load provider configurations from YAML and instantiate adapter classes with environment-specific credentials.
- Configure streaming behavior: Set buffer sizes, timeout thresholds, and retry policies in the gateway initializer.
- Execute a synchronous call: Use LlmGateway.complete(provider: :ollama, model: "llama3.2", messages: [{role: "user", content: "Hello"}]) to receive a normalized ResponseEnvelope.
- Consume a stream: Call LlmGateway.stream(provider: :deepseek, model: "deepseek-chat", messages: [...]) { |chunk| process(chunk.content) } to handle real-time deltas with automatic backpressure and error recovery.
- Verify observability: Check logs for structured metadata including provider name, token counts, latency, and retry attempts. Adjust circuit breaker thresholds based on production traffic patterns.
