How to Build a Self-Hosted AI Gateway With LiteLLM and Open WebUI
Architecting a Provider-Agnostic AI Routing Layer for Self-Hosted Infrastructure
Current Situation Analysis
The modern AI development stack is inherently fragmented. Teams rarely commit to a single model provider. Instead, they distribute workloads across OpenAI for reasoning, Anthropic for safety-aligned outputs, Groq for low-latency inference, and local Ollama instances for data-sensitive tasks. This distribution creates immediate operational friction. Each provider exposes distinct authentication mechanisms, rate-limiting policies, endpoint structures, and model versioning cycles.
The core pain point isn't building the initial integration; it's maintaining it. When a provider deprecates a model identifier, updates its API schema, or changes token pricing tiers, every downstream application must be patched. Authentication tokens multiply across environments. Load balancing becomes a manual exercise. Over time, the glue code required to route requests, handle fallbacks, and normalize responses consumes more engineering bandwidth than the actual AI features.
This problem is frequently overlooked because early-stage prototypes mask the complexity. Developers wire a single frontend directly to one provider's API, validate functionality, and move on. The architectural debt compounds silently. By the time multiple services depend on the AI layer, swapping providers or adding redundancy requires coordinated deployments across the entire stack. Industry observations show that provider model identifiers change or deprecate on average every 3β6 months. Without a centralized routing abstraction, each change triggers a cascade of configuration updates, integration tests, and potential downtime.
WOW Moment: Key Findings
Introducing a unified proxy layer fundamentally alters the operational profile of an AI infrastructure. By decoupling the consumer interface from the provider backend, you transform volatile external dependencies into stable internal contracts.
| Approach | Endpoint Management | Authentication Overhead | Model Rotation Effort | Security Surface |
|---|---|---|---|---|
| Direct Provider Integration | N separate endpoints | N distinct token sets | Code changes + redeploy per model | Exposed provider keys per service |
| Unified Gateway Architecture | Single OpenAI-compatible endpoint | Centralized token rotation | Config update + container restart | Single ingress point, provider keys isolated |
This finding matters because it shifts AI infrastructure from a brittle, tightly-coupled design to a resilient, plugin-based architecture. The gateway normalizes request/response schemas, handles provider-specific authentication internally, and exposes a consistent OpenAI-compatible interface to all downstream consumers. Model swaps become configuration changes rather than code deployments. Security posture improves by eliminating scattered API keys across multiple services. Operational visibility consolidates into a single observability layer.
Core Solution
The architecture relies on four coordinated components: a routing proxy, a stateful frontend, a persistent database, and a secure network tunnel. Each component serves a specific boundary, and their interaction follows a strict request lifecycle.
1. Infrastructure Orchestration
Docker Compose manages the container lifecycle. The routing proxy and frontend run as independent services, sharing a Docker network for internal communication. PostgreSQL persists chat history, user sessions, and model metadata. Isolating the database ensures that frontend restarts or proxy updates never corrupt conversation state.
2. Gateway Configuration
The proxy acts as an OpenAI-compatible API gateway. It accepts standard /v1/chat/completions requests, authenticates them against a central token, and routes them to the appropriate provider backend based on the requested model alias. The configuration file defines model mappings, provider credentials, and routing rules. Credentials are injected via environment variables, never stored in plaintext configuration files.
3. Frontend Integration
The chat interface connects exclusively to the gateway endpoint. It treats the proxy as a standard OpenAI API server. No provider-specific SDKs or custom routing logic are required in the frontend codebase. Model selection in the UI maps directly to gateway aliases, which the proxy resolves to actual provider endpoints.
4. Secure Network Exposure
Cloudflare Tunnel establishes an outbound-only encrypted connection to Cloudflare's edge network. This eliminates inbound firewall rules, hides the origin IP, and provides automatic TLS termination. The tunnel routes traffic from a public subdomain to the local frontend container, maintaining zero-trust network principles.
Implementation Walkthrough
Step 1: Define the Container Topology
Create a compose-stack.yaml file. The proxy service requires the configuration file mounted as a volume. The frontend service points its API base URL to the proxy's internal hostname. PostgreSQL uses a named volume for data persistence.
version: '3.8'
services:
ai-router:
image: ghcr.io/berriai/litellm:main-latest
container_name: llm-gateway
ports:
- "4000:4000"
volumes:
- ./router-config.yaml:/app/config.yaml
environment:
- ROUTER_AUTH_TOKEN=${ROUTER_AUTH_TOKEN}
- OPENAI_API_KEY=${OPENAI_API_KEY}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
- GROQ_API_KEY=${GROQ_API_KEY}
- OLLAMA_BASE_URL=${OLLAMA_BASE_URL}
depends_on:
- llm-db
chat-frontend:
image: ghcr.io/open-webui/open-webui:main
container_name: ai-chat-ui
ports:
- "3000:8080"
environment:
- OPENAI_API_BASE_URL=http://llm-gateway:4000/v1
- OPENAI_API_KEY=${ROUTER_AUTH_TOKEN}
- WEBUI_SECRET_KEY=${WEBUI_SECRET_KEY}
volumes:
- frontend-data:/app/backend/data
depends_on:
- ai-router
llm-db:
image: postgres:15-alpine
container_name: ai-chat-db
environment:
- POSTGRES_USER=webui_admin
- POSTGRES_PASSWORD=${DB_PASSWORD}
- POSTGRES_DB=chat_state
volumes:
- db-persistence:/var/lib/postgresql/data
volumes:
frontend-data:
db-persistence:
Step 2: Configure the Routing Layer
The router-config.yaml file maps logical model names to provider endpoints. The gateway reads this file on startup and registers the aliases. Authentication is enforced via the master token, which the frontend uses to validate requests.
model_list:
- model_name: fast-reasoning
litellm_params:
model: openai/gpt-4o-mini
api_key: os.environ/OPENAI_API_KEY
- model_name: safety-aligned
litellm_params:
model: anthropic/claude-3-5-sonnet-20241022
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: low-latency
litellm_params:
model: groq/llama-3.1-70b-versatile
api_key: os.environ/GROQ_API_KEY
- model_name: local-inference
litellm_params:
model: ollama_chat/llama3.2
api_base: os.environ/OLLAMA_BASE_URL
general_settings:
master_key: os.environ/ROUTER_AUTH_TOKEN
completion_model: fast-reasoning
ui:
enabled: true
Step 3: Initialize and Validate Export environment variables and launch the stack. Verify container health and test the gateway endpoint before accessing the frontend.
export $(grep -v '^#' .env | xargs)
docker compose -f compose-stack.yaml up -d
# Validate gateway registration
curl -s http://localhost:4000/v1/models \
-H "Authorization: Bearer $ROUTER_AUTH_TOKEN" | jq '.data[].id'
If the response returns the registered aliases, the routing layer is operational. The frontend will automatically populate its model selector with these identifiers.
Step 4: Establish Secure Remote Access Initialize a Cloudflare Tunnel, bind it to a subdomain, and configure the ingress rule to point to the frontend container. Run the tunnel daemon to maintain persistent connectivity.
cloudflared tunnel create ai-gateway
cloudflared tunnel route dns ai-gateway chat.internal.dev
Create ~/.cloudflared/config.yaml:
tunnel: ai-gateway
credentials-file: ~/.cloudflared/<TUNNEL_UUID>.json
ingress:
- hostname: chat.internal.dev
service: http://localhost:3000
- service: http_status:404
Start the daemon:
cloudflared service install
cloudflared service start
Pitfall Guide
1. Hardcoding Provider Model Identifiers
Explanation: Provider model names change frequently due to versioning, deprecation, or rebranding. Embedding them directly in application code or static configuration without version pinning causes silent failures or unexpected behavior. Fix: Use logical aliases in the gateway config. Map aliases to fully qualified provider model strings. Implement a periodic validation script that checks provider API endpoints for available models and alerts on mismatches.
2. Docker Volume Mount Typos
Explanation: Mounting a configuration file with a trailing slash or incorrect path causes Docker to create a directory instead of binding the file. The gateway starts with empty or default configuration, leading to authentication failures or missing models.
Fix: Always verify mount paths with docker inspect <container> | grep -A 5 Mounts. Use absolute paths or explicit relative paths without trailing slashes. Validate file existence before starting the stack.
3. Scattered API Key Management
Explanation: Distributing provider keys across multiple services increases the blast radius of a credential leak. It also complicates rotation, as each service must be updated independently. Fix: Centralize all provider credentials in the gateway's environment. Downstream services only require the gateway's master token. Implement secret rotation procedures that update the gateway environment and restart the container, leaving downstream configs untouched.
4. Ignoring Connection Timeouts and Retries
Explanation: Provider APIs experience transient failures, rate limits, or latency spikes. Without timeout configuration and retry logic, requests hang or fail immediately, degrading user experience.
Fix: Configure gateway-level timeout thresholds and retry policies. Set request_timeout and max_retries in the routing configuration. Implement circuit breaker patterns for non-critical fallback models to prevent cascading failures.
5. Running Tunnels as Foreground Processes
Explanation: Executing cloudflared tunnel run in a terminal session ties the network exposure to that session. Closing the terminal or experiencing a shell crash severs connectivity without warning.
Fix: Install the tunnel as a system service. Use cloudflared service install and enable it via systemd or launchd. Monitor service health with systemctl status cloudflared or equivalent daemon managers.
6. Overlooking Database Volume Persistence
Explanation: Failing to declare named volumes for PostgreSQL causes container recreation to wipe chat history, user accounts, and model preferences. This is a common oversight in ephemeral development setups.
Fix: Explicitly define named volumes in the compose file. Backup volume data periodically using docker run --rm -v <volume_name>:/data -v $(pwd):/backup alpine tar czf /backup/db-backup.tar.gz -C /data ..
7. Misaligned CORS and Proxy Headers
Explanation: The frontend expects specific headers from the gateway. If the tunnel or reverse proxy strips or modifies Authorization, Content-Type, or Origin headers, authentication fails or requests are rejected.
Fix: Verify tunnel ingress configuration preserves headers. Test with curl -v to inspect request/response headers. Ensure the gateway's CORS settings allow the frontend's origin domain.
Production Bundle
Action Checklist
- Define logical model aliases in the gateway configuration before deployment
- Inject all provider credentials via environment variables, never in config files
- Verify Docker volume mounts resolve to files, not directories
- Configure explicit timeouts and retry policies in the routing layer
- Install the network tunnel as a background service, not a terminal process
- Declare persistent named volumes for all stateful databases
- Validate gateway
/v1/modelsendpoint before connecting downstream services - Implement automated credential rotation procedures with zero-downtime restarts
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Single provider, low traffic | Direct API integration | Simpler setup, fewer moving parts | Lower infrastructure cost, higher maintenance overhead |
| Multi-provider, frequent model swaps | Unified gateway architecture | Centralized routing, config-only model changes | Moderate infrastructure cost, significantly lower operational overhead |
| Internal team tooling | Gateway + local tunnel | Secure access without public exposure, centralized auth | Minimal cost, high security posture |
| Public-facing AI product | Gateway + CDN + WAF | DDoS protection, rate limiting, global edge caching | Higher infrastructure cost, enterprise-grade reliability |
Configuration Template
# .env
ROUTER_AUTH_TOKEN=sk-prod-<strong-random-hex>
OPENAI_API_KEY=sk-<provider-key>
ANTHROPIC_API_KEY=sk-ant-<provider-key>
GROQ_API_KEY=gsk_<provider-key>
OLLAMA_BASE_URL=http://host.docker.internal:11434/v1
WEBUI_SECRET_KEY=<32-char-random-string>
DB_PASSWORD=<strong-db-password>
# router-config.yaml
model_list:
- model_name: primary
litellm_params:
model: openai/gpt-4o
api_key: os.environ/OPENAI_API_KEY
- model_name: alternative
litellm_params:
model: anthropic/claude-3-5-haiku-20241022
api_key: os.environ/ANTHROPIC_API_KEY
- model_name: local
litellm_params:
model: ollama_chat/mistral
api_base: os.environ/OLLAMA_BASE_URL
general_settings:
master_key: os.environ/ROUTER_AUTH_TOKEN
completion_model: primary
ui:
enabled: true
request_timeout: 60
max_retries: 3
Quick Start Guide
- Prepare credentials: Create a
.envfile with provider keys and a strong gateway master token. Never commit this file. - Launch containers: Run
docker compose -f compose-stack.yaml up -d. Wait for PostgreSQL initialization and gateway startup. - Validate routing: Execute
curl -s http://localhost:4000/v1/models -H "Authorization: Bearer $ROUTER_AUTH_TOKEN" | jq '.data[].id'. Confirm all aliases appear. - Access frontend: Open
http://localhost:3000in a browser. Select a model from the dropdown and send a test prompt. - Expose securely: Initialize a Cloudflare Tunnel, bind it to your subdomain, and start the daemon service. Verify connectivity via
curl -I https://<your-subdomain>.
Mid-Year Sale β Unlock Full Article
Base plan from just $4.99/mo or $49/yr
Sign in to read the full article and unlock all tutorials.
Sign In / Register β Start Free Trial7-day free trial Β· Cancel anytime Β· 30-day money-back
