tall-recommendsstrips unnecessary system packages. Installingca-certificatesensures TLS handshakes with external APIs succeed. Layeringrequirements.txt` before application code ensures that dependency installation is cached unless the manifest changes, drastically reducing CI build times.
Step 2: Async API Layer Design
LLM inference is inherently I/O bound. Synchronous blocking calls will exhaust worker threads and degrade throughput. The following implementation uses httpx with connection pooling and explicit timeout boundaries.
# src/gateway.py
import os
import logging
from typing import List, Optional
from fastapi import FastAPI, HTTPException, Depends
from pydantic import BaseModel, Field
from httpx import AsyncClient, Timeout, HTTPStatusError
from config import ServiceConfig
logger = logging.getLogger(__name__)
app = FastAPI(title="LLM Inference Gateway", version="2.1.0")
class Turn(BaseModel):
role: str
text: str
class InferencePayload(BaseModel):
turns: List[Turn]
target_model: str = Field(default="claude-3-5-sonnet-20241022")
max_output_tokens: int = Field(default=1024, le=4096)
sampling_temperature: float = Field(default=0.7, ge=0.0, le=2.0)
async def get_http_client() -> AsyncClient:
"""Provides a pooled async client with strict timeout boundaries."""
timeout = Timeout(connect=5.0, read=120.0, write=10.0, pool=10.0)
async with AsyncClient(timeout=timeout) as client:
yield client
@app.get("/ready")
async def readiness_probe():
return {"status": "operational"}
@app.post("/v1/infer")
async def run_inference(payload: InferencePayload, client: AsyncClient = Depends(get_http_client)):
config = ServiceConfig()
request_body = {
"model": payload.target_model,
"messages": [{"role": t.role, "content": t.text} for t in payload.turns],
"max_tokens": payload.max_output_tokens,
"temperature": payload.sampling_temperature
}
try:
response = await client.post(
url="https://api.ofox.ai/v1/chat/completions",
headers={
"Authorization": f"Bearer {config.provider_token}",
"Content-Type": "application/json"
},
json=request_body
)
response.raise_for_status()
result = response.json()
return {
"generated_text": result["choices"][0]["message"]["content"],
"model_version": result["model"],
"token_consumption": result["usage"]["total_tokens"]
}
except HTTPStatusError as exc:
logger.error(f"Provider API error: {exc.response.status_code} - {exc.response.text}")
raise HTTPException(status_code=exc.response.status_code, detail="Upstream inference failed")
except Exception as exc:
logger.exception("Unexpected gateway failure")
raise HTTPException(status_code=500, detail="Internal processing error")
Rationale: Dependency injection via Depends ensures the HTTP client is properly scoped and closed. Strict timeout boundaries prevent indefinite hangs if the upstream provider experiences latency spikes. Pydantic field constraints (le, ge) validate input before it reaches the network layer, reducing unnecessary API calls. Structured logging captures upstream errors without leaking sensitive headers.
Step 3: Environment-Driven Configuration
Hardcoded values create deployment friction. Externalize configuration using environment variables with strong typing.
# src/config.py
import os
from dataclasses import dataclass
@dataclass(frozen=True)
class ServiceConfig:
provider_token: str
default_model: str
max_concurrent_requests: int
log_level: str
def __post_init__(self):
if not self.provider_token:
raise ValueError("PROVIDER_AUTH_TOKEN must be set")
def load_config() -> ServiceConfig:
return ServiceConfig(
provider_token=os.environ.get("PROVIDER_AUTH_TOKEN", ""),
default_model=os.environ.get("TARGET_MODEL", "claude-3-5-sonnet-20241022"),
max_concurrent_requests=int(os.environ.get("MAX_CONCURRENCY", "50")),
log_level=os.environ.get("LOG_LEVEL", "INFO")
)
Rationale: Freezing the dataclass prevents accidental mutation at runtime. Validation in __post_init__ fails fast during container startup rather than during the first request. Environment variables align with the 12-factor app methodology, enabling seamless rotation of secrets without rebuilding images.
Step 4: Orchestration & Health Monitoring
Docker Compose ties the API service to supporting infrastructure while enforcing lifecycle policies.
# docker-compose.yml
version: "3.9"
services:
inference-gateway:
build:
context: .
dockerfile: Dockerfile
ports:
- "8080:8080"
environment:
- PROVIDER_AUTH_TOKEN=${PROVIDER_AUTH_TOKEN}
- TARGET_MODEL=claude-3-5-sonnet-20241022
- LOG_LEVEL=INFO
restart: on-failure
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/ready"]
interval: 20s
timeout: 5s
retries: 3
start_period: 10s
deploy:
resources:
limits:
cpus: "2.0"
memory: 2G
reservations:
cpus: "0.5"
memory: 512M
cache-layer:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
- cache-storage:/data
command: redis-server --save 60 1 --loglevel warning
volumes:
cache-storage:
Rationale: The healthcheck directive enables orchestrators to detect degraded states and restart unhealthy containers automatically. Resource limits prevent memory exhaustion during high-concurrency inference bursts. Redis is configured with minimal persistence (save 60 1) to balance durability and write amplification. The start_period gives the Python runtime adequate time to initialize before health probes begin.
Pitfall Guide
1. CUDA/Driver Version Mismatch
Explanation: Pulling a generic Python image and installing PyTorch or TensorFlow without matching the host's NVIDIA driver version causes silent segmentation faults or fallback to CPU execution.
Fix: Use NVIDIA's official CUDA base images (nvidia/cuda:12.1.0-base-ubuntu22.04) and verify driver compatibility with nvidia-smi. Pin CUDA toolkit versions in requirements.txt using --extra-index-url when necessary.
2. Bloated Final Images
Explanation: Including build tools, documentation, and test dependencies in the production image increases pull times, storage costs, and attack surface.
Fix: Implement multi-stage builds. Compile dependencies in a builder stage, then copy only the runtime artifacts to a minimal slim or alpine final stage. Remove apt caches and pip wheels after installation.
3. Synchronous Blocking in Async Endpoints
Explanation: Using requests or synchronous database calls inside FastAPI async routes blocks the event loop, causing request queuing and timeout cascades under load.
Fix: Replace all I/O operations with async equivalents (httpx, asyncpg, aioredis). Use asyncio.to_thread only for CPU-bound legacy code that cannot be refactored.
4. Hardcoded Secrets & Missing Validation
Explanation: Embedding API keys in source code or Dockerfiles exposes credentials in version control and image layers.
Fix: Inject secrets via environment variables or orchestration secret managers. Validate presence at startup. Never log request headers containing authorization tokens.
5. Unbounded Memory/CPU Allocation
Explanation: Containers without resource limits can consume all host resources, causing OOM kills that affect neighboring services or the host OS.
Fix: Define explicit deploy.resources.limits in compose files or Kubernetes manifests. Monitor actual usage with docker stats and adjust limits based on p95 memory consumption during load testing.
6. Ignoring Connection Pool Exhaustion
Explanation: Creating a new HTTP client per request or failing to close connections leads to file descriptor leaks and port exhaustion.
Fix: Reuse httpx.AsyncClient instances via dependency injection or singleton patterns. Configure limits=httpx.Limits(max_connections=100, max_keepalive_connections=20) to match expected concurrency.
7. No Graceful Shutdown Handling
Explanation: Abrupt container termination drops in-flight inference requests, causing client retries and upstream rate limit violations.
Fix: Implement signal handlers for SIGTERM. Drain active connections, flush logs, and close database/client pools before exiting. FastAPI handles this automatically when configured with --timeout-keep-alive and proper orchestration termination grace periods.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Low-traffic prototype | Single-stage container with python:slim | Faster iteration, simpler debugging | Low storage, higher cold start |
| Production LLM gateway | Multi-stage build + async httpx + connection pooling | Reduced image size, predictable latency, resource isolation | Moderate build complexity, lower runtime costs |
| GPU-accelerated local inference | NVIDIA CUDA base image + deploy.resources.devices | Direct hardware access, driver compatibility | Higher cloud VM costs, requires compatible host |
| High-concurrency API proxy | Redis cache layer + request deduplication | Reduces upstream API calls, improves p99 latency | Additional infrastructure cost, offset by API savings |
| Multi-region deployment | Registry mirroring + healthcheck routing | Faster global pulls, automatic failover | Increased registry storage, improved availability |
Configuration Template
# Dockerfile.production
FROM python:3.11-slim AS runtime
WORKDIR /opt/app
RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates && \
rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ ./src/
COPY config.yaml .
ENV PYTHONUNBUFFERED=1
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD curl -f http://localhost:8080/ready || exit 1
CMD ["uvicorn", "src.gateway:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]
# docker-compose.prod.yml
version: "3.9"
services:
llm-gateway:
build:
context: .
dockerfile: Dockerfile.production
environment:
- PROVIDER_AUTH_TOKEN=${PROVIDER_AUTH_TOKEN}
- LOG_LEVEL=WARNING
deploy:
resources:
limits:
cpus: "4.0"
memory: 4G
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/ready"]
interval: 15s
timeout: 3s
retries: 5
networks:
- backend
redis-cache:
image: redis:7-alpine
command: redis-server --maxmemory 1gb --maxmemory-policy allkeys-lru
networks:
- backend
networks:
backend:
driver: bridge
Quick Start Guide
- Initialize the project structure: Create
src/, requirements.txt, and Dockerfile.production. Add fastapi, uvicorn, httpx, and pydantic to requirements.
- Set environment variables: Export
PROVIDER_AUTH_TOKEN with your ofox.ai credential. Verify the token has inference permissions.
- Build and validate: Run
docker build -f Dockerfile.production -t llm-gateway:latest .. Confirm the image size is under 500MB.
- Launch locally: Execute
docker compose -f docker-compose.prod.yml up -d. Monitor startup with docker compose logs -f llm-gateway.
- Test the endpoint: Send a POST request to
http://localhost:8080/v1/infer with a valid JSON payload. Verify the response contains generated text and token usage metrics.