1. Environment Setup
Create the project directory, set up a virtual environment, and install dependencies. The stack includes FastAPI for the framework, Uvicorn for the ASGI server, HTTPX for async HTTP communication, and Pydantic for data validation.
mkdir ai-inference-gateway
cd ai-inference-gateway
python3 -m venv venv
source venv/bin/activate
pip install fastapi uvicorn httpx pydantic python-dotenv
2. Service Architecture
The implementation uses Pydantic models to define the contract for incoming requests and outgoing responses. An async httpx client manages connections to the ofox.ai endpoint. Authentication is handled via a dependency injection pattern.
main.py
import os
from typing import List, Optional

from dotenv import load_dotenv
from fastapi import FastAPI, Depends, HTTPException, status
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
import httpx

# Configuration
load_dotenv()  # read OFOX_API_KEY from a local .env file, if present
API_KEY = os.getenv("OFOX_API_KEY")
BASE_URL = "https://api.ofox.ai/v1"
SECURITY_SCHEME = HTTPBearer()

# Pydantic Models
class ConversationTurn(BaseModel):
    role: str
    content: str

class InferencePayload(BaseModel):
    model: str = Field(default="claude-3-5-sonnet-20241022")
    messages: List[ConversationTurn]
    max_tokens: Optional[int] = Field(default=1024, ge=1, le=4096)
    temperature: Optional[float] = Field(default=0.7, ge=0.0, le=2.0)

class InferenceResult(BaseModel):
    content: str
    model_id: str
    token_count: int

# Application Instance
llm_gateway = FastAPI(title="AI Inference Gateway", version="1.0.0")

# Dependencies
async def validate_credentials(
    credentials: HTTPAuthorizationCredentials = Depends(SECURITY_SCHEME),
) -> str:
    # Reject anything that is not a Bearer token matching the configured key.
    if credentials.scheme != "Bearer" or credentials.credentials != API_KEY:
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid or missing API key",
        )
    return credentials.credentials

# Endpoints
@llm_gateway.post("/v1/inference", response_model=InferenceResult)
async def handle_inference(
    payload: InferencePayload,
    token: str = Depends(validate_credentials),
):
    async with httpx.AsyncClient(timeout=60.0) as client:
        try:
            response = await client.post(
                f"{BASE_URL}/chat/completions",
                headers={"Authorization": f"Bearer {token}"},
                json={
                    "model": payload.model,
                    "messages": [turn.model_dump() for turn in payload.messages],
                    "max_tokens": payload.max_tokens,
                    "temperature": payload.temperature,
                },
            )
            response.raise_for_status()
            data = response.json()
            return InferenceResult(
                content=data["choices"][0]["message"]["content"],
                model_id=data["model"],
                token_count=data["usage"]["total_tokens"],
            )
        except httpx.HTTPStatusError as exc:
            # Propagate the upstream status code and body verbatim.
            raise HTTPException(status_code=exc.response.status_code, detail=exc.response.text)
        except Exception:
            # Mask everything else to avoid leaking internals.
            raise HTTPException(status_code=500, detail="Internal service error")

@llm_gateway.post("/v1/inference/stream")
async def handle_streaming_inference(
    payload: InferencePayload,
    token: str = Depends(validate_credentials),
):
    async def generate_tokens():
        async with httpx.AsyncClient(timeout=60.0) as client:
            async with client.stream(
                "POST",
                f"{BASE_URL}/chat/completions",
                headers={"Authorization": f"Bearer {token}"},
                json={
                    "model": payload.model,
                    "messages": [turn.model_dump() for turn in payload.messages],
                    "max_tokens": payload.max_tokens,
                    "temperature": payload.temperature,
                    "stream": True,
                },
            ) as response:
                response.raise_for_status()
                async for line in response.aiter_lines():
                    if line.startswith("data: "):
                        payload_str = line[6:]
                        if payload_str == "[DONE]":
                            break
                        # Re-emit as a well-formed SSE event for downstream clients.
                        yield f"data: {payload_str}\n\n"

    return StreamingResponse(
        generate_tokens(),
        media_type="text/event-stream",
    )
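With the service running, a minimal Python client built on the same httpx stack can exercise the non-streaming endpoint. This is an illustrative sketch for local testing: the localhost URL and the ask_gateway helper are assumptions, not part of the gateway itself.
import asyncio
import os

import httpx

GATEWAY_URL = "http://localhost:8000/v1/inference"  # assumed local dev address

async def ask_gateway(question: str) -> str:
    # The body mirrors the InferencePayload schema defined in main.py.
    body = {"messages": [{"role": "user", "content": question}], "max_tokens": 256}
    headers = {"Authorization": f"Bearer {os.getenv('OFOX_API_KEY', '')}"}
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(GATEWAY_URL, headers=headers, json=body)
        response.raise_for_status()
        # InferenceResult exposes content, model_id, and token_count.
        return response.json()["content"]

if __name__ == "__main__":
    print(asyncio.run(ask_gateway("Say hello.")))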
3. Architecture Decisions
- Async Client (`httpx`): Used instead of synchronous libraries to prevent blocking the event loop during I/O waits. This allows the service to handle multiple inference requests concurrently without spawning additional threads.
- Pydantic Validation: Schemas enforce type safety and constraints (e.g., the `max_tokens` range) before the request reaches the business logic, reducing downstream errors.
- Dependency Injection: Authentication is decoupled from endpoint logic using `Depends`, promoting reusability and cleaner code structure.
- Streaming Implementation: The `/stream` endpoint uses `StreamingResponse` with an async generator to yield Server-Sent Events (SSE) as tokens arrive, minimizing time-to-first-token for clients.
- Error Handling: `httpx.HTTPStatusError` is caught to propagate upstream errors accurately, while generic exceptions are masked to prevent information leakage.
Pitfall Guide
Production AI services face unique challenges. The following pitfalls and remedies are derived from real-world deployment experience.
- Event Loop Blocking
  - Explanation: Using synchronous HTTP clients (e.g., `requests`) inside async endpoints blocks the entire event loop, causing severe latency spikes under load.
  - Fix: Always use async-compatible libraries like `httpx` or `aiohttp` for external API calls.
- Context Window Overflow
  - Explanation: Sending prompts that exceed the model's context limit results in API errors and wasted compute.
  - Fix: Implement token counting or length validation in the Pydantic models before forwarding requests (see the first sketch after this list).
- Streaming State Corruption
  - Explanation: Improper parsing of SSE chunks can lead to malformed JSON or missing tokens in the client stream.
  - Fix: Strictly validate the `data: ` prefix and handle the `[DONE]` sentinel correctly. Test with network jitter to ensure robustness.
- Credential Hardcoding
  - Explanation: Embedding API keys in source code poses a security risk and complicates rotation.
  - Fix: Use environment variables via `python-dotenv` or a secrets manager. Never commit keys to version control.
- Timeout Neglect
  - Explanation: LLM inference times vary widely. Default timeouts may cause premature request termination.
  - Fix: Configure explicit timeouts on `httpx.AsyncClient` and align them with service-level objectives (SLOs).
- Retry Storms
  - Explanation: Blindly retrying failed requests can overwhelm the upstream API during outages.
  - Fix: Implement exponential backoff with jitter for transient errors, and use circuit breakers for persistent failures (see the second sketch after this list).
- Pydantic v2 Serialization
  - Explanation: Using deprecated methods like `.dict()` in Pydantic v2 raises warnings or errors.
  - Fix: Use `.model_dump()` for serialization and `.model_validate()` for deserialization.
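As a sketch of the context-window fix: without pulling in a tokenizer, a rough character budget can be enforced directly on the Pydantic model. The MAX_PROMPT_CHARS value below is an illustrative assumption, not a real model limit; swap in proper token counting for production.
from typing import List

from pydantic import BaseModel, model_validator

MAX_PROMPT_CHARS = 16_000  # illustrative budget, not a real model limit

class ConversationTurn(BaseModel):
    role: str
    content: str

class BoundedInferencePayload(BaseModel):
    messages: List[ConversationTurn]

    @model_validator(mode="after")
    def check_total_length(self):
        # Reject oversized prompts before any upstream tokens are spent.
        total = sum(len(turn.content) for turn in self.messages)
        if total > MAX_PROMPT_CHARS:
            raise ValueError(
                f"prompt too long: {total} characters exceeds budget of {MAX_PROMPT_CHARS}"
            )
        return self
Because it relies on Pydantic v2's model_validator, the same pattern also sidesteps the deprecated v1 methods flagged in the last pitfall.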
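And a combined sketch for the timeout and retry pitfalls: granular httpx.Timeout settings plus exponential backoff with full jitter. The retry count and delay ceiling are placeholder values to be tuned against your SLOs; a circuit breaker would sit on top of this.
import asyncio
import random

import httpx

# Granular timeouts: fail fast on connect, allow long reads for slow generations.
TIMEOUTS = httpx.Timeout(connect=5.0, read=120.0, write=10.0, pool=5.0)

async def post_with_backoff(url: str, headers: dict, body: dict, retries: int = 3) -> httpx.Response:
    async with httpx.AsyncClient(timeout=TIMEOUTS) as client:
        response = None
        for attempt in range(retries + 1):
            try:
                response = await client.post(url, headers=headers, json=body)
                # Only 429 and 5xx responses are worth retrying; return everything else.
                if response.status_code != 429 and response.status_code < 500:
                    return response
            except httpx.TransportError:
                # Network-level failure: re-raise once the retry budget is spent.
                if attempt == retries:
                    raise
            if attempt < retries:
                # Exponential backoff with full jitter: sleep 0..2^attempt seconds.
                await asyncio.sleep(random.uniform(0, 2 ** attempt))
        return response  # last 429/5xx response after exhausting retries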
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Real-time Chat UI | Streaming Endpoint | Reduces perceived latency; improves UX | Higher bandwidth usage |
| Batch Processing | Non-streaming Endpoint | Simpler error handling; easier aggregation | Lower overhead per request |
| High-Volume Repetitive Queries | Response Caching | Avoids redundant API calls (see the sketch below) | Reduced API costs; added cache infra |
| Strict Compliance | Input Sanitization Middleware | Prevents prompt injection and data leaks | Minimal compute overhead |
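To illustrate the Response Caching row, here is a deliberately naive in-process cache keyed on a canonical hash of the request body. It assumes exact-match repetition; anything production-grade would add TTLs, an eviction policy, or an external store such as Redis.
import hashlib
import json

# Naive in-process cache: canonical request hash -> upstream response payload.
_cache: dict[str, dict] = {}

def cache_key(body: dict) -> str:
    # Canonical JSON so semantically identical payloads hash identically.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def get_cached(body: dict) -> dict | None:
    return _cache.get(cache_key(body))

def store_result(body: dict, result: dict) -> None:
    _cache[cache_key(body)] = result
Note that caching only pays off when generation is deterministic for a given payload, e.g. with temperature set to 0.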
Configuration Template
Dockerfile
FROM python:3.11-slim AS base
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:llm_gateway", "--host", "0.0.0.0", "--port", "8000"]
pyproject.toml (Dependencies)
[project]
name = "ai-inference-gateway"
version = "1.0.0"
dependencies = [
"fastapi>=0.100.0",
"uvicorn>=0.23.0",
"httpx>=0.24.0",
"pydantic>=2.0.0",
"python-dotenv>=1.0.0"
]
Quick Start Guide
- Initialize Environment: Run `python3 -m venv venv && source venv/bin/activate`, then install the dependencies.
- Configure Secrets: Create a `.env` file with `OFOX_API_KEY=your-key-here`.
- Launch Service: Execute `uvicorn main:llm_gateway --reload` to start the development server.
- Verify Endpoint: Access `http://localhost:8000/docs` to test the `/v1/inference` endpoint via the interactive UI.
- Deploy: Build the Docker image using `docker build -t ai-gateway .` and run it with `docker run -p 8000:8000 --env-file .env ai-gateway`.