Implementation Steps
1. Define the Model Adapter Interface
The adapter interface standardizes how the gateway interacts with any partner. This ensures that business logic remains agnostic to the underlying provider.
export interface InferenceRequest {
  modelId: string;
  prompt: string;
  parameters: Record<string, unknown>;
  metadata: {
    tenantId: string;
    userId: string;
    correlationId: string;
  };
}

export interface InferenceResponse {
  content: string;
  tokensUsed: number;
  modelVersion: string;
  latencyMs: number;
  metadata: Record<string, unknown>;
}

// Loose alias (assumed); swap in the type exported by your schema library.
export type JsonSchema = Record<string, unknown>;

export interface ModelAdapter {
  readonly providerId: string;
  infer(request: InferenceRequest): Promise<InferenceResponse>;
  healthCheck(): Promise<boolean>;
  getSchema(): JsonSchema;
}
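To exercise routing before a real partner is wired in, a mock adapter can implement the interface. The following is a minimal sketch: `EchoAdapter`, its `mock-echo` provider ID, the `mock-1` version string, and the whitespace-based token count are illustrative assumptions, and the interfaces are repeated in trimmed form so the block compiles on its own.

```typescript
// Trimmed copies of the interfaces above so this sketch is self-contained.
interface InferenceRequest {
  modelId: string;
  prompt: string;
  parameters: Record<string, unknown>;
  metadata: { tenantId: string; userId: string; correlationId: string };
}

interface InferenceResponse {
  content: string;
  tokensUsed: number;
  modelVersion: string;
  latencyMs: number;
  metadata: Record<string, unknown>;
}

// Assumed loose alias; the source does not define JsonSchema.
type JsonSchema = Record<string, unknown>;

interface ModelAdapter {
  readonly providerId: string;
  infer(request: InferenceRequest): Promise<InferenceResponse>;
  healthCheck(): Promise<boolean>;
  getSchema(): JsonSchema;
}

// Hypothetical mock adapter: echoes the prompt back without any network call.
class EchoAdapter implements ModelAdapter {
  readonly providerId = 'mock-echo';

  async infer(request: InferenceRequest): Promise<InferenceResponse> {
    const start = Date.now();
    // Crude whitespace split as a stand-in for a real tokenizer.
    const tokensUsed = request.prompt.split(/\s+/).filter(Boolean).length;
    return {
      content: `echo: ${request.prompt}`,
      tokensUsed,
      modelVersion: 'mock-1',
      latencyMs: Date.now() - start,
      metadata: { correlationId: request.metadata.correlationId },
    };
  }

  async healthCheck(): Promise<boolean> {
    return true;
  }

  getSchema(): JsonSchema {
    return {
      type: 'object',
      required: ['content', 'tokensUsed', 'modelVersion', 'latencyMs'],
    };
  }
}
```

Because the mock satisfies the same interface as a real adapter, it can be registered with the gateway unchanged and later swapped for a provider-backed implementation.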
2. Implement the Partnership Gateway
The gateway orchestrates the request lifecycle: PII redaction, quota checks, circuit breaking, and metering. The code below sketches a gateway with these safety controls; schema validation is left as a stub to fill in.
// opossum exports CircuitBreaker as the module itself, so use a default import
// (requires esModuleInterop in tsconfig.json)
import CircuitBreaker from 'opossum';
import { PIIRedactor } from './pii-redactor';
import { MeteringService } from './metering';
import { ModelAdapter, InferenceRequest, InferenceResponse } from './types';

export class PartnershipGateway {
  private adapters: Map<string, ModelAdapter>;
  private circuitBreakers: Map<string, CircuitBreaker>;
  private redactor: PIIRedactor;
  private meter: MeteringService;

  constructor() {
    this.adapters = new Map();
    this.circuitBreakers = new Map();
    this.redactor = new PIIRedactor({ entities: ['EMAIL', 'SSN', 'PHONE'] });
    this.meter = new MeteringService();
  }

  registerAdapter(adapter: ModelAdapter): void {
    this.adapters.set(adapter.providerId, adapter);
    // Circuit breaker: 5s call timeout, opens at a 50% error rate,
    // and lets a trial request through again after 10s
    const breaker = new CircuitBreaker(adapter.infer.bind(adapter), {
      timeout: 5000,
      errorThresholdPercentage: 50,
      resetTimeout: 10000,
    });
    this.circuitBreakers.set(adapter.providerId, breaker);
  }

  async route(request: InferenceRequest): Promise<InferenceResponse> {
    // Adapters and breakers are keyed by providerId, so modelId must carry the
    // provider ID here; alias resolution is assumed to happen before this call.
    const adapter = this.adapters.get(request.modelId);
    const breaker = this.circuitBreakers.get(request.modelId);
    if (!adapter || !breaker) {
      throw new Error(`Adapter not found for model: ${request.modelId}`);
    }

    // 1. PII Redaction
    const sanitizedRequest = {
      ...request,
      prompt: this.redactor.sanitize(request.prompt),
    };

    // 2. Metering and Quota Check
    await this.meter.checkQuota(request.metadata.tenantId, request.modelId);

    // 3. Execute with Circuit Breaker
    const startTime = Date.now();
    try {
      const response = (await breaker.fire(sanitizedRequest)) as InferenceResponse;

      // 4. Post-processing and Validation
      const validatedResponse = await this.validateSchema(response, adapter);

      // 5. Record Metrics
      await this.meter.recordUsage(
        request.metadata.tenantId,
        request.modelId,
        response.tokensUsed,
        Date.now() - startTime
      );
      return validatedResponse;
    } catch (error) {
      // Fallback logic or error propagation.
      // Caught values are typed unknown, so normalize to an Error first.
      const cause = error instanceof Error ? error : new Error(String(error));
      await this.meter.recordError(request.modelId, cause);
      throw new PartnershipError(`Inference failed via ${request.modelId}`, cause);
    }
  }

  private async validateSchema(
    response: InferenceResponse,
    adapter: ModelAdapter
  ): Promise<InferenceResponse> {
    // Stub: implement JSON schema validation against adapter.getSchema().
    // If validation fails, trigger an alert and potentially fall back.
    return response;
  }
}

class PartnershipError extends Error {
  constructor(message: string, public cause?: Error) {
    super(message);
    this.name = 'PartnershipError';
  }
}
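The gateway imports `PIIRedactor` from `./pii-redactor` without showing it. A minimal stand-in that matches the constructor options and the `sanitize` call used above might look like the sketch below. The regular expressions are illustrative assumptions that only cover a few common US-style formats, so treat this as a placeholder for local testing rather than a substitute for a dedicated detection service.

```typescript
type PIIEntity = 'EMAIL' | 'SSN' | 'PHONE';

// Illustrative patterns only (assumed formats); real detection needs more than regex.
const PATTERNS: Record<PIIEntity, RegExp> = {
  EMAIL: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,
  SSN: /\b\d{3}-\d{2}-\d{4}\b/g,
  PHONE: /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/g,
};

class PIIRedactor {
  private entities: PIIEntity[];

  constructor(options: { entities: PIIEntity[] }) {
    this.entities = options.entities;
  }

  // Replaces every match of each configured entity with a tag such as [EMAIL].
  sanitize(text: string): string {
    return this.entities.reduce(
      (acc, entity) => acc.replace(PATTERNS[entity], `[${entity}]`),
      text,
    );
  }
}

const redactor = new PIIRedactor({ entities: ['EMAIL', 'SSN', 'PHONE'] });
const redacted = redactor.sanitize(
  'Reach jane@example.com or 555-123-4567, SSN 123-45-6789.',
);
```

Keeping the entity list in the constructor means the same class can be configured differently per partner, which fits the per-partnership redaction policy discussed later.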
3. Multi-Tenant Metering and Cost Control
Partnerships often involve complex billing models. The gateway must enforce cost caps and provide granular usage reporting.
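// Minimal supporting types (assumed; adapt to your billing model)
export interface TenantQuota {
  maxTokens: number;
}
export interface AnalyticsSink {
  emit(event: Record<string, unknown>): Promise<void>;
}
export class QuotaExceededError extends Error {}

export class MeteringService {
  private quotas: Map<string, TenantQuota> = new Map();
  // In-memory counters keyed by `${tenantId}:${modelId}`; use a shared store in production
  private usage: Map<string, number> = new Map();

  // The default sink discards events so the service can run without a pipeline
  constructor(private analyticsSink: AnalyticsSink = { emit: async () => {} }) {}

  setQuota(tenantId: string, quota: TenantQuota): void {
    this.quotas.set(tenantId, quota);
  }

  async checkQuota(tenantId: string, modelId: string): Promise<void> {
    const quota = this.quotas.get(tenantId);
    if (!quota) throw new QuotaExceededError('No quota defined');
    const currentUsage = await this.getUsage(tenantId, modelId);
    if (currentUsage >= quota.maxTokens) {
      throw new QuotaExceededError(`Tenant ${tenantId} exceeded token limit`);
    }
  }

  async recordUsage(tenantId: string, modelId: string, tokens: number, latency: number): Promise<void> {
    // Update real-time counters for quota enforcement
    const key = `${tenantId}:${modelId}`;
    this.usage.set(key, (this.usage.get(key) ?? 0) + tokens);
    // Stream metrics to analytics pipeline
    await this.analyticsSink.emit({
      event: 'inference_complete',
      tenantId,
      modelId,
      tokens,
      latency,
      timestamp: Date.now(),
    });
  }

  // Called by the gateway when inference fails
  async recordError(modelId: string, error: unknown): Promise<void> {
    const message = error instanceof Error ? error.message : String(error);
    await this.analyticsSink.emit({ event: 'inference_error', modelId, message, timestamp: Date.now() });
  }

  private async getUsage(tenantId: string, modelId: string): Promise<number> {
    return this.usage.get(`${tenantId}:${modelId}`) ?? 0;
  }
}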
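The service above enforces a token quota but not a monetary cap. One way to layer a cost cap on top is sketched below. `CostCap`, `projectedCostUsd`, `withinCostCap`, and the example price and ceiling are hypothetical names and values, not part of the service shown above.

```typescript
// Hypothetical shape: a monthly spend ceiling for one tenant, in USD.
interface CostCap {
  maxMonthlyUsd: number;
}

// Estimated monthly spend if a request consuming `requestTokens` were admitted.
function projectedCostUsd(
  usedTokens: number,
  requestTokens: number,
  usdPerToken: number,
): number {
  return (usedTokens + requestTokens) * usdPerToken;
}

// Returns true when admitting the request keeps the tenant within the cap.
function withinCostCap(
  cap: CostCap,
  usedTokens: number,
  requestTokens: number,
  usdPerToken: number,
): boolean {
  return projectedCostUsd(usedTokens, requestTokens, usdPerToken) <= cap.maxMonthlyUsd;
}

const cap: CostCap = { maxMonthlyUsd: 20 };
// 950,000 tokens at $0.00002 each is about $19, which is under the cap.
const underCap = withinCostCap(cap, 900_000, 50_000, 0.00002);
// 1,040,000 tokens at $0.00002 each is about $20.80, which is over the cap.
const overCap = withinCostCap(cap, 990_000, 50_000, 0.00002);
```

Checking the projected cost before the call, rather than the recorded cost after it, prevents a single large request from pushing a tenant past the cap. Floating-point rounding matters near the boundary, so a production version might track spend in integer micro-dollars.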
4. Configuration-Driven Routing
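Hardcoding routing logic limits flexibility. Use a configuration file to define model aliases, fallback chains, and references to partner credentials. The file should point at secrets (as api_key_ref does below) rather than contain them.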
# ai-partnership-config.yaml
partners:
  - id: partner-alpha
    adapter: OpenAIAdapter
    endpoint: https://api.partner-alpha.com/v1
    api_key_ref: secrets/partner-alpha/key
    fallbacks: [partner-beta]
    rate_limit: 1000 req/min
    schema_version: v2.1
  - id: partner-beta
    adapter: AnthropicAdapter
    endpoint: https://api.partner-beta.com
    api_key_ref: secrets/partner-beta/key
    fallbacks: []
    rate_limit: 500 req/min
    schema_version: v1.0
routing:
  models:
    - alias: "fast-text"
      primary: partner-alpha
      fallback_chain: [partner-beta]
    - alias: "secure-code"
      primary: partner-beta
      fallback_chain: []
      pii_filter: strict
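Given a configuration like the one above, the gateway can resolve a model alias into an ordered list of providers to try. The sketch below assumes the YAML has already been parsed into a plain object; `RoutingConfig`, `ModelRoute`, and `resolveChain` are hypothetical names introduced for illustration.

```typescript
interface ModelRoute {
  alias: string;
  primary: string;
  fallback_chain: string[];
}

interface RoutingConfig {
  partners: { id: string }[];
  routing: { models: ModelRoute[] };
}

// Returns the primary provider followed by its fallbacks, in order.
// Throws if the alias is unknown or references a partner that is not declared.
function resolveChain(config: RoutingConfig, alias: string): string[] {
  const route = config.routing.models.find((m) => m.alias === alias);
  if (!route) throw new Error(`Unknown model alias: ${alias}`);

  const known = new Set(config.partners.map((p) => p.id));
  const chain = [route.primary, ...route.fallback_chain];
  for (const id of chain) {
    if (!known.has(id)) {
      throw new Error(`Alias ${alias} references undeclared partner: ${id}`);
    }
  }
  return chain;
}

// Parsed form of the relevant parts of the YAML above.
const routingConfig: RoutingConfig = {
  partners: [{ id: 'partner-alpha' }, { id: 'partner-beta' }],
  routing: {
    models: [
      { alias: 'fast-text', primary: 'partner-alpha', fallback_chain: ['partner-beta'] },
      { alias: 'secure-code', primary: 'partner-beta', fallback_chain: [] },
    ],
  },
};

const fastTextChain = resolveChain(routingConfig, 'fast-text');
```

Validating partner references at resolution time (or, better, once at startup) surfaces a typo in the config as a clear error instead of a failed lookup in the middle of a request.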
Pitfall Guide
1. Hardcoding Provider SDKs
Mistake: Importing provider SDKs (e.g., the openai package) directly into business logic classes.
Impact: Vendor lock-in becomes structural. Switching partners requires rewriting core code, testing all flows, and redeploying the entire stack.
Best Practice: Use the Adapter pattern. Business logic should only depend on the ModelAdapter interface. SDK imports are isolated within the adapter implementation.
2. Ignoring Token Drift and Cost Volatility
Mistake: Assuming token counts remain constant for similar prompts.
Impact: Partners may update models, changing tokenization efficiency or output length. Costs can spike 300% overnight without warning.
Best Practice: Implement real-time token monitoring and alerting. Set hard caps on max_tokens and use streaming responses to cut off excessive generation. Monitor cost-per-request trends, not just total spend.
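Monitoring the per-request trend can start as a comparison between a recent window and a baseline. The following sketch assumes per-request token counts are already being recorded; `detectTokenDrift` and the 1.5x threshold are illustrative choices rather than recommended values.

```typescript
// Arithmetic mean; returns 0 for an empty list.
function mean(values: number[]): number {
  return values.length === 0
    ? 0
    : values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Flags drift when the recent mean is at least `ratio` times the baseline mean.
function detectTokenDrift(
  baseline: number[],
  recent: number[],
  ratio = 1.5,
): boolean {
  const base = mean(baseline);
  if (base === 0) return false; // no baseline yet, nothing to compare against
  return mean(recent) / base >= ratio;
}

const baselineTokens = [100, 110, 90, 100]; // mean 100
const stable = detectTokenDrift(baselineTokens, [105, 95, 100]); // recent mean 100
const drifted = detectTokenDrift(baselineTokens, [160, 150, 170]); // recent mean 160
```

A ratio against a baseline catches gradual drift in tokens per request that a total-spend alert would miss when traffic volume is also changing.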
3. Lack of Schema Versioning
Mistake: Assuming model output structure remains stable.
Impact: Upstream model updates can change JSON keys, remove fields, or alter types, breaking downstream parsers.
Best Practice: Define strict JSON schemas for inputs and outputs. Implement a schema registry. If a partner updates their model, require a new schema version. The gateway should validate responses against the expected schema and trigger alerts on deviation.
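A schema registry can start as a map from schema version to validator. The sketch below uses a hand-written check so that it has no dependencies; in practice a schema validation library would replace it. The `chat_response_v1` key, `schemaRegistry`, and `validateResponse` are hypothetical names.

```typescript
// A validator returns a list of violations; an empty list means the value conforms.
type Validator = (value: unknown) => string[];

// Hypothetical registry keyed by schema version.
const schemaRegistry = new Map<string, Validator>([
  [
    'chat_response_v1',
    (value: unknown): string[] => {
      if (typeof value !== 'object' || value === null) {
        return ['response is not an object'];
      }
      const v = value as Record<string, unknown>;
      const errors: string[] = [];
      if (typeof v.content !== 'string') errors.push('content must be a string');
      if (typeof v.tokensUsed !== 'number') errors.push('tokensUsed must be a number');
      return errors;
    },
  ],
]);

// Validates a partner response against the expected schema version.
// An unregistered version is itself reported as a violation.
function validateResponse(version: string, value: unknown): string[] {
  const validator = schemaRegistry.get(version);
  if (!validator) return [`no schema registered for version ${version}`];
  return validator(value);
}

// A partner renaming tokensUsed to tokens_used is reported as a violation.
const violations = validateResponse('chat_response_v1', {
  content: 'hi',
  tokens_used: 3,
});
```

Returning a list of violations instead of throwing lets the gateway decide per route whether a deviation should raise an alert, trigger a fallback, or fail the request.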
4. Insufficient PII Redaction
Mistake: Relying on the partner to handle data privacy or using simple regex.
Impact: Regulatory violations (GDPR, HIPAA). Data leakage of sensitive user information to third-party endpoints.
Best Practice: Deploy a dedicated PII redaction service before requests leave your infrastructure. Use NER (Named Entity Recognition) models to detect and mask sensitive data. Ensure redaction is configurable per partnership based on data classification.
5. Missing Fallback Strategies
Mistake: Single-point-of-failure integration with one partner.
Impact: A partner outage takes the entire service down.
Best Practice: Configure fallback chains. If partner-alpha fails or exceeds latency thresholds, automatically route to partner-beta. Implement circuit breakers to prevent cascading failures. Test fallbacks regularly via chaos engineering.
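The core of a fallback chain is a loop that tries each provider in order and stops at the first success. The sketch below simulates partners with plain async functions; `inferWithFallback` and the `Provider` type are hypothetical, and latency thresholds and circuit breaking are deliberately left out to keep the example short.

```typescript
type Provider = (prompt: string) => Promise<string>;

interface ChainEntry {
  id: string;
  call: Provider;
}

// Tries each provider in order and returns the first successful result.
// Failure messages are collected so the final error explains every attempt.
async function inferWithFallback(
  chain: ChainEntry[],
  prompt: string,
): Promise<{ providerId: string; content: string }> {
  const failures: string[] = [];
  for (const { id, call } of chain) {
    try {
      return { providerId: id, content: await call(prompt) };
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      failures.push(`${id}: ${message}`);
    }
  }
  throw new Error(`All providers failed (${failures.join('; ')})`);
}

// Simulated partners: the primary is down, the fallback answers.
const simulatedChain: ChainEntry[] = [
  {
    id: 'partner-alpha',
    call: async () => {
      throw new Error('503 from upstream');
    },
  },
  {
    id: 'partner-beta',
    call: async (prompt: string) => `beta says: ${prompt}`,
  },
];
```

Calling `inferWithFallback(simulatedChain, 'hi')` resolves with the response from `partner-beta`. Returning the `providerId` alongside the content lets metering and tracing attribute the request to the partner that actually served it.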
6. Inadequate Observability
Mistake: Treating AI calls as black boxes with no tracing.
Impact: Inability to diagnose latency spikes, errors, or quality degradation.
Best Practice: Instrument every gateway call with distributed tracing. Include correlationId in all requests. Log hashes of inputs and outputs rather than raw content (to protect privacy), along with latency, token usage, and model version. Correlate traces across the gateway and partner endpoints.
7. Neglecting Rate Limit Handling
Mistake: Failing to implement exponential backoff and jitter.
Impact: Throttling errors from partners lead to request failures and poor user experience.
Best Practice: Implement retry logic with exponential backoff and jitter. Respect Retry-After headers. Use client-side rate limiting to stay within partner quotas. Queue requests during burst traffic rather than dropping them.
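The delay calculation behind this advice is small enough to sketch directly. The version below uses the "full jitter" variant of exponential backoff and handles only the delay-in-seconds form of Retry-After; the function names, the 200 ms base, and the 10 s cap are illustrative assumptions.

```typescript
// Delay before retry number `attempt` (0-based), using full jitter:
// a random value in [0, min(capMs, baseMs * 2^attempt)).
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  baseMs = 200,
  capMs = 10_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Parses a Retry-After header given as whole seconds.
// The HTTP-date form is not handled here and yields null.
function retryAfterMs(header: string | null): number | null {
  if (header === null) return null;
  const trimmed = header.trim();
  return /^\d+$/.test(trimmed) ? Number(trimmed) * 1000 : null;
}

// Prefers the partner's Retry-After hint over the computed backoff.
function nextDelayMs(
  attempt: number,
  retryAfterHeader: string | null,
  random: () => number = Math.random,
): number {
  return retryAfterMs(retryAfterHeader) ?? backoffDelayMs(attempt, 200, 10_000, random);
}
```

Jitter spreads retries from many clients over time so that they do not all hit the partner again at the same instant, and the cap keeps late attempts from waiting unboundedly long.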
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High Compliance / Regulated Data | Federated Gateway + Strict PII Redaction | Ensures data never leaves control; policy enforcement at edge. | High (Infra + Redaction costs) |
| Low Latency / Real-Time UX | Direct Client Integration or Edge Proxy | Minimizes network hops; reduces latency overhead. | Low (Network costs) |
| Multi-Model Routing / A/B Testing | Centralized Proxy with Routing Config | Enables dynamic model switching without client updates. | Medium (Proxy infra) |
| Cost-Sensitive / High Volume | Co-Hosted Inference or Reserved Capacity | Fixed compute costs; avoids token-based variability. | Low-Medium (Fixed infra) |
| Rapid Prototyping / MVP | Direct Client Integration | Fastest implementation; minimal overhead. | Low |
| Vendor Diversification Strategy | Federated Gateway + Adapter Pattern | Prevents lock-in; enables seamless partner swaps. | High (Initial dev cost) |
Configuration Template
# gateway-config.yaml
gateway:
  port: 8080
  tracing:
    enabled: true
    exporter: otel
  metrics:
    enabled: true
    endpoint: /metrics
partners:
  - id: model-provider-a
    adapter: openai
    endpoint: ${PROVIDER_A_ENDPOINT}
    api_key: ${PROVIDER_A_KEY}
    rate_limit: 2000
    timeout_ms: 5000
    retry:
      max_attempts: 3
      backoff: exponential
    pii:
      enabled: true
      entities: [EMAIL, PHONE, SSN]
      action: mask
routing:
  models:
    - name: chat-assistant
      adapter: model-provider-a
      fallback: [model-provider-b]
      schema: chat_response_v1.json
      cost_per_token: 0.00002
    - name: code-completion
      adapter: model-provider-c
      fallback: []
      schema: code_response_v1.json
      cost_per_token: 0.00001
quotas:
  default:
    tokens_per_month: 1000000
    requests_per_minute: 60
  tiers:
    enterprise:
      tokens_per_month: 10000000
      requests_per_minute: 500
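The template uses `${...}` placeholders for the endpoint and API key. Whether these are expanded depends on the config loader; if yours does not do it, a small interpolation step can run on each string value after parsing. `interpolateEnv` below is a hypothetical helper, and the example URL is a placeholder.

```typescript
// Replaces ${NAME} placeholders using the supplied environment map.
// Throws on a missing variable so misconfiguration fails at startup
// instead of producing a request to an empty endpoint.
function interpolateEnv(
  value: string,
  env: Record<string, string | undefined>,
): string {
  return value.replace(/\$\{([A-Z0-9_]+)\}/g, (_match, varName: string) => {
    const resolved = env[varName];
    if (resolved === undefined) {
      throw new Error(`Missing environment variable: ${varName}`);
    }
    return resolved;
  });
}

// Single-quoted strings keep the placeholder literal in TypeScript.
const endpoint = interpolateEnv('${PROVIDER_A_ENDPOINT}/chat', {
  PROVIDER_A_ENDPOINT: 'https://example.invalid/v1',
});
```

In a Node.js process the second argument would typically be `process.env`. Passing the map explicitly keeps the helper easy to test and avoids reading global state inside it.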
Quick Start Guide
1. Initialize Gateway Project:
   npm init -y
   npm install express opossum zod
2. Create Adapter Skeleton: Create src/adapters/BaseAdapter.ts with the ModelAdapter interface and a mock implementation to test routing.
3. Configure Gateway: Set up gateway-config.yaml with a mock partner. Configure the PartnershipGateway to load the config and register adapters.
4. Implement PII Redaction: Integrate a library like pii-redactor or a custom NER service. Wire it into the gateway request pipeline.
5. Deploy and Test: Start the gateway server. Send test requests with PII data. Verify redaction, metering logs, and circuit breaker behavior. Validate that fallback routing triggers on simulated partner failure.
This architecture provides a scalable, secure, and maintainable foundation for AI partnerships. By enforcing abstraction, policy, and observability, engineering teams can productize AI integrations with the same rigor as core infrastructure, mitigating risk and enabling rapid innovation.