an SDK for fine-grained control over feature capture and PII redaction. Sidecars are useful for traffic mirroring but lack context about model schema.
2. Baseline Storage: Store reference distributions in a feature store or dedicated vector database. Baselines must be versioned per model release.
3. Compute Offloading: Drift detection calculations (e.g., Population Stability Index, KL Divergence) should be computed asynchronously on streaming data or batch windows to avoid impacting inference latency.
4. Ground Truth Pipeline: Accuracy monitoring requires ground truth. Design a pipeline to ingest delayed labels (e.g., chargeback data arriving 30 days post-transaction) and reconcile them with predictions.
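As a rough illustration of the kind of calculation item 3 suggests keeping off the request path, the sketch below computes KL divergence between two distributions that have already been binned. The function name `klDivergence` and the smoothing constant `epsilon` are assumptions made for this example, not part of any particular library.

```typescript
// Hypothetical helper: KL divergence D(P || Q) between two binned distributions.
// `p` and `q` hold bin proportions over the same bins, each summing to roughly 1.
function klDivergence(p: number[], q: number[], epsilon = 1e-6): number {
  if (p.length !== q.length) {
    throw new Error('Distributions must use the same number of bins');
  }
  let divergence = 0;
  for (let i = 0; i < p.length; i++) {
    // Floor each proportion at epsilon so empty bins do not produce log(0) or division by zero
    const pi = Math.max(p[i] ?? 0, epsilon);
    const qi = Math.max(q[i] ?? 0, epsilon);
    divergence += pi * Math.log(pi / qi);
  }
  return divergence;
}

// Example: current traffic (P) compared against the stored baseline (Q)
const baseline = [0.25, 0.25, 0.25, 0.25];
const current = [0.1, 0.2, 0.3, 0.4];
const kl = klDivergence(current, baseline);
console.log(kl.toFixed(4));
```

Because the function only needs bin proportions, it can run in a batch job or stream processor that reads logged telemetry, rather than inside the inference handler.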
Step-by-Step Implementation
1. Instrumentation and Data Capture
Wrap the model inference endpoint to capture inputs, outputs, and metadata. Ensure PII is redacted before logging.
import { createHash, randomUUID } from 'crypto';

interface InferenceRequest {
  userId: string;
  features: Record<string, number | string>;
  timestamp: number;
}

interface InferenceResponse {
  prediction: number;
  confidence: number;
  modelVersion: string;
}

// Minimal logger contract; adapt to your logging infrastructure
interface Logger {
  info(message: string, payload?: unknown): void;
  error(message: string, payload?: unknown): void;
}

// Placeholder type: BaselineManager is not defined in this guide, so supply your own
interface BaselineManager {}

export class AIObserver {
  private logger: Logger;
  private baselineManager: BaselineManager;

  constructor(logger: Logger, baselineManager: BaselineManager) {
    this.logger = logger;
    this.baselineManager = baselineManager;
  }

  async observe<T extends InferenceRequest, U extends InferenceResponse>(
    request: T,
    response: U,
    executionTimeMs: number
  ): Promise<void> {
    // 1. Redact PII
    const sanitizedRequest = this.redactPII(request);

    // 2. Capture telemetry
    const telemetry = {
      requestId: randomUUID(),
      modelVersion: response.modelVersion,
      timestamp: Date.now(),
      executionTime: executionTimeMs,
      inputFeatures: sanitizedRequest.features,
      output: response.prediction,
      confidence: response.confidence,
    };

    // 3. Async drift analysis (non-blocking): not awaited, so failures are only logged
    this.analyzeDrift(telemetry).catch(err =>
      this.logger.error('Drift analysis failed', err)
    );

    // 4. Log to observability backend
    this.logger.info('inference_telemetry', telemetry);
  }

  // Pseudonymizes userId with a truncated SHA-256 hash. Values inside `features`
  // pass through unchanged, so redact them separately if they can hold PII.
  private redactPII(req: InferenceRequest): InferenceRequest {
    return {
      ...req,
      userId: createHash('sha256').update(req.userId).digest('hex').substring(0, 8),
    };
  }

  private async analyzeDrift(telemetry: Record<string, unknown>): Promise<void> {
    // Implementation details in next section
  }
}
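The `redactPII` method above only pseudonymizes `userId`; anything inside `features` is logged as-is. One possible way to tighten this is an allowlist, sketched below. The names `ALLOWED_FEATURES` and `redactFeatures`, along with the example feature keys, are hypothetical and chosen only for illustration.

```typescript
type FeatureMap = Record<string, number | string>;

// Hypothetical allowlist: only these feature keys are considered safe to log
const ALLOWED_FEATURES: ReadonlySet<string> = new Set<string>([
  'account_age_days',
  'login_frequency',
  'plan_tier',
]);

// Returns a copy containing only allowlisted keys. Everything else is dropped,
// so a field newly added upstream is never logged by accident.
function redactFeatures(
  features: FeatureMap,
  allowed: ReadonlySet<string> = ALLOWED_FEATURES
): FeatureMap {
  const safe: FeatureMap = {};
  for (const [key, value] of Object.entries(features)) {
    if (allowed.has(key)) {
      safe[key] = value;
    }
  }
  return safe;
}

const logged = redactFeatures({
  account_age_days: 412,
  plan_tier: 'pro',
  email: 'jane@example.com', // not allowlisted, so it is dropped
});
console.log(Object.keys(logged));
```

An allowlist fails closed: forgetting to register a feature loses some telemetry, whereas forgetting to register a field in a denylist leaks it.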
2. Drift Detection Algorithms
Implement statistical tests to compare current input distributions against baselines. The Population Stability Index (PSI) is a widely used metric for feature-level drift.
export class DriftDetector {
  /**
   * Calculates Population Stability Index (PSI)
   * PSI < 0.1: No significant change
   * 0.1 <= PSI < 0.2: Moderate change
   * PSI >= 0.2: Significant change (Alert)
   */
  static calculatePSI(
    referenceDistribution: number[],
    currentDistribution: number[],
    bins: number = 10
  ): number {
    if (referenceDistribution.length === 0 || currentDistribution.length === 0) {
      throw new Error('PSI requires non-empty reference and current samples');
    }

    // Derive bin edges from the reference sample only, so both samples share the same bins.
    // A loop avoids the call-stack limits of Math.min(...data) on large arrays.
    let min = Infinity;
    let max = -Infinity;
    for (const val of referenceDistribution) {
      if (val < min) min = val;
      if (val > max) max = val;
    }

    const expectedPercentages = this.computeBinPercentages(referenceDistribution, bins, min, max);
    const actualPercentages = this.computeBinPercentages(currentDistribution, bins, min, max);

    let psi = 0;
    for (let i = 0; i < bins; i++) {
      // Floor at 1e-6 so empty bins do not produce log(0) or division by zero
      const expected = Math.max(expectedPercentages[i], 1e-6);
      const actual = Math.max(actualPercentages[i], 1e-6);
      psi += (actual - expected) * Math.log(actual / expected);
    }
    return psi;
  }

  private static computeBinPercentages(
    data: number[],
    bins: number,
    min: number,
    max: number
  ): number[] {
    const range = max - min || 1;
    const binWidth = range / bins;
    const counts = new Array(bins).fill(0);
    data.forEach(val => {
      // Clamp so values outside the reference range land in the first or last bin
      const rawIndex = Math.floor((val - min) / binWidth);
      const binIndex = Math.min(Math.max(rawIndex, 0), bins - 1);
      counts[binIndex]++;
    });
    return counts.map(c => c / data.length);
  }
}
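To make the thresholds in the `calculatePSI` comment concrete, here is a small worked sketch that computes PSI directly from pre-binned proportions and maps the result to the three bands. `psiFromProportions` and `classifyPSI` are illustrative names and are not part of `DriftDetector`.

```typescript
type DriftLevel = 'none' | 'moderate' | 'significant';

// PSI over proportions that already share the same bins
function psiFromProportions(expected: number[], actual: number[], epsilon = 1e-6): number {
  if (expected.length !== actual.length) {
    throw new Error('Both distributions must use the same bins');
  }
  let psi = 0;
  for (let i = 0; i < expected.length; i++) {
    // Floor at epsilon so empty bins do not produce log(0) or division by zero
    const e = Math.max(expected[i] ?? 0, epsilon);
    const a = Math.max(actual[i] ?? 0, epsilon);
    psi += (a - e) * Math.log(a / e);
  }
  return psi;
}

// Bands taken from the thresholds documented on calculatePSI
function classifyPSI(psi: number): DriftLevel {
  if (psi < 0.1) return 'none';
  if (psi < 0.2) return 'moderate';
  return 'significant';
}

const reference = [0.25, 0.25, 0.25, 0.25];
const shifted = [0.1, 0.2, 0.3, 0.4];
const psi = psiFromProportions(reference, shifted);
console.log(psi.toFixed(3), classifyPSI(psi));
```

For this pair the PSI comes out at approximately 0.23, which falls in the "significant" band, while comparing `reference` with itself yields 0.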
3. Concept Drift and Accuracy Monitoring
Input drift does not always impact performance. Concept drift occurs when the relationship between inputs and outputs changes. Monitor accuracy using a rolling window of ground truth data.
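export class AccuracyMonitor {
  // Predictions awaiting ground truth, keyed by prediction ID
  private groundTruthQueue: Map<string, { prediction: number; actual?: number; timestamp: number }>;

  constructor() {
    this.groundTruthQueue = new Map();
  }

  // Called at inference time; without this the queue stays empty and reconcile() never matches
  track(predictionId: string, prediction: number): void {
    this.groundTruthQueue.set(predictionId, { prediction, timestamp: Date.now() });
  }

  // Called when ground truth arrives (e.g., via webhook or batch job)
  reconcile(predictionId: string, actualValue: number): void {
    const record = this.groundTruthQueue.get(predictionId);
    if (record) {
      // Remove the entry once reconciled so the queue does not grow without bound
      this.groundTruthQueue.delete(predictionId);
      this.evaluateDrift({ ...record, actual: actualValue });
    }
  }

  private evaluateDrift(record: { prediction: number; actual: number; timestamp: number }): void {
    const error = Math.abs(record.prediction - record.actual);
    // Push to time-series DB for trend analysis
    // Alert if rolling average error exceeds threshold
  }
}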
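The "rolling average error" mentioned in the comments above could be tracked with a fixed-size window, as in this sketch. `RollingErrorWindow`, its window size, and the threshold are hypothetical values chosen for illustration; a production setup would more likely push each error to a time-series database and alert from there.

```typescript
// Hypothetical helper: mean absolute error over the most recent `size` reconciled predictions
class RollingErrorWindow {
  private errors: number[] = [];
  private sum = 0;
  private readonly size: number;
  private readonly threshold: number;

  constructor(size: number, threshold: number) {
    if (size <= 0) throw new Error('Window size must be positive');
    this.size = size;
    this.threshold = threshold;
  }

  // Records one reconciled prediction; returns true when the rolling mean exceeds the threshold
  add(prediction: number, actual: number): boolean {
    const error = Math.abs(prediction - actual);
    this.errors.push(error);
    this.sum += error;
    if (this.errors.length > this.size) {
      // Evict the oldest error so the window only reflects recent traffic
      this.sum -= this.errors.shift() ?? 0;
    }
    return this.mean() > this.threshold;
  }

  mean(): number {
    return this.errors.length === 0 ? 0 : this.sum / this.errors.length;
  }
}

const monitor = new RollingErrorWindow(3, 0.3);
monitor.add(0.9, 1); // error of about 0.1
monitor.add(0.8, 1); // error of about 0.2
const breached = monitor.add(0.1, 1); // error 0.9 pushes the mean to about 0.4
console.log(breached, monitor.mean().toFixed(2));
```

A count-based window keeps the example short; a time-based window (for example, the last 24 hours) would match the alert conditions used later in this guide more closely.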
4. Alerting and Remediation
Configure alerts based on composite conditions. Avoid alerting on single feature drift; alert on aggregate drift or performance degradation.
- Critical: PSI > 0.25 on critical features OR Accuracy drop > 15% over 24h.
- Warning: PSI > 0.15 OR Confidence score distribution shift > 20%.
- Action: Trigger Slack/PagerDuty notification. Optionally invoke a Lambda function to switch traffic to a shadow model or fallback rule.
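The conditions above can be encoded as a small rule function. This is a sketch only: the `DriftSnapshot` shape and its field names are assumptions, and the thresholds simply mirror the bullets.

```typescript
type Severity = 'critical' | 'warning' | 'ok';

// Hypothetical snapshot of monitoring metrics for one evaluation window
interface DriftSnapshot {
  maxCriticalFeaturePsi: number; // highest PSI among features flagged as critical
  maxPsi: number;                // highest PSI across all features
  accuracyDrop24h: number;       // drop over 24h as a fraction, e.g. 0.15 = 15%
  confidenceShift: number;       // shift in the confidence distribution, e.g. 0.2 = 20%
}

function evaluateSeverity(s: DriftSnapshot): Severity {
  // Critical: PSI > 0.25 on critical features OR accuracy drop > 15% over 24h
  if (s.maxCriticalFeaturePsi > 0.25 || s.accuracyDrop24h > 0.15) return 'critical';
  // Warning: PSI > 0.15 OR confidence score distribution shift > 20%
  if (s.maxPsi > 0.15 || s.confidenceShift > 0.2) return 'warning';
  return 'ok';
}

const severity = evaluateSeverity({
  maxCriticalFeaturePsi: 0.12,
  maxPsi: 0.18,
  accuracyDrop24h: 0.04,
  confidenceShift: 0.05,
});
console.log(severity); // "warning"
```

Checking the critical rule first means a window that satisfies both rules is reported once, at the higher severity.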
Pitfall Guide
- Monitoring Infrastructure Only: Relying solely on latency and error rates is the most common failure. A model can return "0" for every request with 200 OK status. Always monitor prediction distributions and confidence scores.
- Ignoring Schema Evolution: Data pipelines change. A new feature added upstream or a type change in an existing feature can break the model silently. Implement schema validation checks on every inference request.
- The Ground Truth Gap: Accuracy cannot be monitored without labels. If labels arrive with a 30-day delay, you cannot detect concept drift in real-time. Design hybrid monitoring: use proxy metrics (e.g., user engagement, click-through rates) for immediate feedback while waiting for ground truth.
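- Alert Fatigue from Sensitive Thresholds: Setting drift thresholds too low generates noise. Features naturally fluctuate. Calibrate thresholds against each feature's historical variability, and pair statistical tests (e.g., the Kolmogorov-Smirnov test) with a minimum effect size, because at high sample volumes a significance test alone will flag shifts too small to matter.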
- Treating All Features Equally: Not all features contribute equally to the model's decision. Drift in a low-importance feature may be harmless. Weight drift alerts by feature importance scores derived from the model (SHAP values or coefficients).
- Privacy Violations in Logging: AI monitoring requires capturing input data, which often contains PII. Logging sensitive information without redaction can violate regulations such as GDPR and CCPA. Implement automated redaction pipelines at the edge.
- Feedback Loop Latency: If the monitoring system processes data with high latency, alerts arrive too late to mitigate damage. Ensure the monitoring pipeline processes data within minutes, not hours. Use streaming architectures (Kafka/Kinesis) for real-time drift detection.
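For the schema evolution pitfall, a lightweight per-request check might look like the following. The `FeatureSchema` type, the `validateFeatures` function, and the example schema are hypothetical; a real deployment might prefer a schema library or the feature store's own contracts.

```typescript
type FeatureType = 'number' | 'string';
type FeatureSchema = Record<string, FeatureType>;

// Own-property check that ignores keys inherited from Object.prototype
const hasKey = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Returns human-readable violations; an empty list means the request matches the schema
function validateFeatures(features: Record<string, unknown>, schema: FeatureSchema): string[] {
  const violations: string[] = [];
  for (const [name, expectedType] of Object.entries(schema)) {
    const value = features[name];
    if (!hasKey(features, name)) {
      violations.push(`missing feature: ${name}`);
    } else if (typeof value !== expectedType) {
      violations.push(`type mismatch for ${name}: expected ${expectedType}, got ${typeof value}`);
    } else if (typeof value === 'number' && !Number.isFinite(value)) {
      violations.push(`non-finite value for ${name}`);
    }
  }
  for (const name of Object.keys(features)) {
    // Unknown features often signal an upstream pipeline change
    if (!hasKey(schema, name)) violations.push(`unexpected feature: ${name}`);
  }
  return violations;
}

const schema: FeatureSchema = { user_age: 'number', plan_tier: 'string' };
const problems = validateFeatures({ user_age: '42', region: 'eu' }, schema);
console.log(problems);
```

Here the request is flagged three times: `user_age` arrives as a string, `plan_tier` is missing, and `region` is not in the schema.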
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Real-time Fraud Detection | Streaming Drift Detection + Low-Latency Alerts | Fraud patterns evolve rapidly; delayed detection causes direct financial loss. | High (Requires streaming infra, but ROI justifies cost via loss prevention). |
| Batch Recommendation Engine | Hourly Batch PSI + Daily Accuracy Review | User preferences shift slowly; real-time monitoring adds unnecessary complexity. | Low (Batch processing reuses existing data pipeline resources). |
| High-Volume Low-Risk Model | Sampling-Based Monitoring | Full monitoring of millions of requests is cost-prohibitive. Statistical sampling maintains confidence. | Medium (A 10% sample, for example, cuts storage/compute costs by roughly 90%; detection capability holds as long as the sample stays representative). |
| Regulated Domain (Healthcare) | Strict Schema Validation + Ground Truth Audit | Compliance requires traceability and validation of every prediction against labeled data. | High (Requires rigorous audit trails and human-in-the-loop validation). |
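For the sampling-based row, one common approach is deterministic hash-based sampling, so that a given request ID is always either in or out of the sample. The sketch below uses a 32-bit FNV-1a hash for this; the function names and the 10% rate are assumptions for illustration.

```typescript
// 32-bit FNV-1a hash of a string. It hashes UTF-16 code units, which matches
// byte-wise FNV-1a for ASCII input and is adequate for sampling decisions.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, wrapping at 32 bits
  }
  return hash >>> 0; // interpret the result as unsigned
}

// Returns true when the request should be monitored at the given sample rate (0..1)
function shouldSample(requestId: string, sampleRate: number): boolean {
  return fnv1a(requestId) / 0x100000000 < sampleRate;
}

// Count how many of 10,000 synthetic request IDs fall into a 10% sample
let sampled = 0;
for (let i = 0; i < 10000; i++) {
  if (shouldSample(`req-${i}`, 0.1)) sampled++;
}
console.log(`sampled ${sampled} of 10000`);
```

Because the decision depends only on the ID, the inference log and the delayed ground-truth record for the same request are either both sampled or both skipped, which keeps reconciliation possible.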
Configuration Template
Use this YAML configuration to define monitoring parameters for a model deployment. This template can be integrated into your CI/CD pipeline to enforce monitoring standards.
model_monitoring:
  model_id: "churn-prediction-v2"
  version: "2.4.1"

  baselines:
    source: "feature_store"
    reference_dataset: "production_snapshot_2023_q4"
    update_frequency: "monthly"

  drift_detection:
    algorithms:
      - name: "PSI"
        target: "input_features"
        threshold: 0.20
        window: "24h"
      - name: "KS_TEST"
        target: "predictions"
        threshold: 0.05
        window: "1h"
    feature_weights:
      auto: true # Derive from model SHAP values
      overrides:
        "user_age": 0.8
        "login_frequency": 0.9

  accuracy_monitoring:
    ground_truth_delay: "7d"
    proxy_metrics:
      - name: "user_retention_rate"
        correlation_threshold: 0.7
    alert_conditions:
      - metric: "rolling_accuracy"
        operator: "lt"
        value: 0.85
        duration: "4h"

  alerting:
    channels:
      - type: "pagerduty"
        severity: "critical"
        conditions: ["psi_critical", "accuracy_drop"]
      - type: "slack"
        channel: "#ml-ops"
        conditions: ["psi_warning"]

  remediation:
    auto_rollback:
      enabled: true
      trigger: "psi_critical"
      fallback_model: "churn-prediction-v2.3"
    shadow_mode:
      enabled: true
      duration: "48h"
      trigger: "accuracy_drop"
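One way the `feature_weights` section could be consumed is a weighted aggregate drift score, so that drift in low-importance features contributes less to an alert. This sketch is an assumption about how such a configuration might be applied, not the behavior of any specific tool; `weightedDriftScore`, the default weight, and the `browser_version` feature are hypothetical.

```typescript
// Weighted mean of per-feature PSI values.
// Features without an explicit weight fall back to `defaultWeight`.
function weightedDriftScore(
  psiByFeature: Record<string, number>,
  weights: Record<string, number>,
  defaultWeight = 0.5
): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const [feature, psi] of Object.entries(psiByFeature)) {
    const weight = weights[feature] ?? defaultWeight;
    weightedSum += weight * psi;
    totalWeight += weight;
  }
  // With no features (or all-zero weights) there is nothing to report
  return totalWeight === 0 ? 0 : weightedSum / totalWeight;
}

const score = weightedDriftScore(
  { user_age: 0.3, login_frequency: 0.1, browser_version: 0.4 },
  // user_age and login_frequency use the override values from the template;
  // the browser_version weight is an assumed low-importance value
  { user_age: 0.8, login_frequency: 0.9, browser_version: 0.1 }
);
console.log(score.toFixed(3));
```

In this example the largest raw PSI belongs to `browser_version`, but its low weight keeps the aggregate score close to the drift seen in the two high-importance features.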
Quick Start Guide
Get AI monitoring running in under 5 minutes using the @codcompass/ai-observer SDK.
1. Install the SDK:

       npm install @codcompass/ai-observer

2. Initialize the Observer:
   Wrap your model inference function with the observer. Pass your configuration and logger.

       import { AIObserver } from '@codcompass/ai-observer';

       const observer = new AIObserver({
         apiKey: process.env.CODCOMPASS_API_KEY,
         modelId: 'my-model-v1',
         redactPII: true,
       });

3. Instrument Inference:
   Call observe after your model returns a prediction. The SDK handles drift calculation and telemetry asynchronously.

       // Assuming the SDK's observe() matches the AIObserver class shown earlier,
       // the third argument is the execution time in milliseconds, not a timestamp.
       const startedAt = Date.now();
       const result = await myModel.predict(input);
       await observer.observe(input, result, Date.now() - startedAt);

4. Set Baseline:
   Run the baseline command to establish reference distributions using your training or recent production data.

       npx ai-observer baseline --data-path ./data/baseline.csv --output ./config/baseline.json

5. Verify Dashboard:
   Log in to the Codcompass dashboard to view real-time drift charts, accuracy trends, and alert status. Adjust thresholds based on initial observations.
By implementing this structured approach, engineering teams transform AI systems from opaque liabilities into observable, manageable assets. Monitoring AI is not optional; it is the foundation of reliable, scalable machine learning in production.