itoring (Error/Uptime) | 168 hours | 2% | $0 | Low |
| PSI-Only Detection | 48 hours | 8% | $150,000 | Low |
| KS Test + Distribution Analysis | 12 hours | 3% | $320,000 | Medium |
| Integrated Drift Pipeline (Evidently AI) | 4 hours | 2% | $480,000 | Medium |
Key Findings:
- PSI provides rapid, business-friendly thresholding but lacks sensitivity to subtle distributional changes in high-dimensional spaces.
- KS tests deliver statistically rigorous p-value validation (< 0.05) and catch earlier shifts, but require minimum sample sizes (~1,000+ records) to avoid Type II errors.
- The sweet spot for production is a hybrid pipeline: PSI for business reporting and alerting, KS for statistical validation, and distribution plots for root-cause debugging. Automated logging to JSONL enables near real-time drift scoring without blocking inference latency.
Core Solution
Implementing a production-grade drift detection system requires three layers: statistical scoring, tooling selection, and inference-side logging.
Statistical Foundation
- PSI (Population Stability Index): Measures distribution shift between reference and current samples. Thresholds: < 0.1 (stable), 0.1–0.25 (investigate), > 0.25 (retrain). Fast, interpretable, industry standard.
- KS Test (Kolmogorov-Smirnov): Non-parametric test comparing cumulative distributions. Returns a p-value; < 0.05 indicates a statistically significant difference. Superior for smaller samples and continuous features.
- Distribution Plots: Visual validation of mean/variance shifts, multimodal emergence, and feature-level attribution. Essential for stakeholder communication and debugging.
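To make the two scores above concrete, here is a minimal sketch of how PSI and the KS test could be computed for a single numeric feature. It is an illustration, not the article's or Evidently's implementation: the function names `psi` and `drift_summary`, the choice of 10 quantile bins, and the `eps` clipping value are all assumptions. The KS test uses `scipy.stats.ks_2samp`.

```python
import numpy as np
from scipy.stats import ks_2samp


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two 1-D numeric samples.

    Assumption: bins are reference quantiles (a common but not universal choice).
    """
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Quantile edges give each reference bin roughly equal mass.
    edges = np.unique(np.quantile(reference, np.linspace(0.0, 1.0, bins + 1)))
    inner = edges[1:-1]  # values outside the reference range fall into the outer bins
    n_bins = max(len(edges) - 1, 1)
    ref_idx = np.searchsorted(inner, reference, side="right")
    cur_idx = np.searchsorted(inner, current, side="right")
    ref_frac = np.bincount(ref_idx, minlength=n_bins) / len(reference)
    cur_frac = np.bincount(cur_idx, minlength=n_bins) / len(current)
    # Clip empty bins so the log and the ratio stay finite.
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


def drift_summary(reference, current, alpha=0.05):
    """Combine PSI and the two-sample KS test for one feature."""
    ks = ks_2samp(reference, current)
    return {
        "psi": psi(reference, current),
        "ks_statistic": float(ks.statistic),
        "ks_pvalue": float(ks.pvalue),
        "ks_drift": bool(ks.pvalue < alpha),
    }
```

Calling `drift_summary(reference_values, current_values)` on one column at a time returns both scores side by side, so the PSI value can feed business alerting while the KS p-value serves as the statistical check.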
| Tool | Type | Best For |
|---|---|---|
| Evidently AI | Open source | Self-hosted drift reports, full customization |
| WhyLabs | SaaS (free tier) | Teams without dedicated ML infra |
| Prometheus + Grafana | Infrastructure | Drift as time-series metrics, custom alerting |
| MLflow | Open source | Teams already using MLflow for experiment tracking |
Architecture Decision: Evidently AI is selected for this implementation due to zero licensing cost, local execution, automated PSI/KS generation, and HTML report output.
Implementation Steps
Step 1: Install Evidently
pip install evidently
Step 2: Log predictions from your FastAPI endpoint
Add prediction logging to your /predict endpoint. Every prediction gets stored to a JSONL file for later drift analysis.
# Add this to your FastAPI /predict endpoint
import json
from datetime import datetime, timezone

def log_prediction(features: dict, prediction: int, probability: float):
    # One record per served prediction, timestamped in UTC
    # (datetime.utcnow() is deprecated as of Python 3.12)
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "features": features,
        "prediction": prediction,
        "probability": probability
    }
    # Append mode leaves previously written records untouched
    with open("predictions.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")
Call log_prediction() inside your endpoint every time you serve a result. The JSONL format appends one JSON object per line, which makes it cheap to write, crash-tolerant (at worst the final line is truncated), and trivial to read back.
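Reading the log back for batch scoring can be sketched as follows. This is a minimal, standard-library-only example; the function name `load_predictions` is hypothetical, and it assumes every record has the `timestamp`, `features`, `prediction`, and `probability` keys written by `log_prediction()` above, with no feature named after one of those keys.

```python
import json


def load_predictions(path="predictions.jsonl"):
    """Read the prediction log back as a list of flat dicts, one per prediction."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                # Skip malformed lines, such as a final line cut off by a crash.
                continue
            # Lift the nested feature dict to top-level keys (one key per column).
            row = dict(entry["features"])
            row["timestamp"] = entry["timestamp"]
            row["prediction"] = entry["prediction"]
            row["probability"] = entry["probability"]
            rows.append(row)
    return rows
```

The returned list can be passed directly to `pandas.DataFrame(rows)` to get a table with the same shape as the reference data.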
Step 3: Load your reference (training) distribution
import pandas as pd
from evidently import ColumnMapping
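# Load the same data the model was trained on
reference_data = pd.read_csv("data/training_features.csv")
# training_predictions and training_probabilities are assumed to hold the
# model's outputs on the training set, computed beforehand
reference_data["prediction"] = training_predictions
reference_data["probability"] = training_probabilities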
Pitfall Guide
- Static Threshold Rigidity: Applying fixed PSI thresholds (e.g., 0.25) across all features ignores business context. Critical features (e.g., fraud score, pricing) require tighter bounds (< 0.1), while auxiliary features can tolerate higher drift. Calibrate thresholds per feature according to business impact.
- Small Sample Size Fallacy: Running KS tests or PSI on rolling windows with < 1,000 samples produces unstable scores: sparsely populated bins inflate PSI (false positives), while the KS test loses power and misses real shifts. Implement minimum batch sizes or exponential smoothing before scoring.
- Raw Feature vs. Engineered Feature Mismatch: Monitoring only raw inputs misses drift introduced during preprocessing, feature engineering, or normalization pipelines. Always score drift on the exact tensors/features fed into the model.
- Logging Without Schema Alignment: JSONL logs that drift in schema (added/removed columns, type changes) break Evidently's column mapping. Enforce strict schema validation at ingestion and version-control your reference dataset.
- Detection Without Automated Remediation: Flagging drift without triggering retraining pipelines, model rollbacks, or canary deployments creates operational bottlenecks. Pair detection with CI/CD hooks or MLOps orchestrators (Airflow, Kubeflow, MLflow).
- Ignoring Prediction Drift: Focusing solely on input features delays detection when concept drift occurs. Monitor output distributions and prediction probabilities as leading indicators to catch upstream relationship shifts before labels arrive.
- Compliance Blind Spots: In regulated domains (finance, healthcare), drift logs must be immutable, timestamped, and auditable. Standard JSONL files lack cryptographic integrity. Use append-only storage with checksums or integrate with audit-compliant data lakes.
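The first two pitfalls (one-size-fits-all thresholds and undersized windows) can be guarded against with a small classification helper. The sketch below is illustrative only: the function name `classify_drift`, the `fraud_score` feature, and its override values are hypothetical, while the default bands and the 1,000-sample minimum follow the figures quoted above.

```python
MIN_SAMPLES = 1000  # minimum window size before a score is trusted

# (investigate, retrain) PSI bounds used when a feature has no override
DEFAULT_THRESHOLDS = (0.1, 0.25)

# Hypothetical per-feature overrides: critical features get tighter bounds
FEATURE_THRESHOLDS = {
    "fraud_score": (0.05, 0.1),
}


def classify_drift(feature, psi_value, n_samples,
                   overrides=FEATURE_THRESHOLDS, min_samples=MIN_SAMPLES):
    """Map a PSI score to an action, refusing to score undersized windows."""
    if n_samples < min_samples:
        return "insufficient_data"
    investigate, retrain = overrides.get(feature, DEFAULT_THRESHOLDS)
    if psi_value > retrain:
        return "retrain"
    if psi_value >= investigate:
        return "investigate"
    return "stable"
```

With these values, a PSI of 0.12 triggers `"retrain"` for `fraud_score` but only `"investigate"` for a feature on the default bands, and any window under 1,000 rows returns `"insufficient_data"` instead of a potentially misleading label.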
Deliverables
- Drift Detection Blueprint: Architecture diagram covering inference logging → batch scoring → statistical validation → alerting → retraining triggers. Includes data flow for FastAPI + JSONL + Evidently AI + Prometheus.
- Production Readiness Checklist: 24-point validation covering reference data versioning, schema enforcement, threshold calibration, sample size requirements, alert routing, rollback procedures, and compliance logging.
- Configuration Templates: Ready-to-deploy Evidently AI JSON profiles (PSI/KS thresholds per feature type), FastAPI middleware snippet for non-blocking JSONL logging, and Prometheus alert rules for MTTD optimization.