ns
- Strict Schema Contract: Every feature vector must declare its dimensionality, type, and expected transformation. This prevents silent misalignment between training and inference.
- Fit/Transform Separation: Scaling parameters (mean, std, min, max) must be computed exclusively on training data and then reused unchanged at inference. Computing them over the full dataset, or recomputing them on inference data, causes data leakage and train/inference skew.
- Pipeline Composition: Transformations are chained as pure functions with explicit state management. This enables reproducible preprocessing and simplifies versioning.
- Unknown Category Handling: Categorical encoders must include a fallback mechanism for unseen values during inference, preventing runtime crashes.
Implementation
// Feature schema definition
interface FeatureConfig {
  name: string;
  type: 'numerical' | 'categorical';
  scalingMethod?: 'minmax' | 'zscore' | 'none';
  categories?: string[];
  fallbackIndex?: number;
}

// Pipeline state for fitted parameters
interface PipelineState {
  numerical: Record<string, { min: number; max: number; mean: number; std: number }>;
  categorical: Record<string, Map<string, number>>;
}

class FeaturePipeline {
  private schema: FeatureConfig[];
  private state: PipelineState;

  constructor(schema: FeatureConfig[]) {
    this.schema = schema;
    this.state = { numerical: {}, categorical: {} };
  }

  // Compute statistics exclusively from training data
  fit(data: Record<string, unknown>[]): void {
    this.schema.forEach((config) => {
      if (config.type === 'numerical') {
        const values = data.map((row) => Number(row[config.name])).filter((v) => !isNaN(v));
        if (values.length === 0) {
          throw new Error(`No numeric training values for feature: ${config.name}`);
        }
        // reduce instead of Math.min(...values): spreading a very large array can exceed the argument limit
        const min = values.reduce((a, b) => Math.min(a, b), Infinity);
        const max = values.reduce((a, b) => Math.max(a, b), -Infinity);
        const mean = values.reduce((a, b) => a + b, 0) / values.length;
        // Population standard deviation (divides by N)
        const std = Math.sqrt(values.reduce((sq, n) => sq + Math.pow(n - mean, 2), 0) / values.length);
        this.state.numerical[config.name] = { min, max, mean, std };
      } else if (config.type === 'categorical' && config.categories) {
        const categoryMap = new Map<string, number>();
        config.categories.forEach((cat, idx) => categoryMap.set(cat, idx));
        this.state.categorical[config.name] = categoryMap;
      }
    });
  }

  // Transform raw input into aligned feature vector
  transform(rawInput: Record<string, unknown>): number[] {
    const vector: number[] = [];
    this.schema.forEach((config) => {
      const value = rawInput[config.name];
      if (config.type === 'numerical') {
        const numVal = Number(value);
        if (isNaN(numVal)) {
          vector.push(0); // Handle missing values explicitly
          return;
        }
        const stats = this.state.numerical[config.name];
        if (!stats) throw new Error(`Missing fit state for numerical feature: ${config.name}`);
        if (config.scalingMethod === 'minmax') {
          const range = stats.max - stats.min;
          vector.push(range === 0 ? 0 : (numVal - stats.min) / range);
        } else if (config.scalingMethod === 'zscore') {
          vector.push(stats.std === 0 ? 0 : (numVal - stats.mean) / stats.std);
        } else {
          vector.push(numVal);
        }
      } else if (config.type === 'categorical') {
        const strVal = String(value);
        const catMap = this.state.categorical[config.name];
        if (!catMap) throw new Error(`Missing fit state for categorical feature: ${config.name}`);
        // One-hot expansion. The vector always reserves one extra slot for unseen
        // categories, so its width is constant and matches getDimensionality().
        const hotVector = new Array<number>(catMap.size + 1).fill(0);
        const knownIndex = catMap.get(strVal);
        // Known categories use their own index; unseen ones use the fallback slot
        hotVector[knownIndex ?? config.fallbackIndex ?? catMap.size] = 1;
        vector.push(...hotVector);
      }
    });
    return vector;
  }

  getDimensionality(): number {
    let dim = 0;
    this.schema.forEach((config) => {
      if (config.type === 'numerical') dim += 1;
      else if (config.type === 'categorical' && config.categories) dim += config.categories.length + 1; // +1 for fallback
    });
    return dim;
  }
}

Why This Architecture Works
The pipeline enforces a strict contract between data shape and model expectations. Separating fit from transform means inference reuses the exact statistics learned from training data, which removes preprocessing skew between the two (it does not, on its own, correct for drift in the incoming data). The schema-driven approach makes dimensionality explicit, allowing engineers to track feature bloat before it impacts memory or latency. Categorical fallbacks prevent production crashes when new business entities appear. Because transform depends only on the schema and the fitted state, the same logic can run per request in a single-node service or per row in a distributed batch job.
Pitfall Guide
1. Schema Drift Between Training and Inference
Explanation: The order or type of features changes between model training and production inference. The model receives misaligned coordinates, causing silent prediction corruption.
Fix: Enforce a typed schema contract. Validate input shape and type at the API boundary. Use versioned feature registries that reject mismatched payloads.
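The boundary validation described in this fix could look roughly like the sketch below. It is illustrative only: `FieldSpec` and `validatePayload` are hypothetical names, and the rules (finite numbers for numerical fields, strings for categorical fields, no unexpected keys) are assumptions about what "mismatched" means for a given service.

```typescript
// Hypothetical minimal field spec for this sketch; it mirrors name/type from FeatureConfig.
type FieldSpec = { name: string; type: 'numerical' | 'categorical' };

// Returns a list of human-readable problems; an empty list means the payload is acceptable.
function validatePayload(schema: FieldSpec[], payload: Record<string, unknown>): string[] {
  const problems: string[] = [];
  const expected = new Set(schema.map((field) => field.name));

  for (const field of schema) {
    const value = payload[field.name];
    if (value === undefined || value === null) {
      problems.push(`missing field: ${field.name}`);
    } else if (field.type === 'numerical' && (typeof value !== 'number' || !Number.isFinite(value))) {
      problems.push(`expected finite number for ${field.name}`);
    } else if (field.type === 'categorical' && typeof value !== 'string') {
      problems.push(`expected string for ${field.name}`);
    }
  }

  // Unexpected keys often signal a producer running a different schema version.
  for (const key of Object.keys(payload)) {
    if (!expected.has(key)) problems.push(`unexpected field: ${key}`);
  }
  return problems;
}

const payloadSchema: FieldSpec[] = [
  { name: 'account_age_days', type: 'numerical' },
  { name: 'device_type', type: 'categorical' },
];

const validProblems = validatePayload(payloadSchema, { account_age_days: 42, device_type: 'mobile' });
const invalidProblems = validatePayload(payloadSchema, { account_age_days: '42', region: 'eu' });

console.log(validProblems.length);   // 0
console.log(invalidProblems);        // wrong type, missing field, unexpected field
```

Rejecting the payload outright, rather than coercing `'42'` to a number, keeps type drift visible instead of letting it reach the model silently.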
2. Data Leakage via Global Scaling
Explanation: Computing min/max or mean/std across the entire dataset before splitting lets the model indirectly learn test distribution statistics. Validation metrics are inflated, and the model underperforms once deployed.
Fix: Always fit scaling parameters on training data only. Store fitted state and apply it deterministically to validation, test, and inference streams.
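To make the difference concrete, here is a small sketch that fits Z-score statistics on a training split only and then reuses them on held-out values. The function names (`fitZScore`, `applyZScore`) and the toy numbers are illustrative assumptions, not part of the pipeline class above.

```typescript
type ZStats = { mean: number; std: number };

// Fit: compute mean and population standard deviation from the training split only.
function fitZScore(trainValues: number[]): ZStats {
  if (trainValues.length === 0) throw new Error('cannot fit on an empty training split');
  const mean = trainValues.reduce((a, b) => a + b, 0) / trainValues.length;
  const variance = trainValues.reduce((acc, v) => acc + (v - mean) ** 2, 0) / trainValues.length;
  return { mean, std: Math.sqrt(variance) };
}

// Transform: reuse the stored statistics for any other split or for live requests.
function applyZScore(values: number[], stats: ZStats): number[] {
  return values.map((v) => (stats.std === 0 ? 0 : (v - stats.mean) / stats.std));
}

const trainSplit = [10, 20, 30, 40];
const heldOutSplit = [50, 1000];

// Correct: statistics come from the training split alone.
const trainOnlyStats = fitZScore(trainSplit);
const scaledHeldOut = applyZScore(heldOutSplit, trainOnlyStats);

// Leaky alternative: statistics computed over train and held-out data together.
const leakyStats = fitZScore([...trainSplit, ...heldOutSplit]);

console.log(trainOnlyStats.mean); // 25
console.log(leakyStats.mean);     // pulled far upward by the held-out value 1000
console.log(scaledHeldOut);
```

In the leaky variant the held-out value 1000 shifts the mean and inflates the standard deviation, so the training features themselves end up encoded with information the model should never have seen.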
3. One-Hot Dimensionality Explosion
Explanation: High-cardinality categorical features (e.g., user IDs, product SKUs) multiplied by one-hot encoding create sparse, massive vectors. Memory usage and inference latency spike.
Fix: Apply target encoding, frequency encoding, feature hashing, or embedding layers for cardinality > 50. Use dimensionality reduction (PCA, UMAP) or feature selection to prune redundant axes.
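As one illustration of the hashing option, the sketch below maps an arbitrary category string into a fixed number of buckets (the "hashing trick"). The FNV-1a variant here runs over UTF-16 code units for simplicity, and the names `fnv1a`, `hashBucket`, and `hashedOneHot` as well as the bucket count of 16 are assumptions chosen for the example.

```typescript
// 32-bit FNV-1a over UTF-16 code units (a simplification; byte-wise variants also exist).
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Map any category string, seen or unseen, to a bucket in [0, numBuckets).
function hashBucket(category: string, numBuckets: number): number {
  return fnv1a(category) % numBuckets;
}

// Fixed-width one-hot vector: the width depends on numBuckets, not on cardinality.
function hashedOneHot(category: string, numBuckets: number): number[] {
  const vector = new Array<number>(numBuckets).fill(0);
  vector[hashBucket(category, numBuckets)] = 1;
  return vector;
}

const skuVector = hashedOneHot('SKU-000123', 16);
const sameSkuVector = hashedOneHot('SKU-000123', 16);
const unseenSkuVector = hashedOneHot('SKU-never-seen-before', 16);

console.log(skuVector.length);       // 16
console.log(unseenSkuVector.length); // 16, no fallback logic needed
```

The trade-off is collisions: distinct categories can share a bucket, so the bucket count has to be tuned against the accuracy the model needs.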
4. Ignoring Outliers in Min-Max Normalization
Explanation: Extreme values compress the valid range of other features into a narrow band, destroying discriminative power for the majority of data points.
Fix: Use robust scaling (IQR-based) or apply percentile clipping before Min-Max. Z-score standardization is another option: it has no fixed output range to collapse, although its mean and standard deviation are themselves sensitive to extreme values.
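A minimal sketch of IQR-based robust scaling, compared against Min-Max on the same data, is shown below. The helper names (`quantile`, `fitRobust`, `robustScale`) and the linear-interpolation quantile rule are assumptions for illustration; other quantile definitions give slightly different numbers.

```typescript
// Linearly interpolated quantile of an ascending-sorted array, with q in [0, 1].
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lower = Math.floor(pos);
  const upper = Math.ceil(pos);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (pos - lower);
}

type RobustStats = { median: number; iqr: number };

// Fit median and interquartile range on training values only.
function fitRobust(trainValues: number[]): RobustStats {
  if (trainValues.length === 0) throw new Error('cannot fit on an empty training split');
  const sorted = [...trainValues].sort((a, b) => a - b);
  return {
    median: quantile(sorted, 0.5),
    iqr: quantile(sorted, 0.75) - quantile(sorted, 0.25),
  };
}

function robustScale(value: number, stats: RobustStats): number {
  return stats.iqr === 0 ? 0 : (value - stats.median) / stats.iqr;
}

// Eight ordinary values plus one extreme outlier.
const skewedTrain = [1, 2, 3, 4, 5, 6, 7, 8, 1000];
const robustStats = fitRobust(skewedTrain);

const robustSeven = robustScale(7, robustStats); // (7 - 5) / 4 = 0.5
const minMaxSeven = (7 - 1) / (1000 - 1);        // about 0.006: squeezed toward zero

console.log(robustStats, robustSeven, minMaxSeven);
```

With Min-Max, the eight ordinary values all land below 0.01 because the outlier defines the range; the median and IQR are barely affected by it.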
5. Treating High Dimensionality as a Free Lunch
Explanation: Adding more features without regularization or selection increases noise, slows convergence, and triggers overfitting. Distance metrics also become less informative in sparse high-dimensional spaces, because distances between points tend to concentrate around similar values.
Fix: Implement feature importance filtering, L1 regularization, or automated selection pipelines. Monitor validation loss divergence as a signal of dimensionality overload.
6. Mixing Scaling Strategies Across Features
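Explanation: Applying normalization to some features and standardization to others within the same vector leaves features on different scales: Min-Max output is confined to [0, 1], while Z-scores are centered at zero and routinely fall outside that range. Gradient-based optimizers can converge more slowly when input scales differ this way.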
Fix: Standardize all numerical features unless bounded constraints explicitly require Min-Max. Document scaling choices per feature in the schema registry.
7. Hardcoding Category Mappings
Explanation: Embedding category lists directly in transformation logic breaks when business domains expand. Inference fails on unseen values or requires code redeployment.
Fix: Externalize category dictionaries to configuration stores or feature registries. Implement dynamic fallback buckets and monitor unknown category frequency for schema updates.
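Monitoring unknown-category frequency can be as simple as a counter compared against a threshold. The sketch below assumes a hypothetical `UnknownCategoryMonitor` class and a 5% alert threshold; in a real service the counts would likely be exported to a metrics system rather than held in memory.

```typescript
// Tracks how often a categorical feature receives a value outside its known set.
class UnknownCategoryMonitor {
  private known: Set<string>;
  private threshold: number;
  private total = 0;
  private unknown = 0;

  constructor(knownCategories: string[], threshold: number) {
    this.known = new Set(knownCategories);
    this.threshold = threshold;
  }

  // Record one observed raw value.
  observe(value: string): void {
    this.total += 1;
    if (!this.known.has(value)) this.unknown += 1;
  }

  // Fraction of observed values that were not in the known set.
  unknownRate(): number {
    return this.total === 0 ? 0 : this.unknown / this.total;
  }

  // True when the unknown rate exceeds the configured threshold.
  shouldAlert(): boolean {
    return this.unknownRate() > this.threshold;
  }
}

const deviceMonitor = new UnknownCategoryMonitor(['mobile', 'desktop', 'tablet'], 0.05);
['mobile', 'desktop', 'smart_tv', 'mobile'].forEach((value) => deviceMonitor.observe(value));

console.log(deviceMonitor.unknownRate()); // 0.25
console.log(deviceMonitor.shouldAlert()); // true
```

A sustained alert is a prompt to update the externalized category dictionary and refit, not something to fix by editing transformation code.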
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Bounded numerical features (e.g., percentages, ratios) | Min-Max Normalization | Preserves relative ordering within fixed range; simplifies interpretation | Low compute, stable memory |
| Gradient-based or scale-sensitive models (NN, Logistic Regression, SVM) | Z-Score Standardization | Puts features on comparable scales; typically accelerates convergence; does not clip outliers | Moderate compute, faster training |
| High-cardinality categories (>100 unique values) | Target Encoding / Embeddings | Prevents dimensionality explosion; captures predictive signal efficiently | Higher training cost, lower inference latency |
| Real-time inference with strict latency SLA | Pre-computed feature cache + Min-Max | Eliminates runtime transformation overhead; deterministic output | Higher storage cost, predictable latency |
| Exploratory analysis with mixed data types | Unified Z-Score pipeline + One-Hot | Standardizes comparison across features; enables distance-based clustering | Moderate compute, scalable memory |
Configuration Template
// feature-pipeline.config.ts
import { FeatureConfig } from './pipeline.types';

export const USER_BEHAVIOR_SCHEMA: FeatureConfig[] = [
  {
    name: 'account_age_days',
    type: 'numerical',
    scalingMethod: 'zscore',
  },
  {
    name: 'monthly_transaction_count',
    type: 'numerical',
    scalingMethod: 'minmax',
  },
  {
    name: 'device_type',
    type: 'categorical',
    categories: ['mobile', 'desktop', 'tablet'],
    fallbackIndex: 3,
  },
  {
    name: 'subscription_tier',
    type: 'categorical',
    categories: ['free', 'basic', 'pro', 'enterprise'],
    fallbackIndex: 4,
  },
];

export const PIPELINE_VERSION = 'v2.1.0';
export const MAX_FEATURE_DIMENSION = 128;
export const UNKNOWN_CATEGORY_THRESHOLD = 0.05; // Alert if unknowns exceed 5%

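One way a dimension budget such as MAX_FEATURE_DIMENSION might be enforced is a check at startup, sketched below. It assumes each categorical feature contributes `categories.length + 1` slots (one extra for the fallback), as in getDimensionality(); `FeatureSpec`, `vectorWidth`, and `assertWithinBudget` are hypothetical names, and the schema is repeated inline only so the snippet runs on its own.

```typescript
// Trimmed-down stand-in for FeatureConfig, limited to what the width calculation needs.
type FeatureSpec = {
  name: string;
  type: 'numerical' | 'categorical';
  categories?: string[];
};

// Width of the output vector: 1 per numerical feature,
// categories.length + 1 per categorical feature (the +1 is the fallback slot).
function vectorWidth(schema: FeatureSpec[]): number {
  return schema.reduce((dim, feature) => {
    if (feature.type === 'numerical') return dim + 1;
    return dim + (feature.categories ? feature.categories.length + 1 : 0);
  }, 0);
}

// Throws when the schema would produce a vector wider than the budget.
function assertWithinBudget(schema: FeatureSpec[], maxDimension: number): number {
  const computedWidth = vectorWidth(schema);
  if (computedWidth > maxDimension) {
    throw new Error(`Feature vector width ${computedWidth} exceeds budget ${maxDimension}`);
  }
  return computedWidth;
}

const templateSchema: FeatureSpec[] = [
  { name: 'account_age_days', type: 'numerical' },
  { name: 'monthly_transaction_count', type: 'numerical' },
  { name: 'device_type', type: 'categorical', categories: ['mobile', 'desktop', 'tablet'] },
  { name: 'subscription_tier', type: 'categorical', categories: ['free', 'basic', 'pro', 'enterprise'] },
];

const templateWidth = assertWithinBudget(templateSchema, 128);
console.log(templateWidth); // 1 + 1 + (3 + 1) + (4 + 1) = 11
```

Running the check when the service loads its schema surfaces feature bloat at deploy time instead of at the first oversized request.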
Quick Start Guide
- Initialize Schema: Copy the configuration template and define your feature contract. Specify types, scaling methods, and category lists.
- Fit Pipeline: Load the training dataset and call pipeline.fit(trainingData). Store the serialized state in your feature registry.
- Transform Inference: On each prediction request, call pipeline.transform(rawInput). Validate that the output dimensionality matches model expectations.
- Monitor Drift: Track feature distribution statistics in production. Trigger pipeline re-fitting when statistical tests detect significant shift beyond defined thresholds.
- Deploy Versioned Artifacts: Package schema, fitted state, and pipeline code together. Ensure inference services load matching versions to prevent contract mismatches.
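Storing the fitted state needs a little care, because the categorical mappings are `Map` objects and `JSON.stringify` does not serialize Map entries. The sketch below shows one possible round trip with a version check. The names `serializeState` and `deserializeState` are hypothetical, the state shape is redeclared so the snippet is self-contained, and it assumes the pipeline exposes its fitted state somehow (the class above keeps it private).

```typescript
type NumericStats = { min: number; max: number; mean: number; std: number };
type FittedState = {
  numerical: Record<string, NumericStats>;
  categorical: Record<string, Map<string, number>>;
};
type StoredState = {
  version: string;
  numerical: Record<string, NumericStats>;
  categorical: Record<string, [string, number][]>;
};

// Convert each Map to an array of [category, index] pairs so it survives JSON.
function serializeState(state: FittedState, version: string): string {
  const categorical: Record<string, [string, number][]> = {};
  for (const name of Object.keys(state.categorical)) {
    categorical[name] = Array.from(state.categorical[name].entries());
  }
  const stored: StoredState = { version, numerical: state.numerical, categorical };
  return JSON.stringify(stored);
}

// Rebuild the Maps and refuse to load state fitted under a different version.
function deserializeState(json: string, expectedVersion: string): FittedState {
  const stored = JSON.parse(json) as StoredState;
  if (stored.version !== expectedVersion) {
    throw new Error(`State version ${stored.version} does not match expected ${expectedVersion}`);
  }
  const categorical: Record<string, Map<string, number>> = {};
  for (const name of Object.keys(stored.categorical)) {
    categorical[name] = new Map(stored.categorical[name]);
  }
  return { numerical: stored.numerical, categorical };
}

const fittedState: FittedState = {
  numerical: { account_age_days: { min: 0, max: 900, mean: 310, std: 120 } },
  categorical: { device_type: new Map([['mobile', 0], ['desktop', 1], ['tablet', 2]]) },
};

const storedJson = serializeState(fittedState, 'v2.1.0');
const restoredState = deserializeState(storedJson, 'v2.1.0');
console.log(restoredState.categorical.device_type.get('desktop')); // 1
```

Embedding the version in the stored artifact lets an inference service fail fast when it loads state that was fitted under a different schema.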