and appended to the offline store as a historical record. This ensures the online store always reflects the latest state while the offline store maintains a complete audit trail.
2. Feature Definitions as Code
Consistency is enforced by treating feature logic as a versioned artifact rather than pipeline-specific code. A feature definition must be executable in both batch and streaming contexts without modification. This eliminates the risk of divergence caused by reimplementing logic in SQL for batch and custom code for streaming.
// feature-registry.ts
export interface FeatureContext {
entityId: string;
timestamp: Date;
// Additional context available during computation
}
export interface FeatureDefinition<T> {
name: string;
version: string;
owner: string;
// Computation logic must be pure and context-aware
compute: (ctx: FeatureContext, rawEvents: any[]) => T;
// Cold start strategy embedded in definition
defaultValue: T;
// Metadata for governance
schema: string;
ttl?: number; // Time-to-live for online store
}
// Example: User Purchase Frequency
const userPurchaseFrequency: FeatureDefinition<number> = {
name: 'user_purchase_frequency_7d',
version: '1.0.0',
owner: 'payments-team',
defaultValue: 0,
ttl: 604800, // 7 days in seconds
schema: 'number',
compute: (ctx, events) => {
const cutoff = new Date(ctx.timestamp.getTime() - 7 * 24 * 60 * 60 * 1000);
return events
.filter(e => e.type === 'purchase' && e.status === 'completed' && e.ts > cutoff)
.length;
}
};
This definition is the single source of truth. The batch pipeline uses the compute function to generate training datasets, and the streaming pipeline uses the same function to update the online store. When the definition changes, both paths update atomically.
3. Batch Point Lookup API
Inference requests rarely require features for a single entity. A typical prediction involves resolving features for multiple entities (e.g., user, item, context) and multiple feature groups simultaneously. Implementing this as a sequence of individual reads introduces network round-trip overhead that violates latency budgets.
The serving API must support batch point lookups, retrieving all required features in a single operation.
// feature-serving.ts
export interface FeatureRequest {
entityId: string;
featureNames: string[];
}
export interface FeatureVector {
entityId: string;
values: Record<string, any>;
}
export class FeatureServingClient {
constructor(private onlineStore: OnlineStoreAdapter) {}
async getFeatureVectors(requests: FeatureRequest[]): Promise<FeatureVector[]> {
// Pipeline the requests to minimize round trips
const pipeline = this.onlineStore.pipeline();
const requestMap = new Map<string, FeatureRequest>();
requests.forEach(req => {
const key = this.buildEntityKey(req.entityId);
pipeline.hgetall(key);
requestMap.set(key, req);
});
const results = await pipeline.exec();
return results.map((result, index) => {
const key = this.buildEntityKey(requests[index].entityId);
const req = requestMap.get(key)!;
const rawValues = result as Record<string, string> | null;
// Map raw values to requested features, applying defaults if missing
const values = req.featureNames.reduce((acc, fname) => {
const def = this.registry.getDefinition(fname);
acc[fname] = rawValues?.[fname] !== undefined
? this.parseValue(rawValues[fname], def.schema)
: def.defaultValue;
return acc;
}, {} as Record<string, any>);
return { entityId: req.entityId, values };
});
}
private buildEntityKey(entityId: string): string {
return `feature:${entityId}`;
}
}
Architecture Rationale:
- Pipeline Execution: The
pipeline method batches network commands, reducing latency from O(N) round trips to O(1). This is critical for meeting the 5β20ms budget.
- Default Application: Missing values are handled by applying the
defaultValue from the feature definition. This ensures cold start entities receive valid inputs without requiring special logic in the model.
- Schema Parsing: Values are parsed according to the definition's schema, ensuring type safety between storage and model input.
4. Schema Versioning and Governance
Features evolve. A robust feature plane requires explicit versioning to prevent silent schema drift. Each feature definition includes a version string, and models are pinned to specific versions during training.
The registry maintains a catalog of all feature definitions, including ownership, description, and current production version. This enables discoverability and reuse, preventing teams from duplicating effort or creating conflicting feature definitions. When a feature definition is updated, a new version is created. Existing models continue to use the old version, while new models can be trained against the updated definition. This allows for safe rollouts and clean rollbacks.
Pitfall Guide
Production feature stores fail due to predictable architectural mistakes. The following pitfalls highlight common errors and their remedies.
-
Logic Duplication Trap
- Explanation: Implementing feature transformations separately in batch (e.g., SQL) and streaming (e.g., Flink/Spark) pipelines. Even minor differences in handling nulls, timezones, or edge cases cause training-serving skew.
- Fix: Enforce feature definitions as code. Use a shared library or framework that executes the same logic in both contexts. Validate consistency with automated tests comparing batch and stream outputs.
-
N+1 Read Pattern in Serving
- Explanation: Fetching features for each entity or feature group individually during inference. This multiplies network latency and quickly exceeds the serving budget.
- Fix: Implement batch point lookup APIs. Design the online store schema to support retrieving multiple features per entity in a single command (e.g., Redis Hashes).
-
Cold Start as an Afterthought
- Explanation: Returning null or zero for new entities without a defined strategy. Models trained on historical data may behave unpredictably when receiving nulls, leading to degraded predictions for new users or items.
- Fix: Embed cold start defaults in the feature definition. Use segment-specific priors or global averages based on business logic. Ensure the training pipeline also applies these defaults so the model learns to handle them.
-
Silent Schema Drift
- Explanation: Changing a feature's computation or type without updating the version or notifying consumers. Models receive inputs that no longer match their training distribution, causing performance decay.
- Fix: Require version bumps for all definition changes. Pin models to specific feature versions. Implement a governance workflow that reviews changes and validates impact before deployment.
-
Write Path Asymmetry
- Explanation: The online store updates immediately, but the offline store lags significantly due to batch windowing or pipeline delays. This creates a gap where training data is stale relative to production.
- Fix: Monitor replication lag between online and offline stores. Use idempotent writes to handle retries safely. For critical features, consider streaming writes to the offline store or reducing batch window sizes.
-
Over-Provisioning Online Storage
- Explanation: Storing historical feature values in the online store. This increases memory usage, raises costs, and complicates TTL management without benefiting inference.
- Fix: Restrict the online store to current state only. Use TTLs to automatically expire old values. Rely on the offline store for historical queries.
-
Ignoring Cross-Feature Dependencies
- Explanation: Features that depend on other features may be computed out of order or with stale dependencies, leading to inconsistent vectors.
- Fix: Define dependency graphs in the feature registry. Ensure the computation pipeline respects topological order. Use transactional writes where supported to maintain consistency across dependent features.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| High Throughput, Sub-10ms Latency | Redis Cluster or Aerospike | In-memory storage with O(1) access and pipelining support. | High memory cost; requires cluster management. |
| Managed Service, Variable Load | DynamoDB | Serverless scaling, strong consistency options, low ops overhead. | Pay-per-request; can be expensive at high read volumes. |
| Strict ACID, Low Volume | PostgreSQL | Familiar tooling, relational integrity, easy backups. | Limited scalability; higher latency than KV stores. |
| Analytical Heavy, Batch Training | S3 + Parquet / BigQuery | Columnar compression, cost-effective storage, SQL interface. | Low storage cost; higher compute cost for queries. |
| Complex Feature Dependencies | Graph-based Registry | Tracks dependencies, ensures correct computation order. | Increased complexity in pipeline orchestration. |
Configuration Template
Use this YAML template to define features in the registry. This structure supports versioning, governance, and cold start handling.
# feature-definitions/user_engagement.yaml
features:
- name: user_session_count_30d
version: "2.1.0"
owner: "growth-team"
description: "Number of active sessions in the last 30 days."
schema: "integer"
ttl: 2592000 # 30 days
# Cold start strategy
default_value: 0
default_strategy: "global_prior"
# Computation metadata
computation:
type: "aggregation"
window: "30d"
filters:
- "event_type == 'session_start'"
- "duration > 10s"
# Dependencies
depends_on:
- "user_segment"
# Governance
tags:
- "engagement"
- "user-level"
audit_log:
created: "2023-10-01"
last_modified: "2024-01-15"
modified_by: "alice@company.com"
Quick Start Guide
- Provision Stores: Deploy an online store (e.g., Redis) and an offline store (e.g., S3 bucket). Configure network access and authentication.
- Define Features: Create feature definitions using the registry template. Implement the computation logic in a shared library.
- Initialize Offline Store: Run a batch job to populate the offline store with historical feature values. Validate data quality and completeness.
- Stream Updates: Configure the streaming pipeline to compute features in real-time and write updates to the online store. Ensure writes are idempotent.
- Validate Serving: Test the batch lookup API with synthetic requests. Measure latency and verify that feature vectors match expected values. Monitor for skew and lag.