I Built fintech-fraud-sim: A TypeScript CLI for Synthetic Fraud Testing Data
Current Situation Analysis
Fraud detection systems consistently fail during production rollout not because of algorithmic deficiencies, but because of inadequate test data. Engineering teams routinely validate risk engines, rules pipelines, and monitoring dashboards using either sanitized production dumps or randomly generated mock rows. Both approaches introduce critical blind spots.
Sanitized production data carries compliance overhead, lacks edge-case density, and rarely contains the precise behavioral cascades that trigger modern risk models. Random mock data, meanwhile, suffers from statistical flatness. Real fraud is not a single boolean flag; it is a correlated sequence of signals spanning identity verification, device topology, geographic drift, transaction velocity, and beneficiary networks. When test data treats these dimensions independently, rules engines produce false negatives, and machine learning models learn spurious correlations.
The industry overlooks this because generating correlated synthetic data requires a stateful generation engine. You cannot simply randomize amount and is_fraud in isolation. A legitimate account takeover scenario requires synchronized updates across multiple entities: a sudden device rotation, a spike in failed authentication attempts, a geographic mismatch between declared residence and IP origin, and a rapid increase in new beneficiary registrations. Without a unified generation context, these signals diverge, rendering the dataset statistically useless for validation.
Modern risk architectures evaluate 15+ features per event. Testing these systems demands datasets where user profiles, transaction histories, and device metadata evolve together under controlled fraud patterns. Synthetic generation that preserves cross-entity correlation is no longer optional; it is a prerequisite for reliable rules engine validation, model prototyping, and compliance-safe QA pipelines.
WOW Moment: Key Findings
The following comparison illustrates why correlated synthetic generation outperforms traditional mock data and sanitized production extracts across critical validation dimensions.
| Approach | Pattern Coverage | Cross-Entity Consistency | Deterministic Repeatability | Compliance Overhead |
|---|---|---|---|---|
| Static Mock Rows | <15% (isolated flags) | Low (randomized independently) | None (non-deterministic) | High (requires PII scrubbing) |
| Sanitized Production | ~60% (historical only) | Medium (drifts over time) | Low (snapshot-dependent) | High (legal review required) |
| Correlated Synthetic | 100% (configurable patterns) | High (unified generation context) | Full (seed-controlled) | Zero (no PII generated) |
This finding matters because it shifts fraud testing from reactive validation to proactive engineering. When datasets maintain temporal and relational consistency, teams can:
- Validate rules engines against known pattern signatures without false positive noise
- Prototype risk scoring models with statistically representative feature distributions
- Run regression suites in CI/CD with identical datasets across environments
- Eliminate legal bottlenecks by removing all personally identifiable information from the generation pipeline
Correlated synthetic data transforms fraud testing from a data hygiene exercise into a deterministic engineering discipline.
Core Solution
Building a reliable synthetic fraud dataset requires a generation engine that maintains state across entities, applies pattern-specific correlation rules, and exposes deterministic control points. The implementation follows a four-phase architecture: parameter configuration, pattern correlation, entity synchronization, and format serialization.
Phase 1: Parameter Configuration
Define generation scope through a structured configuration object. This decouples test intent from execution logic and enables environment-specific overrides.
```typescript
interface FraudDatasetConfig {
  targetVolume: number;
  fraudProbability: number;
  activePatterns: FraudPattern[];
  seed: string | number;
  outputFormat: 'csv' | 'json';
  destinationPath: string;
}

type FraudPattern =
  | 'mule_account'
  | 'account_takeover'
  | 'velocity_abuse'
  | 'kyc_abuse'
  | 'chargeback_risk'
  | 'transaction_spike'
  | 'cross_border_anomaly'
  | 'beneficiary_burst';
```
Phase 2: Pattern Correlation Engine
Each fraud pattern dictates a specific correlation matrix. The generator applies these rules during entity creation to ensure signals align across user profiles, transaction streams, and device metadata.
```typescript
const patternCorrelationMap: Record<FraudPattern, CorrelationRule[]> = {
  account_takeover: [
    { source: 'user', field: 'device_count', condition: '>=', value: 3 },
    { source: 'user', field: 'failed_login_attempts_24h', condition: '>=', value: 5 },
    { source: 'user', field: 'ip_country', relation: 'mismatch', target: 'declared_country' },
    { source: 'transaction', field: 'is_suspicious', condition: '===', value: true }
  ],
  mule_account: [
    { source: 'user', field: 'account_age_days', condition: '<=', value: 7 },
    { source: 'user', field: 'beneficiary_count_24h', condition: '>=', value: 8 },
    { source: 'transaction', field: 'amount', condition: '>', target: 'user_baseline' }
  ]
  // Additional patterns follow identical correlation structures
};
```
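Each entry in the correlation map reduces to a small predicate over a generated record. The following is a minimal evaluator sketch, assuming a CorrelationRule shape matching the literals in the map above; the engine's actual implementation may differ.

```typescript
// Hypothetical evaluator for the CorrelationRule shape shown above.
// A rule either compares a field against a literal value (or against
// another field named by `target`), or asserts a mismatch between two fields.
type Condition = '>' | '>=' | '<=' | '===';

interface CorrelationRule {
  source: 'user' | 'transaction';
  field: string;
  condition?: Condition;
  value?: unknown;
  relation?: 'mismatch';
  target?: string;
}

function satisfiesRule(entity: Record<string, unknown>, rule: CorrelationRule): boolean {
  const actual = entity[rule.field];
  if (rule.relation === 'mismatch' && rule.target) {
    return actual !== entity[rule.target];
  }
  // Fall back to a target field on the same entity when no literal value is given.
  const expected = rule.value ?? (rule.target ? entity[rule.target] : undefined);
  switch (rule.condition) {
    case '>':   return (actual as number) > (expected as number);
    case '>=':  return (actual as number) >= (expected as number);
    case '<=':  return (actual as number) <= (expected as number);
    case '===': return actual === expected;
    default:    return false;
  }
}
```

A generator built on such predicates can re-sample or mutate an entity until every rule for its assigned pattern passes, which is what keeps the signals aligned across dimensions.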
Phase 3: Entity Synchronization
Users and transactions must share a unified generation context. The engine maintains a registry of beneficiaries, devices, and geographic mappings to prevent reference drift.
```typescript
class SyntheticDataOrchestrator {
  private beneficiaryRegistry: Map<string, BeneficiaryProfile> = new Map();
  private deviceTopology: Map<string, DeviceSignature> = new Map();

  async generate(config: FraudDatasetConfig): Promise<GenerationOutput> {
    const rng = this.initializeDeterministicRng(config.seed);
    const users = await this.buildUserPool(config, rng);
    const transactions = await this.linkTransactionStream(users, config, rng);
    return {
      users: this.serialize(users, config.outputFormat),
      transactions: this.serialize(transactions, config.outputFormat),
      metadata: this.generateSummary(users, transactions)
    };
  }
}
```
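Deterministic output hinges on the initializeDeterministicRng call. One plausible sketch hashes the seed with FNV-1a and feeds the result to a mulberry32 generator; the package's actual PRNG choice is an assumption here.

```typescript
// Sketch of a seedable RNG: FNV-1a string hash feeding mulberry32.
// Any string or numeric seed yields a stable, repeatable stream in [0, 1).
function hashSeed(seed: string | number): number {
  const text = String(seed);
  let h = 0x811c9dc5; // FNV-1a offset basis
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // FNV-1a prime
  }
  return h >>> 0;
}

function createRng(seed: string | number): () => number {
  let state = hashSeed(seed);
  return () => {
    // mulberry32 step
    state = (state + 0x6d2b79f5) | 0;
    let t = Math.imul(state ^ (state >>> 15), 1 | state);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```

Because every entity attribute is drawn from this single stream, pinning the seed makes entire datasets reproducible run-to-run.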
Phase 4: Format Serialization & Validation
The output layer abstracts format-specific serialization while preserving schema integrity. CSV and JSON outputs share identical field definitions, enabling seamless ingestion into analytics pipelines, rules engines, or model training frameworks.
```typescript
interface UserProfile {
  user_id: string;
  country: string;
  account_age_days: number;
  kyc_status: 'verified' | 'pending' | 'failed';
  failed_kyc_attempts: number;
  device_count: number;
  ip_country: string;
  declared_country: string;
  failed_login_attempts_24h: number;
  beneficiary_count_24h: number;
  chargeback_count: number;
  is_fraud: boolean;
  fraud_pattern: FraudPattern | null;
  risk_label: 'low' | 'medium' | 'high' | 'critical';
  reason_codes: string[];
}

interface TransactionRecord {
  transaction_id: string;
  user_id: string;
  timestamp: string;
  amount: number;
  currency: string;
  channel: 'web' | 'mobile' | 'api' | 'branch';
  beneficiary_id: string;
  beneficiary_country: string;
  device_id: string;
  ip_country: string;
  status: 'completed' | 'pending' | 'declined' | 'flagged';
  is_suspicious: boolean;
  fraud_pattern: FraudPattern | null;
  reason_codes: string[];
}
```
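One way to guarantee that CSV and JSON outputs share identical field definitions is to drive both serializers from a single ordered field list. The toCsv helper below is an illustrative sketch, not the package's actual output layer; its escaping rules follow common CSV conventions.

```typescript
// Sketch: serialize rows to CSV from one explicit field list, so the
// CSV header and any JSON projection cannot drift apart.
function toCsv<T extends Record<string, unknown>>(
  rows: T[],
  fields: (keyof T & string)[]
): string {
  const escape = (v: unknown): string => {
    if (v === null || v === undefined) return '';
    // Flatten arrays (e.g., reason_codes) with a semicolon delimiter.
    const s = Array.isArray(v) ? v.join(';') : String(v);
    // Quote fields containing commas, quotes, or newlines; double embedded quotes.
    return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  const header = fields.join(',');
  const body = rows.map((r) => fields.map((f) => escape(r[f])).join(','));
  return [header, ...body].join('\n');
}
```

The JSON path can reuse the same field list to project each record, keeping both formats schema-identical by construction.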
Architecture Rationale:
- Deterministic Seeding: Enables regression testing by guaranteeing identical datasets across CI runs. Without this, rules engine validation becomes non-repeatable.
- Pattern Aliasing: Accepts shorthand identifiers (e.g., mule for mule_account) to reduce CLI friction while maintaining strict internal schema validation.
- Format Abstraction: Decouples generation logic from consumption requirements. Teams can switch between CSV for data warehouse ingestion and JSON for API-driven testing without modifying the core engine.
- PII Exclusion: The generator intentionally omits names, emails, phone numbers, national IDs, and bank account details. This eliminates compliance review cycles and enables safe distribution across engineering teams.
Pitfall Guide
1. Decoupled Entity Generation
Explanation: Generating users and transactions in separate processes breaks relational integrity. Transaction user_id references may point to non-existent profiles, or beneficiary counts may not align with actual registration events.
Fix: Use a unified generation context that maintains shared registries. All entities must be instantiated within the same deterministic seed scope.
2. Ignoring Temporal Decay
Explanation: Fraud likelihood correlates strongly with account maturity. New accounts exhibit different risk profiles than established ones. Static generation ignores this temporal dimension.
Fix: Apply age-weighted probability matrices. Reduce fraud pattern activation for accounts older than 180 days unless simulating dormant account reactivation scenarios.
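An age-weighted adjustment could look like the sketch below; the multipliers and the 7-day and 180-day breakpoints are illustrative assumptions, not values from fintech-fraud-sim.

```typescript
// Illustrative age weighting: elevate fraud probability for brand-new
// accounts, keep the configured baseline through maturity, then damp it
// for accounts past the 180-day threshold.
function ageAdjustedFraudProbability(
  baseProbability: number,
  accountAgeDays: number
): number {
  if (accountAgeDays <= 7) return Math.min(1, baseProbability * 3); // new-account spike
  if (accountAgeDays <= 180) return baseProbability;                // baseline window
  return baseProbability * 0.25;                                    // mature accounts damped
}
```

A dormant-reactivation scenario would simply bypass the final damping branch for the accounts it targets.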
3. Static Fraud Rates Across Segments
Explanation: Applying a uniform fraud-rate across all channels, countries, and product types creates unrealistic distributions. Mobile API traffic typically shows different fraud baselines than web or branch channels.
Fix: Implement tiered fraud probabilities based on channel, geography, and account age. Use configuration overrides to simulate regional risk spikes.
4. Missing Reason Code Propagation
Explanation: Setting is_suspicious: true without populating reason_codes or risk_label renders the dataset useless for rules engine validation. Modern risk systems require explicit signal attribution.
Fix: Enforce a validation step that maps each fraud_pattern to its corresponding reason_codes array and risk_label before serialization.
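One way to enforce this is a pre-serialization guard that fails loudly whenever a flagged record lacks mapped attribution. The reason-code strings and the two-pattern map below are illustrative, not the package's canonical codes.

```typescript
// Hypothetical pattern-to-reason-code map; a real deployment would cover
// every FraudPattern the generator supports.
const patternReasonCodes: Record<string, string[]> = {
  account_takeover: ['DEVICE_ROTATION', 'GEO_MISMATCH', 'AUTH_FAILURE_SPIKE'],
  mule_account: ['NEW_ACCOUNT_HIGH_FLOW', 'BENEFICIARY_BURST'],
};

interface LabeledRecord {
  is_suspicious: boolean;
  fraud_pattern: string | null;
  reason_codes: string[];
  risk_label?: string;
}

// Populate reason_codes and risk_label before serialization; throw on
// any suspicious record whose pattern has no mapping.
function enrichReasonCodes(record: LabeledRecord): LabeledRecord {
  if (!record.is_suspicious || !record.fraud_pattern) return record;
  const codes = patternReasonCodes[record.fraud_pattern];
  if (!codes) throw new Error(`No reason codes mapped for pattern: ${record.fraud_pattern}`);
  return { ...record, reason_codes: codes, risk_label: record.risk_label ?? 'high' };
}
```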
5. Non-Deterministic CI Runs
Explanation: Omitting --seed in automated test suites causes dataset drift between pipeline executions. Rules engine assertions fail intermittently, masking genuine regressions.
Fix: Always pin deterministic seeds in CI configurations. Use environment variables to rotate seeds across staging environments while maintaining reproducibility.
6. Overlapping Pattern Conflicts
Explanation: Assigning multiple fraud patterns to a single user creates contradictory signals. An account cannot simultaneously exhibit kyc_abuse and account_takeover without explicit multi-vector simulation logic.
Fix: Implement mutually exclusive pattern assignment by default. Enable multi-pattern simulation only when explicitly configured, with clear priority resolution rules.
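A minimal sketch of priority-based resolution follows; the priority ordering itself is an assumption chosen for illustration.

```typescript
// Hypothetical priority ranking: higher number wins when patterns conflict.
const patternPriority: Record<string, number> = {
  account_takeover: 4,
  mule_account: 3,
  kyc_abuse: 2,
  velocity_abuse: 1,
};

// By default keep only the highest-priority pattern; allow multiple
// patterns only when multi-vector simulation is explicitly enabled.
function resolvePatterns(requested: string[], allowMultiVector = false): string[] {
  if (allowMultiVector || requested.length <= 1) return requested;
  const winner = requested.reduce((a, b) =>
    (patternPriority[a] ?? 0) >= (patternPriority[b] ?? 0) ? a : b
  );
  return [winner];
}
```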
7. Beneficiary Reference Drift
Explanation: Transactions reference beneficiary_id values that were never registered in the user profile's beneficiary_count_24h field. This breaks network analysis and graph-based fraud detection tests.
Fix: Maintain a centralized beneficiary registry during generation. Validate all transaction references against registered beneficiaries before output.
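The validation pass can be as simple as a set-membership sweep over the transaction stream; the record shapes below are trimmed to the fields the check needs.

```typescript
// Sketch: return the IDs of transactions whose beneficiary_id was never
// registered during user generation. An empty result means no drift.
function findDanglingBeneficiaries(
  transactions: { transaction_id: string; beneficiary_id: string }[],
  registry: Set<string>
): string[] {
  return transactions
    .filter((t) => !registry.has(t.beneficiary_id))
    .map((t) => t.transaction_id);
}
```

Running this before serialization turns reference drift from a silent data-quality bug into a hard generation failure.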
Production Bundle
Action Checklist
- Pin deterministic seeds in all CI/CD pipelines to ensure regression test stability
- Validate pattern correlation matrices against known fraud signatures before model training
- Implement schema versioning for synthetic datasets to track feature evolution over time
- Configure tiered fraud probabilities based on channel, geography, and account maturity
- Enforce reason code propagation to maintain rules engine compatibility
- Run cross-entity reference validation to prevent beneficiary and device drift
- Tag all generated datasets with synthetic: true metadata to prevent accidental production ingestion
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Local QA & Dashboard Prototyping | CLI generation with --format csv | Fast iteration, zero setup, human-readable output | Minimal (developer time only) |
| CI/CD Regression Testing | Programmatic API with pinned --seed | Deterministic execution, integrates with test runners, enables assertions | Low (pipeline compute) |
| Model Training & Feature Engineering | JSON output with extended metadata | Structured parsing, preserves nested arrays, supports ML pipelines | Medium (storage + preprocessing) |
| Compliance-Sensitive Environments | Synthetic generation with PII exclusion | Eliminates legal review, enables cross-team sharing, reduces audit scope | High compliance savings |
| Multi-Vector Fraud Simulation | Custom pattern combination with priority resolution | Tests complex attack chains, validates graph-based detection | High (engineering effort) |
Configuration Template
```typescript
// fraud-sim.config.ts
import type { FraudDatasetConfig } from './types';

export const defaultConfig: FraudDatasetConfig = {
  targetVolume: 2500,
  fraudProbability: 0.08,
  activePatterns: [
    'account_takeover',
    'velocity_abuse',
    'cross_border_anomaly',
    'beneficiary_burst'
  ],
  seed: process.env.CI ? 'ci-regression-v1' : 'dev-local',
  outputFormat: 'json',
  destinationPath: './test-data/synthetic-fraud'
};

export const highRiskConfig: FraudDatasetConfig = {
  ...defaultConfig,
  targetVolume: 5000,
  fraudProbability: 0.15,
  activePatterns: ['mule_account', 'kyc_abuse', 'chargeback_risk'],
  seed: 'stress-test-2024-q3'
};
```
```jsonc
// package.json scripts
{
  "scripts": {
    "generate:qa": "fintech-fraud-sim generate --users 1000 --fraud-rate 0.08 --seed qa-suite --format csv --out ./fixtures",
    "generate:ci": "fintech-fraud-sim generate --users 5000 --fraud-rate 0.12 --seed ${CI_COMMIT_SHA} --format json --out ./ci-data",
    "validate:schema": "node scripts/validate-schemas.js ./ci-data",
    "test:rules": "jest --testMatch '**/rules-engine.test.ts'"
  }
}
```
```yaml
# .github/workflows/fraud-validation.yml
name: Fraud Rules Validation
on: [push, pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run generate:ci
        env:
          CI_COMMIT_SHA: ${{ github.sha }}
      - run: npm run validate:schema
      - run: npm run test:rules
```
Quick Start Guide
Install the generator globally or as a dev dependency:
```shell
npm install --save-dev fintech-fraud-sim
```

Generate a baseline dataset with deterministic seeding:

```shell
npx fintech-fraud-sim generate --users 1000 --fraud-rate 0.08 --seed baseline-v1 --format json --out ./test-data
```

Validate schema integrity and cross-entity references:

```shell
node scripts/validate-schemas.js ./test-data
```

Integrate into your test suite using the generated fixtures:

```typescript
import { readFileSync } from 'fs';
import { loadFraudDataset } from './dataset-loader';

const dataset = loadFraudDataset(readFileSync('./test-data/users.json', 'utf-8'));
expect(dataset.fraudPatterns.account_takeover.length).toBeGreaterThan(0);
```

Pin the seed in CI to ensure repeatable validation:

```shell
npx fintech-fraud-sim generate --users 2000 --fraud-rate 0.10 --seed ${GITHUB_SHA} --format csv
```
Synthetic fraud data generation transitions from an ad-hoc task to a repeatable engineering practice when correlation, determinism, and schema integrity are enforced at the architecture level. The patterns outlined here provide a foundation for building reliable, compliance-safe validation pipelines that scale with modern risk systems.
