I Built fintech-fraud-sim: A TypeScript CLI for Synthetic Fraud Testing Data
Current Situation Analysis
Fraud detection systems consistently fail during production rollout not because of algorithmic deficiencies, but because of inadequate test data. Engineering teams routinely validate risk engines, rules pipelines, and monitoring dashboards using either sanitized production dumps or randomly generated mock rows. Both approaches introduce critical blind spots.
Sanitized production data carries compliance overhead, lacks edge-case density, and rarely contains the precise behavioral cascades that trigger modern risk models. Random mock data, meanwhile, suffers from statistical flatness. Real fraud is not a single boolean flag; it is a correlated sequence of signals spanning identity verification, device topology, geographic drift, transaction velocity, and beneficiary networks. When test data treats these dimensions independently, rules engines produce false negatives, and machine learning models learn spurious correlations.
The industry overlooks this because generating correlated synthetic data requires a stateful generation engine. You cannot simply randomize amount and is_fraud in isolation. A legitimate account takeover scenario requires synchronized updates across multiple entities: a sudden device rotation, a spike in failed authentication attempts, a geographic mismatch between declared residence and IP origin, and a rapid increase in new beneficiary registrations. Without a unified generation context, these signals diverge, rendering the dataset statistically useless for validation.
Modern risk architectures evaluate 15+ features per event. Testing these systems demands datasets where user profiles, transaction histories, and device metadata evolve together under controlled fraud patterns. Synthetic generation that preserves cross-entity correlation is no longer optional; it is a prerequisite for reliable rules engine validation, model prototyping, and compliance-safe QA pipelines.
WOW Moment: Key Findings
The following comparison illustrates why correlated synthetic generation outperforms traditional mock data and sanitized production extracts across critical validation dimensions.
| Approach | Pattern Coverage | Cross-Entity Consistency | Deterministic Repeatability | Compliance Overhead |
|---|---|---|---|---|
| Static Mock Rows | <15% (isolated flags) | Low (randomized independently) | None (non-deterministic) | High (requires PII scrubbing) |
| Sanitized Production | ~60% (historical only) | Medium (drifts over time) | Low (snapshot-dependent) | High (legal review required) |
| Correlated Synthetic | 100% (configurable patterns) | High (unified generation context) | Full (seed-controlled) | Zero (no PII generated) |
This finding matters because it shifts fraud testing from reactive validation to proactive engineering. When datasets maintain temporal and relational consistency, teams can:
- Validate rules engines against known pattern signatures without false positive noise
- Prototype risk scoring models with statistically representative feature distributions
- Run regression suites in CI/CD with identical datasets across environments
- Eliminate legal bottlenecks by removing all personally identifiable information from the generation pipeline
Correlated synthetic data transforms fraud testing from a data hygiene exercise into a deterministic engineering discipline.
Core Solution
Building a reliable synthetic fraud dataset requires a generation engine that maintains state across entities, applies pattern-specific correlation rules, and exposes deterministic control points. The implementation follows a four-phase architecture: parameter configuration, pattern correlation, entity synchronization, and format serialization.
Phase 1: Parameter Configuration
Define generation scope through a structured configuration object. This decouples test intent from execution logic and enables environment-specific overrides.
```typescript
interface FraudDatasetConfig {
  targetVolume: number;
  fraudProbability: number;
  activePatterns: FraudPattern[];
  seed: string | number;
  outputFormat: 'csv' | 'json';
  destinationPath: string;
}

type FraudPattern =
  | 'mule_account'
  | 'account_takeover'
  | 'velocity_abuse'
  | 'kyc_abuse'
  | 'chargeback_risk'
  | 'transaction_spike'
  | 'cross_border_anomaly'
  | 'beneficiary_burst';
```
Phase 2: Pattern Correlation Engine
Each fraud pattern dictates a specific correlation matrix. The generator applies these rules during entity creation to ensure signals align across user profiles, transaction streams, and device metadata.
```typescript
const patternCorrelationMap: Record<FraudPattern, CorrelationRule[]> = {
  account_takeover: [
    { source: 'user', field: 'device_count', condition: '>=', value: 3 },
    { source: 'user', field: 'failed_login_attempts_24h', condition: '>=', value: 5 },
    { source: 'user', field: 'ip_country', relation: 'mismatch', target: 'declared_country' },
    { source: 'transaction', field: 'is_suspicious', condition: '===', value: true }
  ],
  mule_account: [
    { source: 'user', field: 'account_age_days', condition: '<=', value: 7 },
    { source: 'user', field: 'beneficiary_count_24h', condition: '>=', value: 8 },
    { source: 'transaction', field: 'amount', condition: '>', target: 'user_baseline' }
  ]
  // Additional patterns follow identical correlation structures
};
```
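Each entry in the correlation map reduces to a small predicate over a generated record. The following is a minimal evaluator sketch, assuming a CorrelationRule shape matching the literals in the map above; the engine's actual implementation may differ.

```typescript
// Hypothetical evaluator for the CorrelationRule shape shown above.
// A rule either compares a field against a literal value (or against
// another field named by `target`), or asserts a mismatch between two fields.
type Condition = '>' | '>=' | '<=' | '===';

interface CorrelationRule {
  source: 'user' | 'transaction';
  field: string;
  condition?: Condition;
  value?: unknown;
  relation?: 'mismatch';
  target?: string;
}

function satisfiesRule(entity: Record<string, unknown>, rule: CorrelationRule): boolean {
  const actual = entity[rule.field];
  if (rule.relation === 'mismatch' && rule.target) {
    return actual !== entity[rule.target];
  }
  // Fall back to a target field on the same entity when no literal value is given.
  const expected = rule.value ?? (rule.target ? entity[rule.target] : undefined);
  switch (rule.condition) {
    case '>':   return (actual as number) > (expected as number);
    case '>=':  return (actual as number) >= (expected as number);
    case '<=':  return (actual as number) <= (expected as number);
    case '===': return actual === expected;
    default:    return false;
  }
}
```

A generator built on such predicates can re-sample or mutate an entity until every rule for its assigned pattern passes, which is what keeps the signals aligned across dimensions.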
Phase 3: Entity Synchronization
Users and transactions must share a unified generation context. The engine maintains a registry of beneficiaries, devices, and geographic mappings to prevent reference drift.
```typescript
class SyntheticDataOrchestrator {
  private beneficiaryRegistry: Map<string, BeneficiaryProfile> = new Map();
  private deviceTopology: Map<string, DeviceSignature> = new Map();

  async generate(config: FraudDatasetConfig): Promise<GenerationOutput> {
    const rng = this.initializeDeterministicRng(config.seed);
    const users = await this.buildUserPool(config, rng);
    const transactions = await this.linkTransactionStream(users, config, rng);
    return {
      users: this.serialize(users, config.outputFormat),
      transactions: this.serialize(transactions, config.outputFormat),
      metadata: this.generateSummary(users, transactions)
    };
  }
}
```
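Deterministic output hinges on the initializeDeterministicRng call. One plausible sketch hashes the seed with FNV-1a and feeds the result to a mulberry32 generator; the package's actual PRNG choice is an assumption here.

```typescript
// Sketch of a seedable RNG: FNV-1a string hash feeding mulberry32.
// Any string or numeric seed yields a stable, repeatable stream in [0, 1).
function hashSeed(seed: string | number): number {
  const text = String(seed);
  let h = 0x811c9dc5; // FNV-1a offset basis
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // FNV-1a prime
  }
  return h >>> 0;
}

function createRng(seed: string | number): () => number {
  let state = hashSeed(seed);
  return () => {
    // mulberry32 step
    state = (state + 0x6d2b79f5) | 0;
    let t = Math.imul(state ^ (state >>> 15), 1 | state);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```

Because every entity attribute is drawn from this single stream, pinning the seed makes entire datasets reproducible run-to-run.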
Phase 4: Format Serialization & Validation
The output layer abstracts format-specific serialization while preserving schema integrity. CSV and JSON outputs share identical field definitions, enabling seamless ingestion into analytics pipelines, rules engines, or model training frameworks.
```typescript
interface UserProfile {
  user_id: string;
  country: string;
  account_age_days: number;
  kyc_status: 'verified' | 'pending' | 'failed';
  failed_kyc_attempts: number;
  device_count: number;
  ip_country: string;
  declared_country: string;
  failed_login_attempts_24h: number;
  beneficiary_count_24h: number;
  chargeback_count: number;
  is_fraud: boolean;
  fraud_pattern: FraudPattern | null;
  risk_label: 'low' | 'medium' | 'high' | 'critical';
  reason_codes: string[];
}

interface TransactionRecord {
  transaction_id: string;
  user_id: string;
  timestamp: string;
  amount: number;
  currency: string;
  channel: 'web' | 'mobile' | 'api' | 'branch';
  beneficiary_id: string;
  beneficiary_country: string;
  device_id: string;
  ip_country: string;
  status: 'completed' | 'pending' | 'declined' | 'flagged';
  is_suspicious: boolean;
  fraud_pattern: FraudPattern | null;
  reason_codes: string[];
}
```
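One way to guarantee that CSV and JSON outputs share identical field definitions is to drive both serializers from a single ordered field list. The toCsv helper below is an illustrative sketch, not the package's actual output layer; its escaping rules follow common CSV conventions.

```typescript
// Sketch: serialize rows to CSV from one explicit field list, so the
// CSV header and any JSON projection cannot drift apart.
function toCsv<T extends Record<string, unknown>>(
  rows: T[],
  fields: (keyof T & string)[]
): string {
  const escape = (v: unknown): string => {
    if (v === null || v === undefined) return '';
    // Flatten arrays (e.g., reason_codes) with a semicolon delimiter.
    const s = Array.isArray(v) ? v.join(';') : String(v);
    // Quote fields containing commas, quotes, or newlines; double embedded quotes.
    return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  const header = fields.join(',');
  const body = rows.map((r) => fields.map((f) => escape(r[f])).join(','));
  return [header, ...body].join('\n');
}
```

The JSON path can reuse the same field list to project each record, keeping both formats schema-identical by construction.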
Architecture Rationale:
- Deterministic Seeding: Enables regression testing by guaranteeing identical datasets across CI runs. Without this, rules engine validation becomes non-repeatable.
- Pattern Aliasing: Accepts shorthand identifiers (e.g., mule for mule_account) to reduce CLI friction while maintaining strict internal schema validation.
- Format Abstraction: Decouples generation logic from consumption requirements. Teams can switch between CSV for data warehouse ingestion and JSON for API-driven testing without modifying the core engine.
- PII Exclusion: The generator intentionally omits names, emails, phone numbers, national IDs, and bank account details. This eliminates compliance review cycles and enables safe distribution across engineering teams.
Pitfall Guide
1. Decoupled Entity Generation
Explanation: Generating users and transactions in separate processes breaks relational integrity. Transaction user_id references may point to non-existent profiles, or beneficiary counts may not align with actual registration events.
Fix: Use a unified generation context that maintains shared registries. All entities must be instantiated within the same deterministic seed scope.
2. Ignoring Temporal Decay
Explanation: Fraud likelihood correlates strongly with account maturity. New accounts exhibit different risk profiles than established ones. Static generation ignores this temporal dimension.
Fix: Apply age-weighted probability matrices. Reduce fraud pattern activation for accounts older than 180 days unless simulating dormant account reactivation scenarios.
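An age-weighted adjustment could look like the sketch below; the multipliers and the 7-day and 180-day breakpoints are illustrative assumptions, not values from fintech-fraud-sim.

```typescript
// Illustrative age weighting: elevate fraud probability for brand-new
// accounts, keep the configured baseline through maturity, then damp it
// for accounts past the 180-day threshold.
function ageAdjustedFraudProbability(
  baseProbability: number,
  accountAgeDays: number
): number {
  if (accountAgeDays <= 7) return Math.min(1, baseProbability * 3); // new-account spike
  if (accountAgeDays <= 180) return baseProbability;                // baseline window
  return baseProbability * 0.25;                                    // mature accounts damped
}
```

A dormant-reactivation scenario would simply bypass the final damping branch for the accounts it targets.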
3. Static Fraud Rates Across Segments
Explanation: Applying a uniform fraud-rate across all channels, countries, and product types creates unrealistic distributions. Mobile API traffic typically shows different fraud baselines than web or branch channels.
Fix: Implement tiered fraud probabilities based on channel, geography, and account age. Use configuration overrides to simulate regional risk spikes.
4. Missing Reason Code Propagation
Explanation: Setting is_suspicious: true without populating reason_codes or risk_label renders the dataset useless for rules engine validation. Modern risk systems require explicit signal attribution.
Fix: Enforce a validation step that maps each fraud_pattern to its corresponding reason_codes array and risk_label before serialization.
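One way to enforce this is a pre-serialization guard that fails loudly whenever a flagged record lacks mapped attribution. The reason-code strings and the two-pattern map below are illustrative, not the package's canonical codes.

```typescript
// Hypothetical pattern-to-reason-code map; a real deployment would cover
// every FraudPattern the generator supports.
const patternReasonCodes: Record<string, string[]> = {
  account_takeover: ['DEVICE_ROTATION', 'GEO_MISMATCH', 'AUTH_FAILURE_SPIKE'],
  mule_account: ['NEW_ACCOUNT_HIGH_FLOW', 'BENEFICIARY_BURST'],
};

interface LabeledRecord {
  is_suspicious: boolean;
  fraud_pattern: string | null;
  reason_codes: string[];
  risk_label?: string;
}

// Populate reason_codes and risk_label before serialization; throw on
// any suspicious record whose pattern has no mapping.
function enrichReasonCodes(record: LabeledRecord): LabeledRecord {
  if (!record.is_suspicious || !record.fraud_pattern) return record;
  const codes = patternReasonCodes[record.fraud_pattern];
  if (!codes) throw new Error(`No reason codes mapped for pattern: ${record.fraud_pattern}`);
  return { ...record, reason_codes: codes, risk_label: record.risk_label ?? 'high' };
}
```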
5. Non-Deterministic CI Runs
Explanation: Omitting --seed in automated test suites causes dataset drift between pipeline executions. Rules engine assertions fail intermittently, masking genuine regressions.
Fix: Always pin deterministic seeds in CI configurations. Use environment variables to rotate seeds across staging environments while maintaining reproducibility.
6. Overlapping Pattern Conflicts
Explanation: Assigning multiple fraud patterns to a single user creates contradictory signals. An account cannot simultaneously exhibit kyc_abuse and account_takeover without explicit multi-vector simulation logic.
Fix: Implement mutually exclusive pattern assignment by default. Enable multi-pattern simulation only when explicitly configured, with clear priority resolution rules.
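A minimal sketch of priority-based resolution follows; the priority ordering itself is an assumption chosen for illustration.

```typescript
// Hypothetical priority ranking: higher number wins when patterns conflict.
const patternPriority: Record<string, number> = {
  account_takeover: 4,
  mule_account: 3,
  kyc_abuse: 2,
  velocity_abuse: 1,
};

// By default keep only the highest-priority pattern; allow multiple
// patterns only when multi-vector simulation is explicitly enabled.
function resolvePatterns(requested: string[], allowMultiVector = false): string[] {
  if (allowMultiVector || requested.length <= 1) return requested;
  const winner = requested.reduce((a, b) =>
    (patternPriority[a] ?? 0) >= (patternPriority[b] ?? 0) ? a : b
  );
  return [winner];
}
```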
7. Beneficiary Reference Drift
Explanation: Transactions reference beneficiary_id values that were never registered in the user profile's beneficiary_count_24h field. This breaks network analysis and graph-based fraud detection tests.
Fix: Maintain a centralized beneficiary registry during generation. Validate all transaction references against registered beneficiaries before output.
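The validation pass can be as simple as a set-membership sweep over the transaction stream; the record shapes below are trimmed to the fields the check needs.

```typescript
// Sketch: return the IDs of transactions whose beneficiary_id was never
// registered during user generation. An empty result means no drift.
function findDanglingBeneficiaries(
  transactions: { transaction_id: string; beneficiary_id: string }[],
  registry: Set<string>
): string[] {
  return transactions
    .filter((t) => !registry.has(t.beneficiary_id))
    .map((t) => t.transaction_id);
}
```

Running this before serialization turns reference drift from a silent data-quality bug into a hard generation failure.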
Production Bundle
Action Checklist
- Pin deterministic seeds in all CI/CD pipelines to ensure regression test stability
- Validate pattern correlation matrices against known fraud signatures before model training
- Implement schema versioning for synthetic datasets to track feature evolution over time
- Configure tiered fraud probabilities based on channel, geography, and account maturity
- Enforce reason code propagation to maintain rules engine compatibility
- Run cross-entity reference validation to prevent beneficiary and device drift
- Tag all generated datasets with synthetic: true metadata to prevent accidental production ingestion
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Local QA & Dashboard Prototyping | CLI generation with --format csv | Fast iteration, zero setup, human-readable output | Minimal (developer time only) |
| CI/CD Regression Testing | Programmatic API with pinned --seed | Deterministic execution, integrates with test runners, enables assertions | Low (pipeline compute) |
| Model Training & Feature Engineering | JSON output with extended metadata | Structured parsing, preserves nested arrays, supports ML pipelines | Medium (storage + preprocessing) |
| Compliance-Sensitive Environments | Synthetic generation with PII exclusion | Eliminates legal review, enables cross-team sharing, reduces audit scope | High compliance savings |
| Multi-Vector Fraud Simulation | Custom pattern combination with priority resolution | Tests complex attack chains, validates graph-based detection | High (engineering effort) |
Configuration Template
```typescript
// fraud-sim.config.ts
import type { FraudDatasetConfig } from './types';

export const defaultConfig: FraudDatasetConfig = {
  targetVolume: 2500,
  fraudProbability: 0.08,
  activePatterns: [
    'account_takeover',
    'velocity_abuse',
    'cross_border_anomaly',
    'beneficiary_burst'
  ],
  seed: process.env.CI ? 'ci-regression-v1' : 'dev-local',
  outputFormat: 'json',
  destinationPath: './test-data/synthetic-fraud'
};

export const highRiskConfig: FraudDatasetConfig = {
  ...defaultConfig,
  targetVolume: 5000,
  fraudProbability: 0.15,
  activePatterns: ['mule_account', 'kyc_abuse', 'chargeback_risk'],
  seed: 'stress-test-2024-q3'
};
```
```jsonc
// package.json scripts
{
  "scripts": {
    "generate:qa": "fintech-fraud-sim generate --users 1000 --fraud-rate 0.08 --seed qa-suite --format csv --out ./fixtures",
    "generate:ci": "fintech-fraud-sim generate --users 5000 --fraud-rate 0.12 --seed ${CI_COMMIT_SHA} --format json --out ./ci-data",
    "validate:schema": "node scripts/validate-schemas.js ./ci-data",
    "test:rules": "jest --testMatch '**/rules-engine.test.ts'"
  }
}
```
```yaml
# .github/workflows/fraud-validation.yml
name: Fraud Rules Validation
on: [push, pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run generate:ci
        env:
          CI_COMMIT_SHA: ${{ github.sha }}
      - run: npm run validate:schema
      - run: npm run test:rules
```
Quick Start Guide
Install the generator globally or as a dev dependency:
```shell
npm install --save-dev fintech-fraud-sim
```

Generate a baseline dataset with deterministic seeding:

```shell
npx fintech-fraud-sim generate --users 1000 --fraud-rate 0.08 --seed baseline-v1 --format json --out ./test-data
```

Validate schema integrity and cross-entity references:

```shell
node scripts/validate-schemas.js ./test-data
```

Integrate into your test suite using the generated fixtures:

```typescript
import { readFileSync } from 'fs';
import { loadFraudDataset } from './dataset-loader';

const dataset = loadFraudDataset(readFileSync('./test-data/users.json', 'utf-8'));
expect(dataset.fraudPatterns.account_takeover.length).toBeGreaterThan(0);
```

Pin the seed in CI to ensure repeatable validation:

```shell
npx fintech-fraud-sim generate --users 2000 --fraud-rate 0.10 --seed ${GITHUB_SHA} --format csv
```
Synthetic fraud data generation transitions from an ad-hoc task to a repeatable engineering practice when correlation, determinism, and schema integrity are enforced at the architecture level. The patterns outlined here provide a foundation for building reliable, compliance-safe validation pipelines that scale with modern risk systems.
