candidate value and the identifier holding it. Both are normalized before processing: the value is trimmed, and the identifier is lowercased and stripped of separators, so that case differences and stray whitespace do not introduce noise into the feature space.
interface FeatureVector {
  dimensions: number;
  values: Float32Array;
}

// Size the buffer from the group sizes so every write lands inside it.
// Typed arrays silently ignore out-of-range writes, so an undersized buffer
// would drop the trailing format flags without raising an error.
const FORMAT_FLAG_COUNT = 16;
const FEATURE_COUNT = 4 + 8 + 1 + FORMAT_FLAG_COUNT;

function buildCredentialFeatures(
  rawValue: string,
  identifier: string
): FeatureVector {
  // The value keeps its case: character composition depends on it.
  const normalizedValue = rawValue.trim();
  const normalizedId = identifier.toLowerCase().replace(/[_\-]/g, '');
  const vector = new Float32Array(FEATURE_COUNT);
  let index = 0;

  // Group 1: Statistical Properties (4)
  vector[index++] = computeShannonEntropy(normalizedValue);
  vector[index++] = Math.log(normalizedValue.length + 1);
  vector[index++] = computeUniquenessRatio(normalizedValue);
  vector[index++] = computeMaxRunNormalized(normalizedValue);

  // Group 2: Character Composition (8)
  const charProfile = analyzeCharacterDistribution(normalizedValue);
  vector[index++] = charProfile.upper;
  vector[index++] = charProfile.lower;
  vector[index++] = charProfile.numeric;
  vector[index++] = charProfile.special;
  vector[index++] = charProfile.hexadecimal;
  vector[index++] = charProfile.base64Safe;
  vector[index++] = charProfile.printable;
  vector[index++] = charProfile.whitespace;

  // Group 3: Semantic Context (1)
  vector[index++] = evaluateIdentifierRisk(normalizedId);

  // Group 4: Format Signatures (16)
  const signatureFlags = matchKnownFormats(normalizedValue);
  for (let i = 0; i < FORMAT_FLAG_COUNT; i++) {
    vector[index++] = signatureFlags[i] ? 1.0 : 0.0;
  }

  return { dimensions: vector.length, values: vector };
}
Step 2: Mathematical Feature Design
Entropy & Randomness Metrics
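Shannon entropy quantifies character unpredictability in bits per character. Cryptographic secrets cluster between 5.7 and 6.0 bits, while human-chosen strings rarely exceed 4.5. Keep in mind that the measured entropy of an n-character string cannot exceed log2(n), and a hex-only string tops out at 4 bits, so short or hex-encoded keys score below that range no matter how they were generated. We compute it from the character frequency distribution: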
function computeShannonEntropy(input: string): number {
  // Count occurrences of each character.
  const freq = new Map<string, number>();
  for (const char of input) {
    freq.set(char, (freq.get(char) ?? 0) + 1);
  }

  // Sum -p * log2(p) over the observed characters.
  const len = input.length;
  let entropy = 0;
  for (const count of freq.values()) {
    const p = count / len;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}
Raw length is log-scaled to prevent dimension dominance. A 200-character JWT and a 20-character API key should not create a 10x magnitude gap in the feature space. Log transformation compresses the range while preserving relative ordering.
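To make the compression concrete, here is a small sketch that compares the raw and log-scaled lengths for the two inputs mentioned above. The helper name logScaledLength is hypothetical; it applies the same Math.log(length + 1) transform the pipeline uses.

```typescript
// Hypothetical helper mirroring the pipeline's transform: ln(length + 1).
function logScaledLength(length: number): number {
  return Math.log(length + 1);
}

const jwtLength = 200;    // e.g. a 200-character JWT
const apiKeyLength = 20;  // e.g. a 20-character API key

// Raw lengths differ by a factor of 10.
const rawRatio = jwtLength / apiKeyLength;

// After log scaling the gap shrinks to roughly 1.74x, and the ordering is preserved.
const scaledRatio = logScaledLength(jwtLength) / logScaledLength(apiKeyLength);

console.log(rawRatio.toFixed(2), scaledRatio.toFixed(2));
```

The longer string still scores higher, but it no longer dominates the dimension.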
The uniqueness ratio (uniqueChars / totalChars) and normalized longest run both penalize repetitive patterns. Real secrets avoid predictable character repetition. These two metrics work inversely: high uniqueness + low longest run strongly indicates machine-generated randomness.
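The pipeline calls computeUniquenessRatio and computeMaxRunNormalized without showing their bodies. The following is a minimal sketch of how they might look, assuming that "normalized" means the longest run divided by the total character count and that empty input scores 0. Both assumptions are ours, not something the pipeline code confirms.

```typescript
// Sketch: fraction of distinct characters in the string (assumed definition).
function computeUniquenessRatio(input: string): number {
  const chars = Array.from(input); // iterate by code point
  if (chars.length === 0) return 0;
  return new Set(chars).size / chars.length;
}

// Sketch: longest run of one repeated character, divided by total length
// (assumed normalization).
function computeMaxRunNormalized(input: string): number {
  const chars = Array.from(input);
  if (chars.length === 0) return 0;
  let longest = 1;
  let current = 1;
  for (let i = 1; i < chars.length; i++) {
    current = chars[i] === chars[i - 1] ? current + 1 : 1;
    if (current > longest) longest = current;
  }
  return longest / chars.length;
}

// "aaab": 2 distinct of 4 characters, longest run is 3 of 4.
console.log(computeUniquenessRatio("aaab"), computeMaxRunNormalized("aaab"));
```

A random-looking string pushes the first value toward 1 and the second toward 1/length, which is the inverse relationship described above.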
Character Composition Profiling
Eight ratio features capture the "shape" of the string. Instead of counting absolute occurrences, we normalize by string length to maintain scale invariance. The hexadecimal and base64-safe ratios are particularly valuable for disambiguation. SHA-256 hashes yield a hex ratio of 1.0, while JWTs and API keys typically show mixed profiles. Special character ratios help flag human-chosen passwords, which often inject symbols for complexity requirements.
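As an illustration, the eight ratios could be computed along these lines. This is a sketch, not the pipeline's actual analyzeCharacterDistribution: the CharProfile interface name and the exact character classes (for example, treating "special" as ASCII punctuation and counting both standard and URL-safe base64 symbols) are assumptions.

```typescript
// Hypothetical shape of the profile; field names follow the pipeline code.
interface CharProfile {
  upper: number;
  lower: number;
  numeric: number;
  special: number;
  hexadecimal: number;
  base64Safe: number;
  printable: number;
  whitespace: number;
}

function analyzeCharacterDistribution(input: string): CharProfile {
  const chars = Array.from(input);
  const total = chars.length;
  // Share of characters matching a single-character pattern; 0 for empty input.
  const ratio = (pattern: RegExp): number =>
    total === 0 ? 0 : chars.filter((c) => pattern.test(c)).length / total;

  return {
    upper: ratio(/^[A-Z]$/),
    lower: ratio(/^[a-z]$/),
    numeric: ratio(/^[0-9]$/),
    special: ratio(/^[!-\/:-@\[-`{-~]$/),      // ASCII punctuation (assumed definition)
    hexadecimal: ratio(/^[0-9a-fA-F]$/),
    base64Safe: ratio(/^[A-Za-z0-9+\/=_-]$/),  // standard plus URL-safe alphabet (assumed)
    printable: ratio(/^[\x20-\x7E]$/),
    whitespace: ratio(/^\s$/),
  };
}

// A hex string scores 1.0 on the hexadecimal ratio regardless of its length.
console.log(analyzeCharacterDistribution("deadBEEF01"));
```

Because every field is a ratio, a 32-character and a 64-character hex digest produce the same profile.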
Semantic Context Scoring
The identifier risk feature uses a weighted substring lookup. Rather than exact matching, we scan for risk keywords across the normalized variable name. Unknown identifiers receive a moderate baseline (0.3) to avoid penalizing legitimate but unnamed secrets. This single feature consistently ranks highest in permutation importance tests because it directly encodes developer intent.
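A weighted substring lookup of this kind might be sketched as follows. The keyword tiers are taken from the configuration template in this guide, and the 0.3 baseline comes from the text above; the per-tier weights, the first-match-wins ordering, and the stripSeparators helper are assumptions. One detail worth noting: because identifiers have their separators stripped, keywords such as api_key need the same treatment or they can never match.

```typescript
// Tier weights are hypothetical; keywords mirror the configuration template.
const RISK_TIERS: ReadonlyArray<{ weight: number; keywords: string[] }> = [
  { weight: 1.0, keywords: ["password", "secret", "private_key", "privkey", "credential"] },
  { weight: 0.8, keywords: ["api_key", "token", "auth_token", "access_key", "bearer"] },
  { weight: 0.5, keywords: ["key", "auth", "login", "session"] },
  { weight: 0.2, keywords: ["config", "setting", "value", "parameter"] },
  { weight: 0.0, keywords: ["checksum", "hash", "version", "id", "uuid", "color"] },
];

// Baseline for identifiers that match no keyword (value from the text).
const UNKNOWN_BASELINE = 0.3;

// Same normalization for identifiers and keywords, so "api_key" matches "apikey".
const stripSeparators = (s: string): string =>
  s.toLowerCase().replace(/[_\-.]/g, "");

// Scan tiers from most to least risky; the first tier with a substring hit wins.
function evaluateIdentifierRisk(normalizedId: string): number {
  for (const tier of RISK_TIERS) {
    if (tier.keywords.some((kw) => normalizedId.includes(stripSeparators(kw)))) {
      return tier.weight;
    }
  }
  return UNKNOWN_BASELINE;
}

console.log(evaluateIdentifierRisk(stripSeparators("AWS_SECRET_ACCESS_KEY_V2")));
console.log(evaluateIdentifierRisk(stripSeparators("file_hash")));
console.log(evaluateIdentifierRisk(stripSeparators("retry_count")));
```

Short keywords such as id will also match inside unrelated words, so a production version would likely want word-boundary handling on the original identifier before separators are removed.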
Format Signature Flags
Sixteen binary flags act as hard priors. When a value matches a known provider prefix or structural pattern, the corresponding flag activates. These flags don't override the classifier; they shift the probability distribution. The model learns that pattern_aws_access_key = 1 combined with key_name_risk = 0.0 still warrants investigation, while pattern_hex_key_64 = 1 with key_name_risk = 0.0 likely indicates a checksum.
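The flag computation itself can stay very small. The sketch below uses only four of the sixteen signatures from the configuration template to keep it short, so it returns four flags rather than sixteen; the names PATTERN_SIGNATURES and COMPILED_SIGNATURES are hypothetical.

```typescript
// Subset of the signature table, kept as plain strings so it can live in config.
const PATTERN_SIGNATURES: Record<string, string> = {
  aws_access_key: "^AKIA[0-9A-Z]{16}$",
  jwt: "^eyJ[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+$",
  hex_key_32: "^[0-9a-f]{32}$",
  hex_key_64: "^[0-9a-f]{64}$",
};

// Compile once; Object.entries keeps insertion order for string keys,
// so each flag keeps a stable position in the vector.
const COMPILED_SIGNATURES = Object.entries(PATTERN_SIGNATURES).map(
  ([name, source]) => ({ name, regex: new RegExp(source) })
);

// One boolean per signature, in table order.
function matchKnownFormats(value: string): boolean[] {
  return COMPILED_SIGNATURES.map(({ regex }) => regex.test(value));
}

// AWS's documented example access key matches the first signature only.
console.log(matchKnownFormats("AKIAIOSFODNN7EXAMPLE"));
```

Note that the example key above is a documentation fixture, not a live credential, which is exactly why the flags feed the classifier instead of deciding on their own.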
Architecture Rationale
Why 26 features? Dimensionality directly impacts Random Forest training time and memory footprint. Beyond 30-40 features, marginal gains diminish while overfitting risk increases. The 4-8-1-16 split balances statistical, compositional, contextual, and structural signals without redundancy. Random Forests handle mixed data types natively, require no feature scaling, and provide built-in importance metrics. That makes them a good fit for security tooling, where interpretability matters more than the marginal accuracy improvements of gradient boosting or neural networks.
Pitfall Guide
1. Entropy-Only Detection
Explanation: Relying solely on Shannon entropy causes massive false positive rates. UUIDs, SHA-256 digests, base64-encoded images, and compressed payloads all exhibit high randomness.
Fix: Always pair entropy with character distribution and identifier context. Use entropy as a filtering layer, not a classification decision.
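A minimal sketch of entropy used as a filter rather than a verdict is shown below. The 3.0-bit cutoff and the function names are hypothetical; the point is only that a digest-like hex string clears the filter just as a secret would, so the decision has to come from the remaining features.

```typescript
// Shannon entropy in bits per character.
function shannonEntropy(input: string): number {
  const freq = new Map<string, number>();
  for (const ch of input) freq.set(ch, (freq.get(ch) ?? 0) + 1);
  const total = Array.from(input).length;
  let bits = 0;
  for (const count of freq.values()) {
    const p = count / total;
    bits -= p * Math.log2(p);
  }
  return bits;
}

// Hypothetical cutoff; tune it on labeled data.
const MIN_ENTROPY_BITS = 3.0;

// Cheap first pass: drops obviously non-random strings, decides nothing else.
function passesEntropyPrefilter(value: string): boolean {
  return shannonEntropy(value) >= MIN_ENTROPY_BITS;
}

const digestLike = "0123456789abcdef".repeat(4); // 64 hex characters, 4 bits per character
const lowEntropy = "aaaaaaaaaaaaaaaa";           // 0 bits per character

// The digest-like string passes, so it still needs composition and context features.
console.log(passesEntropyPrefilter(digestLike), passesEntropyPrefilter(lowEntropy));
```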
2. Ignoring Variable Naming Conventions
Explanation: Stripping context before analysis removes the strongest predictive signal. A 32-character hex string is harmless as file_hash but critical as db_master_key.
Fix: Preserve and normalize the identifier alongside the value. Implement substring risk scoring rather than exact matching to catch variations like AWS_SECRET_ACCESS_KEY_V2.
3. Hardcoding Regex Thresholds Inside the Pipeline
Explanation: Embedding strict regex conditions directly into feature extraction creates brittle logic that breaks when providers rotate key formats.
Fix: Keep pattern matching declarative and externalized. Store regex signatures in a versioned configuration file. Let the classifier learn interaction weights instead of hardcoding decision boundaries.
4. Feature Leakage Through Overlapping Metrics
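Explanation: Computing multiple features that measure the same underlying property (e.g., uppercase_ratio and alpha_ratio) introduces multicollinearity. Tree ensembles still train, but the redundant features split importance between them, which makes importance rankings unreliable and adds dimensions without adding signal.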
Fix: Audit feature correlations during training. Remove or merge highly correlated dimensions. Prefer orthogonal signals: randomness, composition, context, and structure.
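One way to run such an audit is a pairwise Pearson correlation over the feature columns of the training set. The sketch below is illustrative: the function names, the 0.9 threshold, and the sample columns are all assumptions.

```typescript
// Pearson correlation coefficient of two equal-length columns.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  if (n === 0 || n !== y.length) {
    throw new Error("columns must be non-empty and of equal length");
  }
  const meanX = x.reduce((a, b) => a + b, 0) / n;
  const meanY = y.reduce((a, b) => a + b, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - meanX;
    const dy = y[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  const denom = Math.sqrt(varX * varY);
  return denom === 0 ? 0 : cov / denom; // constant columns are reported as uncorrelated
}

// Return every feature pair whose absolute correlation reaches the threshold.
function findCorrelatedPairs(
  columns: Record<string, number[]>,
  threshold = 0.9
): Array<[string, string, number]> {
  const names = Object.keys(columns);
  const pairs: Array<[string, string, number]> = [];
  for (let i = 0; i < names.length; i++) {
    for (let j = i + 1; j < names.length; j++) {
      const r = pearson(columns[names[i]], columns[names[j]]);
      if (Math.abs(r) >= threshold) pairs.push([names[i], names[j], r]);
    }
  }
  return pairs;
}

// Toy columns: alpha_ratio is exactly twice uppercase_ratio, so the pair is flagged.
const sampleColumns = {
  uppercase_ratio: [0.1, 0.2, 0.3, 0.4],
  alpha_ratio: [0.2, 0.4, 0.6, 0.8],
  entropy: [3, 1, 4, 1],
};
console.log(findCorrelatedPairs(sampleColumns));
```

Pairs that come back from this check are candidates for merging or removal before retraining.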
5. Normalization Drift in Production
Explanation: Inconsistent handling of whitespace, case, or encoding artifacts causes the same secret to produce different vectors across runs, degrading model stability.
Fix: Enforce strict normalization at the pipeline boundary. Trim, lowercase identifiers, and validate UTF-8 encoding before any feature computation. Log normalization events for audit trails.
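A single boundary function keeps these rules in one place. The sketch below is an assumption-laden example: the NormalizedInput shape and function name are hypothetical, the separator set follows the configuration template, and the encoding check is limited to rejecting U+FFFD replacement characters, which usually indicate bytes that failed UTF-8 decoding upstream.

```typescript
// Hypothetical result shape; valueChanged supports audit logging.
interface NormalizedInput {
  value: string;
  identifier: string;
  valueChanged: boolean;
}

function normalizeAtBoundary(rawValue: string, rawIdentifier: string): NormalizedInput {
  // U+FFFD typically appears when invalid bytes were replaced during decoding.
  if (rawValue.includes("\uFFFD") || rawIdentifier.includes("\uFFFD")) {
    throw new Error("input contains replacement characters; check upstream decoding");
  }
  const value = rawValue.trim(); // value keeps its case
  const identifier = rawIdentifier
    .trim()
    .toLowerCase()
    .replace(/[_\-.]/g, ""); // separators as listed in the configuration template
  return { value, identifier, valueChanged: value !== rawValue };
}

// Two spellings of the same input collapse to the same normalized pair.
const first = normalizeAtBoundary("  s3cr3t\n", "DB_Master-Key");
const second = normalizeAtBoundary("s3cr3t", "db.master.key");
console.log(first.value === second.value, first.identifier === second.identifier);
```

Every feature function then receives only the normalized pair, so the same secret cannot yield two different vectors.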
6. Over-Engineering Dimensionality
Explanation: Adding more features under the assumption that "more data equals better accuracy" increases inference latency and training complexity without improving recall.
Fix: Use permutation importance or SHAP values to prune low-impact features. Maintain a lean vector that maximizes signal-to-noise ratio. Re-evaluate quarterly as credential formats evolve.
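Permutation importance is simple enough to sketch without a library: shuffle one feature column, re-score the model, and record how much accuracy drops. In the example below the Predictor type, the seeded generator, and the toy model are all hypothetical stand-ins; a real run would use the trained classifier and a held-out set, and would average over several shuffles.

```typescript
// Hypothetical model interface: feature row in, class label out.
type Predictor = (row: number[]) => number;

function accuracy(predict: Predictor, rows: number[][], labels: number[]): number {
  if (rows.length === 0) return 0;
  let correct = 0;
  rows.forEach((row, i) => {
    if (predict(row) === labels[i]) correct++;
  });
  return correct / rows.length;
}

// Small linear congruential generator so the shuffle is repeatable.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Accuracy drop after shuffling one feature column (single shuffle).
function permutationImportance(
  predict: Predictor,
  rows: number[][],
  labels: number[],
  featureIndex: number,
  random: () => number
): number {
  const baseline = accuracy(predict, rows, labels);
  const column = rows.map((row) => row[featureIndex]);
  // Fisher-Yates shuffle of the selected column.
  for (let i = column.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [column[i], column[j]] = [column[j], column[i]];
  }
  const permuted = rows.map((row, r) => {
    const copy = row.slice();
    copy[featureIndex] = column[r];
    return copy;
  });
  return baseline - accuracy(predict, permuted, labels);
}

// Toy model that only looks at feature 0, so feature 1 has zero importance.
const toyModel: Predictor = (row) => (row[0] >= 0.5 ? 1 : 0);
const toyRows = [[0, 5], [1, 3], [0, 9], [1, 1], [0, 2], [1, 7]];
const toyLabels = toyRows.map((row) => row[0]);
console.log(permutationImportance(toyModel, toyRows, toyLabels, 1, seededRandom(42)));
```

Features whose importance stays near zero across repeated shuffles are the ones to consider pruning.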
7. Treating Pattern Flags as Absolute Truths
Explanation: Assuming a matched regex flag guarantees a secret ignores provider-specific edge cases and test fixtures.
Fix: Treat pattern flags as strong priors, not verdicts. The classifier should weigh them against entropy and context. Implement confidence thresholds that require multiple signals to align before triggering alerts.
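A gate that requires several signals to agree might look like the sketch below. The Signals shape, the entropy and identifier-risk cutoffs, and the "at least two supporting signals" rule are assumptions chosen for illustration; only the 0.82 default echoes the confidence threshold used elsewhere in this guide.

```typescript
// Hypothetical bundle of per-candidate signals.
interface Signals {
  probability: number;     // classifier confidence
  patternMatched: boolean; // any format signature flag set
  entropy: number;         // Shannon entropy in bits per character
  identifierRisk: number;  // identifier risk score, 0 to 1
}

// Alert only when the classifier is confident AND at least two signals agree.
function shouldAlert(s: Signals, threshold = 0.82): boolean {
  if (s.probability < threshold) return false;
  const supporting = [
    s.patternMatched,
    s.entropy >= 4.0,        // assumed cutoff
    s.identifierRisk >= 0.5, // assumed cutoff
  ].filter(Boolean).length;
  return supporting >= 2;
}

// A lone pattern match, such as a test fixture, does not trigger an alert.
console.log(
  shouldAlert({ probability: 0.9, patternMatched: true, entropy: 2.0, identifierRisk: 0.0 })
);
```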
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Pre-commit hooks | 26-Feature Random Forest | Sub-2ms latency, no external dependencies, fully offline | Near-zero infrastructure cost |
| Enterprise-wide audit | LLM-assisted semantic scanning | Handles obfuscated, multi-line, or context-heavy secrets | High compute cost, requires GPU/LLM API |
| Legacy codebase migration | Regex + entropy hybrid | Fast deployment, catches known provider formats immediately | Moderate false positives, requires manual triage |
| Compliance reporting | Feature vector + SHAP analysis | Provides auditable feature importance and decision trails | Low cost, requires model serialization |
Configuration Template
{
  "feature_pipeline": {
    "version": "2.1.0",
    "normalization": {
      "trim_whitespace": true,
      "lowercase_identifiers": true,
      "strip_separators": ["_", "-", "."]
    },
    "risk_keywords": {
      "critical": ["password", "secret", "private_key", "privkey", "credential"],
      "high": ["api_key", "token", "auth_token", "access_key", "bearer"],
      "moderate": ["key", "auth", "login", "session"],
      "low": ["config", "setting", "value", "parameter"],
      "benign": ["checksum", "hash", "version", "id", "uuid", "color"]
    },
    "pattern_signatures": {
      "aws_access_key": "^AKIA[0-9A-Z]{16}$",
      "github_pat": "^gh[pousr]_[A-Za-z0-9]{36}$",
      "github_fine_grained": "^github_pat_[A-Za-z0-9]{82}$",
      "jwt": "^eyJ[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+\\.[A-Za-z0-9_-]+$",
      "openai_key": "^sk-[A-Za-z0-9]{48}$",
      "slack_token": "^xox[baprs]-[A-Za-z0-9-]+$",
      "stripe_secret": "^sk_live_[A-Za-z0-9]{24}$",
      "stripe_publishable": "^pk_live_[A-Za-z0-9]{24}$",
      "google_api": "^AIza[0-9A-Za-z_-]{35}$",
      "heroku_api": "^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
      "private_key_header": "^-----BEGIN (RSA |EC |DSA )?PRIVATE KEY-----",
      "db_connection": "^(postgresql|mysql|mongodb|redis)://",
      "basic_auth": "^[A-Za-z0-9+/]{20,}={0,2}$",
      "bearer_token": "^Bearer [A-Za-z0-9-._~+/]+=*$",
      "hex_key_32": "^[0-9a-f]{32}$",
      "hex_key_64": "^[0-9a-f]{64}$"
    },
    "classification": {
      "model_type": "random_forest",
      "confidence_threshold": 0.82,
      "max_features": 26,
      "n_estimators": 150,
      "min_samples_split": 5
    }
  }
}
Quick Start Guide
- Install dependencies: Add @codcompass/feature-extractor and ml-random-forest to your project. Ensure Node.js 18+ or TypeScript 5.0+ is configured.
- Initialize the pipeline: Import the configuration template, instantiate the feature extractor, and load the serialized Random Forest model (.json or .onnx format).
- Integrate into CI: Add a pre-commit hook or GitHub Action that scans staged files. Pass each candidate string and its surrounding identifier to buildCredentialFeatures(), then invoke model.predict(vector.values).
- Set enforcement thresholds: Start with confidence_threshold: 0.75 in shadow mode. Review false positives for one sprint, then tighten to 0.82 before enabling blocking behavior.
- Monitor drift: Log feature importance scores monthly. If key_name_risk drops below 0.20 or pattern flags trigger excessively, update the signature configuration and retrain with fresh labeled data.
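The shadow-then-blocking rollout from the threshold step can be expressed as a small decision function. This is a sketch with hypothetical names (Mode, Decision, decide); it assumes the model exposes a probability for the "secret" class and does not depend on any particular library API.

```typescript
type Mode = "shadow" | "blocking";

interface Decision {
  flagged: boolean;     // recorded for review in both modes
  blockCommit: boolean; // only ever true in blocking mode
}

// In shadow mode findings are logged but never stop the commit.
function decide(probability: number, threshold: number, mode: Mode): Decision {
  const flagged = probability >= threshold;
  return { flagged, blockCommit: flagged && mode === "blocking" };
}

// Rollout as described above: 0.75 in shadow mode first, then 0.82 with blocking.
console.log(decide(0.8, 0.75, "shadow"));
console.log(decide(0.8, 0.82, "blocking"));
```

Keeping the mode as an explicit parameter makes the switch from shadow to blocking a configuration change instead of a code change.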