Deterministic Detection, Generative Reporting: A Hybrid Architecture for Auditable SOC Triage

Current Situation Analysis

Security operations teams are rapidly integrating large language models into triage workflows, but a fundamental architectural mismatch is causing reliability failures. The industry treats LLMs as autonomous decision engines, feeding them raw telemetry and expecting consistent, auditable verdicts. In practice, this approach introduces probabilistic behavior into a domain that demands deterministic certainty.

The core problem is context drift and prompt sensitivity. Feed identical log batches to the same model with minor temperature variations or different system prompts, and you will receive divergent severity ratings, conflicting MITRE ATT&CK mappings, and inconsistent next-step recommendations. For compliance, incident response, and executive reporting, this variability is unacceptable. Auditors require reproducible findings. Engineers require version-controlled logic. LLMs provide neither when tasked with detection.

This issue is frequently misunderstood because early demos mask the underlying fragility. Vendors showcase polished chat interfaces that appear to "analyze" logs, but the heavy lifting is often done by hidden deterministic parsers or heavily constrained output schemas. When teams attempt to replicate this by prompting raw logs directly, they encounter predictable failure modes: hallucinated event IDs, misclassification of benign system processes (e.g., flagging svchost.exe as a living-off-the-land binary), malformed JSON responses that break downstream parsers, and context window exhaustion during high-volume ingestion.

The practical reality is straightforward. LLMs excel at natural language generation and summarization. They struggle with stateful counting, threshold evaluation, and strict logical branching. Security triage requires all three. The solution is not to abandon AI, but to reposition it. By decoupling detection logic from report generation, teams get findings that are fully reproducible for a given input and rule version, while using generative models exclusively for analyst-facing documentation.

WOW Moment: Key Findings

The architectural split between deterministic detection and generative reporting fundamentally changes how security pipelines behave in production. The following comparison illustrates the operational impact of treating AI as a writer rather than a detector.

Approach	Consistency	Auditability	False Positive Rate	Implementation Complexity
AI-Driven Detection	~65-80% (varies by model/temp)	Low (prompt-dependent)	High (hallucination/context drift)	Low initially, high maintenance
Deterministic + AI Reporting	100%	High (code-reviewed rules)	Low (tuned thresholds)	Moderate upfront, low maintenance

This finding matters because it shifts the security pipeline from a black-box experiment to a version-controlled engineering discipline. Deterministic rules can be unit-tested, peer-reviewed, and tracked in Git. Risk scores become mathematical functions rather than model guesses. The AI layer becomes a pluggable output formatter that can be swapped between providers (e.g., Claude, GPT-4, local Ollama instances) without altering the underlying findings or compliance mappings. Teams gain predictable triage, reduced alert fatigue, and a clear separation between signal generation and narrative delivery.

Core Solution

Building a hybrid triage pipeline requires strict layering. Each stage must have a single responsibility, explicit input/output contracts, and zero cross-contamination between logic and language generation.

Step 1: Unified Event Schema

Security telemetry arrives in fragmented formats. Windows Security logs, Sysmon events, Linux authentication records, and cloud audit trails use different field names, timestamp formats, and severity taxonomies. The first architectural decision is to normalize all inputs into a strict intermediate representation before any evaluation occurs.

interface NormalizedEvent {
  eventId: string;
  timestamp: Date;
  sourceIp: string | null;
  destinationIp: string | null;
  user: string;
  processName: string;
  commandLine: string | null;
  rawProvider: 'windows-security' | 'sysmon' | 'linux-auth' | 'generic-json';
}

Normalization strips provider-specific noise and aligns fields to a common vocabulary. This enables rules to evaluate events without conditional branching for log source types.
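To make the adapter idea concrete, here is a minimal sketch of one normalizer for Windows Security logon events. The `RawWindowsSecurityRecord` shape, the `normalizeWindowsSecurity` function, and the 'auth-success' / 'auth-failure' vocabulary are assumptions for illustration; actual field names depend on the EVTX parser in use. Event IDs 4624 (successful logon) and 4625 (failed logon) are the standard Windows values.

```typescript
// Canonical schema, repeated here so the sketch compiles on its own.
interface NormalizedEvent {
  eventId: string;
  timestamp: Date;
  sourceIp: string | null;
  destinationIp: string | null;
  user: string;
  processName: string;
  commandLine: string | null;
  rawProvider: 'windows-security' | 'sysmon' | 'linux-auth' | 'generic-json';
}

// Hypothetical shape of an already-parsed Windows Security record.
interface RawWindowsSecurityRecord {
  EventID: number;
  TimeCreated: string;      // expected to be ISO 8601
  TargetUserName?: string;
  IpAddress?: string;       // "-" when the record carries no address
  ProcessName?: string;
}

// Windows logon event IDs mapped to the canonical vocabulary the rules use.
const WINDOWS_EVENT_IDS: Record<number, string> = {
  4624: 'auth-success',
  4625: 'auth-failure',
};

function normalizeWindowsSecurity(raw: RawWindowsSecurityRecord): NormalizedEvent | null {
  const eventId = WINDOWS_EVENT_IDS[raw.EventID];
  const timestamp = new Date(raw.TimeCreated);
  // Drop records the rules cannot reason about instead of guessing values.
  if (!eventId || Number.isNaN(timestamp.getTime())) return null;

  return {
    eventId,
    timestamp,
    sourceIp: raw.IpAddress && raw.IpAddress !== '-' ? raw.IpAddress : null,
    destinationIp: null,
    user: raw.TargetUserName ?? 'unknown',
    processName: raw.ProcessName ?? 'unknown',
    commandLine: null,
    rawProvider: 'windows-security',
  };
}
```

Returning `null` for unmapped or unparseable records keeps the decision about what to discard in one reviewable place, rather than scattering defensive checks across rules.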

Step 2: Deterministic Rule Engine

Detection logic must be implemented as pure functions. Each rule accepts an array of normalized events, evaluates conditions over that batch (counts, sequences, thresholds) using only local state, and returns an array of evidence strings. An empty array indicates no match. This design enables unit testing, deterministic execution, and clear audit trails.

type Evidence = string;

interface DetectionRule {
  id: string;
  title: string;
  severity: 'low' | 'medium' | 'high' | 'critical';
  mitreTechniques: string[];
  evaluate(events: NormalizedEvent[]): Evidence[];
}

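const FAILURE_THRESHOLD = 5;

const consecutiveFailedLoginsFollowedBySuccess: DetectionRule = {
  id: 'auth-brute-force-success-chain',
  title: 'Successful authentication following repeated failures',
  severity: 'critical',
  mitreTechniques: ['T1110', 'T1078'],
  evaluate(events) {
    // Walk events in chronological order so only failures that precede a success count.
    const ordered = [...events].sort(
      (a, b) => a.timestamp.getTime() - b.timestamp.getTime()
    );
    const failureStreaks = new Map<string, number>();
    const evidence: Evidence[] = [];

    for (const evt of ordered) {
      if (!evt.sourceIp) continue;

      if (evt.eventId === 'auth-failure') {
        failureStreaks.set(evt.sourceIp, (failureStreaks.get(evt.sourceIp) ?? 0) + 1);
      } else if (evt.eventId === 'auth-success') {
        const failures = failureStreaks.get(evt.sourceIp) ?? 0;
        if (failures >= FAILURE_THRESHOLD) {
          evidence.push(
            `Account "${evt.user}" authenticated successfully from ${evt.sourceIp} after ${failures} consecutive failures.`
          );
        }
        // A success ends the current streak for this source IP.
        failureStreaks.set(evt.sourceIp, 0);
      }
    }

    return evidence;
  }
};

The rule walks events in chronological order, keeps a running failure count per source IP, and emits evidence when a success arrives after at least five failures from that address. A success resets the streak, so failures that occur after a login never count toward an earlier one. The threshold (FAILURE_THRESHOLD = 5) is explicit, testable, and version-controlled. Background scanning noise is filtered because isolated failures without a follow-up success do not trigger the rule. The rule applies no time window of its own, so callers should pass batches that are already bounded to the period of interest.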
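Because a rule is a pure function over a batch, it can be exercised with plain synthetic events and no test framework. The following is a minimal table-driven sketch. The `evaluateLogClearing` rule, the 'log-cleared' event ID, and the `makeEvent`, `runCases`, and `isDeterministic` helpers are all hypothetical names introduced for illustration; they are not defined elsewhere in this article.

```typescript
interface NormalizedEvent {
  eventId: string;
  timestamp: Date;
  sourceIp: string | null;
  destinationIp: string | null;
  user: string;
  processName: string;
  commandLine: string | null;
  rawProvider: 'windows-security' | 'sysmon' | 'linux-auth' | 'generic-json';
}
type Evidence = string;

// Hypothetical rule: flags every audit-log clearing event in the batch.
function evaluateLogClearing(events: NormalizedEvent[]): Evidence[] {
  return events
    .filter(evt => evt.eventId === 'log-cleared')
    .map(evt => `User "${evt.user}" cleared an audit log at ${evt.timestamp.toISOString()}.`);
}

// Builds a synthetic event with safe defaults; each case overrides what matters.
function makeEvent(overrides: Partial<NormalizedEvent>): NormalizedEvent {
  return {
    eventId: 'auth-success',
    timestamp: new Date('2024-05-01T10:00:00Z'),
    sourceIp: null,
    destinationIp: null,
    user: 'alice',
    processName: 'unknown',
    commandLine: null,
    rawProvider: 'generic-json',
    ...overrides,
  };
}

interface RuleCase {
  name: string;
  events: NormalizedEvent[];
  expectedEvidenceCount: number;
}

const cases: RuleCase[] = [
  { name: 'empty batch', events: [], expectedEvidenceCount: 0 },
  { name: 'benign activity only', events: [makeEvent({})], expectedEvidenceCount: 0 },
  { name: 'single clearing event', events: [makeEvent({ eventId: 'log-cleared' })], expectedEvidenceCount: 1 },
];

type Evaluate = (events: NormalizedEvent[]) => Evidence[];

// Returns the names of failing cases; an empty array means every case passed.
function runCases(evaluate: Evaluate, suite: RuleCase[]): string[] {
  return suite
    .filter(c => evaluate(c.events).length !== c.expectedEvidenceCount)
    .map(c => c.name);
}

// Two runs over the same batch must produce identical evidence.
function isDeterministic(evaluate: Evaluate, events: NormalizedEvent[]): boolean {
  return JSON.stringify(evaluate(events)) === JSON.stringify(evaluate(events));
}

const failedCases = runCases(evaluateLogClearing, cases);
```

The same case table can be handed to any test runner; the point of the sketch is that each expectation is data that can be reviewed and versioned alongside the rule.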

Step 3: Risk Scoring & Aggregation

Severity labels are insufficient for triage prioritization. A deterministic scoring algorithm aggregates rule matches, applies weight multipliers, and produces a normalized 0-100 risk index.

function calculateRiskScore(
  matchedRules: DetectionRule[],
  eventVolume: number
): number {
  const severityWeights: Record<DetectionRule['severity'], number> = {
    low: 10,
    medium: 25,
    high: 50,
    critical: 80
  };

  let rawScore = 0;
  matchedRules.forEach(rule => {
    rawScore += severityWeights[rule.severity];
  });

  // Logarithmic boost: each tenfold increase in volume adds a fixed 0.15,
  // so the modifier grows slowly. Clamping to 1 avoids log10(0) = -Infinity.
  const volumeModifier = 1 + Math.log10(Math.max(1, eventVolume)) * 0.15;
  const finalScore = Math.min(100, Math.round(rawScore * volumeModifier));

  return finalScore;
}

Weights are summed, so the spacing between tiers does the prioritizing: one critical match (80) outweighs three medium matches (75), and the cap at 100 keeps the index bounded. Many low-severity matches can still add up, so tune the weights or cap per-tier contributions if that becomes a problem. The volume modifier grows with the base-10 logarithm of the event count, so a tenfold alert storm adds 15% of the raw score rather than multiplying it tenfold.
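A few worked values make the scoring behavior easier to review. The sketch below is one possible variant that reads its weights from a config object instead of hardcoding them. The field names mirror the `scoring` block of the configuration template, while the `scoreMatches` function and the exact modifier formula (1 + log10(volume) × factor, with volume clamped to at least 1) are assumptions about the intended behavior.

```typescript
type Severity = 'low' | 'medium' | 'high' | 'critical';

interface ScoringConfig {
  severityWeights: Record<Severity, number>;
  maxScore: number;
  volumeDecayFactor: number;
}

const scoring: ScoringConfig = {
  severityWeights: { low: 10, medium: 25, high: 50, critical: 80 },
  maxScore: 100,
  volumeDecayFactor: 0.15,
};

// Assumed formula: the modifier grows with log10(volume), so every tenfold
// increase in events adds the same fixed boost.
function scoreMatches(severities: Severity[], eventVolume: number, cfg: ScoringConfig): number {
  const rawScore = severities.reduce((sum, s) => sum + cfg.severityWeights[s], 0);
  // Clamp to 1 so an empty batch yields log10(1) = 0 instead of -Infinity.
  const volumeModifier = 1 + Math.log10(Math.max(1, eventVolume)) * cfg.volumeDecayFactor;
  return Math.min(cfg.maxScore, Math.round(rawScore * volumeModifier));
}

const oneCritical = scoreMatches(['critical'], 10, scoring);     // 80 * 1.15 = 92
const twoLows = scoreMatches(['low', 'low'], 100, scoring);      // 20 * 1.30 = 26
const capped = scoreMatches(['critical', 'high'], 1, scoring);   // 130, capped at 100
const emptyBatch = scoreMatches(['medium'], 0, scoring);         // 25, volume clamp applies
```

Keeping the weights in config means a threshold change shows up as a one-line diff that can be reviewed and rolled back like any other code change.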

Step 4: Generative Reporting Layer

The AI component sits at the terminal stage. It receives the structured findings, risk score, and MITRE mappings, then generates analyst-ready prose. The prompt template is strict, and the model is treated as a text formatter, not a logic engine.

interface TriageReportInput {
  riskScore: number;
  findings: {
    ruleId: string;
    title: string;
    mitreTechniques: string[];
    evidence: string[];
  }[];
  timeWindow: string;
}

function buildReportPrompt(input: TriageReportInput): string {
  // Include the rule-supplied technique IDs so the model never has to infer them.
  const findingsBlock = input.findings
    .map(f =>
      `- ${f.title} (${f.ruleId})\n` +
      `  MITRE ATT&CK: ${f.mitreTechniques.join(', ')}\n` +
      `  Evidence: ${f.evidence.join('; ')}`
    )
    .join('\n');

  return `
    Generate a concise security triage summary based on the following structured findings.
    Do not invent events, modify severity, or alter MITRE mappings.
    
    Risk Score: ${input.riskScore}/100
    Time Window: ${input.timeWindow}
    Findings:
    ${findingsBlock}
    
    Output format:
    1. Executive Summary (2-3 sentences)
    2. Prioritized Findings (bullet list with next steps)
    3. MITRE ATT&CK Context (brief mapping explanation)
  `;
}

Swapping the underlying provider requires only changing the HTTP client or SDK call. The findings, scores, and technique IDs remain identical. This abstraction enables cost optimization, latency tuning, and compliance alignment without rewriting detection logic.
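One way to keep the provider swappable is to hide it behind a small interface and fall back to a deterministic template when a call fails. This is a sketch under assumptions: `ReportProvider`, `templateProvider`, `flakyProvider`, and `generateReport` are hypothetical names, and no real vendor SDK is called. An actual provider would wrap whichever client library the team uses.

```typescript
// Hypothetical contract: anything that turns a prompt into report text.
interface ReportProvider {
  name: string;
  generate(prompt: string): Promise<string>;
}

// Deterministic stand-in for tests and outages: no network call, no model.
const templateProvider: ReportProvider = {
  name: 'template-fallback',
  async generate(prompt) {
    return `Automated summary unavailable. Structured findings follow.\n${prompt.trim()}`;
  },
};

// Simulates a provider outage for demonstration purposes.
const flakyProvider: ReportProvider = {
  name: 'flaky',
  async generate() {
    throw new Error('simulated timeout');
  },
};

// Tries providers in order and returns the first successful response,
// recording which provider produced it for the audit trail.
async function generateReport(
  prompt: string,
  providers: ReportProvider[]
): Promise<{ provider: string; text: string }> {
  for (const provider of providers) {
    try {
      return { provider: provider.name, text: await provider.generate(prompt) };
    } catch {
      // The findings are unaffected by a provider failure, so try the next one.
    }
  }
  throw new Error('No report provider succeeded');
}
```

Because detection and scoring have already completed before this function runs, a provider outage degrades only the prose, never the findings themselves.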

Pitfall Guide

1. Feeding Raw Logs Directly to LLMs

Explanation: Raw telemetry contains provider-specific formatting, timestamps, and noise that consume context windows and confuse pattern recognition. LLMs will attempt to parse structure they were not trained to validate. Fix: Always normalize logs into a strict intermediate schema before evaluation. Use deterministic parsers for EVTX, Sysmon, and auth.log formats.

2. Using Prompts for Stateful Logic

Explanation: LLMs lack persistent memory across inference calls. Attempting to count failures, track time windows, or evaluate thresholds via prompts results in inconsistent outputs and hidden state bugs. Fix: Move all counting, aggregation, and threshold evaluation to TypeScript. Treat the model as a stateless text generator.

3. Ignoring Schema Normalization

Explanation: Rules that check for event_id in Windows logs but eventType in Linux logs will fail silently or produce false negatives. Inconsistent field naming breaks deterministic evaluation. Fix: Define a canonical event interface. Write adapter functions for each log source that map provider fields to the canonical schema before rules execute.

4. Skipping Rule Unit Tests

Explanation: Security rules drift when thresholds change or new log formats are introduced. Without tests, regressions go unnoticed until production incidents occur. Fix: Maintain a synthetic event dataset covering edge cases. Run rules through a test harness that verifies evidence output, severity assignment, and MITRE mapping for each scenario.

5. Mixing Scoring with Generation

Explanation: Asking an LLM to calculate a risk score introduces variance. Two identical runs may yield different numerical priorities, breaking automated escalation workflows. Fix: Compute scores deterministically. Pass the final numerical value to the AI solely for contextual explanation in the report.

6. Assuming Native MITRE Understanding

Explanation: Models map techniques inconsistently. T1110 (Brute Force) might be described by its parent tactic, "Credential Access," in one run and by its technique name in another, or the model may pick a different ID altogether. Compliance audits require exact technique IDs. Fix: Hardcode MITRE ATT&CK IDs in rule definitions. Let the AI reference the provided IDs rather than inferring them.

7. Over-Engineering the Prompt Template

Explanation: Complex prompts with conditional branching, JSON schema enforcement, and multi-step reasoning increase latency and failure rates. LLMs perform best with clear, single-purpose instructions. Fix: Keep prompts declarative. Provide structured data, specify output sections, and forbid logical modifications. Validate AI output against a schema before rendering.
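Validating the model's output can itself be deterministic. The sketch below checks a plain-text report for the three sections requested in the prompt's output format and rejects any ATT&CK technique ID that the rules did not supply. The `checkReport` function and `ReportCheck` type are hypothetical; a pipeline that requests JSON output would validate parsed fields instead.

```typescript
interface ReportCheck {
  ok: boolean;
  problems: string[];
}

// Section headings requested by the prompt's "Output format" block.
const REQUIRED_SECTIONS = ['Executive Summary', 'Prioritized Findings', 'MITRE ATT&CK Context'];

// Matches technique IDs such as T1110 or T1059.001.
const TECHNIQUE_ID = /\bT\d{4}(?:\.\d{3})?\b/g;

function checkReport(text: string, allowedTechniques: string[]): ReportCheck {
  const problems: string[] = [];

  for (const section of REQUIRED_SECTIONS) {
    if (!text.includes(section)) problems.push(`Missing section: ${section}`);
  }

  // Any technique ID the rules did not supply is treated as invented.
  // The match is exact, so a sub-technique is rejected unless listed itself.
  const allowed = new Set(allowedTechniques);
  const mentioned = new Set(text.match(TECHNIQUE_ID) ?? []);
  mentioned.forEach(id => {
    if (!allowed.has(id)) problems.push(`Unexpected technique ID: ${id}`);
  });

  return { ok: problems.length === 0, problems };
}
```

A report that fails the check can be regenerated or replaced with a template rendering of the findings; either way the analyst never sees a technique ID that no rule produced.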

Production Bundle

Action Checklist

Define a canonical event interface that abstracts provider-specific log formats
Implement pure-function detection rules that return evidence arrays
Build a deterministic scoring algorithm with configurable severity weights
Create a strict prompt template that forbids logical modifications or hallucinations
Unit-test all rules against synthetic event batches covering edge cases
Abstract the AI client behind an interface to enable provider swapping
Log all rule evaluations and AI responses for audit trail reconstruction
Implement schema validation on AI output before rendering to analysts
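For the audit-trail item, a record per rule evaluation might look like the sketch below. Everything here is an assumption for illustration: the `AuditRecord` fields, the `buildAuditRecord` helper, and the use of a 32-bit FNV-1a fingerprint, which is chosen only to keep the example dependency-free. An audit system that must resist tampering would use a cryptographic hash such as SHA-256 instead.

```typescript
// Non-cryptographic FNV-1a (32-bit), applied to UTF-16 code units.
// Good enough to detect that two inputs differ; not tamper-resistant.
function fingerprint(input: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, kept as uint32
  }
  return hash.toString(16).padStart(8, '0');
}

interface AuditRecord {
  ruleId: string;
  ruleVersion: string;      // hypothetical: for example the Git commit of the rule file
  evaluatedAt: string;      // ISO 8601
  eventCount: number;
  inputFingerprint: string;
  evidence: string[];
}

// The clock is passed in so the function stays pure and testable.
function buildAuditRecord(
  ruleId: string,
  ruleVersion: string,
  serializedEvents: string[],
  evidence: string[],
  now: Date
): AuditRecord {
  return {
    ruleId,
    ruleVersion,
    evaluatedAt: now.toISOString(),
    eventCount: serializedEvents.length,
    inputFingerprint: fingerprint(serializedEvents.join('\n')),
    evidence,
  };
}

const auditRecord = buildAuditRecord(
  'auth-brute-force-success-chain',
  'abc1234', // placeholder version string
  ['{"eventId":"auth-failure"}'],
  [],
  new Date('2024-05-01T10:00:00Z')
);
```

Storing the rule version and an input fingerprint together lets a reviewer confirm that replaying the same batch through the same rule revision reproduces the recorded evidence.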

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
High-volume triage (>10k events/day)	Deterministic rules + lightweight AI summarization	LLMs become cost-prohibitive at scale; deterministic engines handle throughput efficiently	Low inference cost, high compute for normalization
Compliance-heavy environments	Code-reviewed rules with hardcoded MITRE IDs	Auditors require reproducible findings and exact technique mapping	Moderate engineering overhead, reduced compliance risk
Rapid prototyping / PoC	AI-assisted detection with strict output schemas	Faster iteration, but requires heavy prompt engineering and validation	High token cost, inconsistent reliability
Cost-constrained deployments	Local deterministic engine + open-source LLM	Eliminates API fees while maintaining auditability	Higher infrastructure cost, lower operational expense

Configuration Template

// triage.config.ts
import { DetectionRule } from './types';
import { consecutiveFailedLoginsFollowedBySuccess } from './rules/auth-chain';
import { encodedPowerShellExecution } from './rules/process-injection';
import { logClearingActivity } from './rules/defense-evasion';

export const triageConfig = {
  scoring: {
    severityWeights: { low: 10, medium: 25, high: 50, critical: 80 },
    maxScore: 100,
    volumeDecayFactor: 0.15
  },
  rules: [
    consecutiveFailedLoginsFollowedBySuccess,
    encodedPowerShellExecution,
    logClearingActivity
  ] as DetectionRule[],
  ai: {
    provider: 'claude', // 'openai' | 'claude' | 'ollama'
    model: 'claude-3-5-sonnet-20241022',
    temperature: 0.2,
    maxTokens: 1024,
    outputSchema: {
      executiveSummary: 'string',
      prioritizedFindings: 'array',
      mitreContext: 'string'
    }
  },
  normalization: {
    supportedProviders: ['windows-security', 'sysmon', 'linux-auth', 'generic-json'],
    timestampFormat: 'ISO_8601',
    ipValidation: true
  }
};

Quick Start Guide

1. Initialize the project structure: Create a TypeScript workspace with strict mode enabled. Define the canonical event interface and rule type signatures.
2. Load and normalize sample telemetry: Import a test log batch (Windows 4688, Sysmon 1, or Linux auth.log). Run it through the normalization adapter to produce NormalizedEvent[].
3. Execute the rule engine: Pass the normalized events to the configured rule registry. Collect evidence arrays, map MITRE techniques, and calculate the deterministic risk score.
4. Generate the triage report: Feed the structured findings into the AI prompt builder. Submit the request to your configured provider. Validate the response against the output schema and render the analyst summary.
