
AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment

By Codcompass Team · 8 min read

Clinical Triage in LLMs: Measuring Acuity Accuracy and Uncertainty Alignment

Current Situation Analysis

Deploying large language models (LLMs) in healthcare interfaces requires more than general medical knowledge; it demands precise risk stratification. When a user describes symptoms, the model must determine the appropriate urgency of care. This capability, known as acuity identification, is distinct from medical question answering. A model can correctly diagnose a condition in a QA setting but fail catastrophically by recommending home monitoring for a condition requiring immediate emergency intervention.

The industry faces a critical evaluation gap. Existing health benchmarks are fragmented: some focus on narrow workflow-specific triage, others on broad health interactions, and many on standard medical QA. None provide a unified framework to assess how well models identify urgency across diverse interaction modes. This oversight leads to a false sense of security. Teams often rely on QA scores as a proxy for safety, ignoring that triage involves risk management and uncertainty handling that QA tasks do not stress.

Recent analysis via the AcuityBench framework highlights the scale of this challenge. By harmonizing five public datasets—spanning user conversations, online forums, clinical vignettes, and patient portal messages—into a shared four-level acuity schema, researchers evaluated 914 cases across 12 frontier models. The benchmark reveals that acuity identification is a distinct safety-critical capability. The dataset includes 697 consensus cases for standard accuracy and 217 physician-confirmed ambiguous cases designed to test uncertainty alignment. The findings expose substantial variation in error direction and a systematic mismatch between model confidence and clinical reality.
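One way to picture the harmonized dataset is as a small tagged union. The sketch below is illustrative only: the type names, field names, source labels, and the numeric ordering of levels (1 = least urgent, 4 = most urgent) are assumptions made for this example, not the benchmark's published schema. It only shows how consensus cases (one agreed label) and ambiguous cases (several physician labels kept unmerged) could live in one structure.

```typescript
// Hypothetical data model; none of these identifiers come from AcuityBench itself.
// Assumption: the four levels are ordinal, with 1 = least urgent and 4 = most urgent.
type AcuityLevel = 1 | 2 | 3 | 4;

// Assumed labels for the interaction modes the article lists.
type SourceKind =
  | "user_conversation"
  | "online_forum"
  | "clinical_vignette"
  | "patient_portal";

interface ConsensusCase {
  kind: "consensus";
  id: string;
  source: SourceKind;
  text: string;
  label: AcuityLevel; // single agreed level, used for standard accuracy
}

interface AmbiguousCase {
  kind: "ambiguous";
  id: string;
  source: SourceKind;
  text: string;
  physicianLabels: AcuityLevel[]; // one entry per physician, kept unmerged
}

type AcuityCase = ConsensusCase | AmbiguousCase;

// Split a mixed list into the two evaluation subsets.
function splitCases(cases: AcuityCase[]): {
  consensus: ConsensusCase[];
  ambiguous: AmbiguousCase[];
} {
  const consensus: ConsensusCase[] = [];
  const ambiguous: AmbiguousCase[] = [];
  for (const c of cases) {
    if (c.kind === "consensus") consensus.push(c);
    else ambiguous.push(c);
  }
  return { consensus, ambiguous };
}
```

Keeping the physician labels as a list, rather than collapsing them to a majority vote, is what makes a later comparison against the distribution of physician judgments possible.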

WOW Moment: Key Findings

The most significant insight from the benchmark data is the format-dependent risk tradeoff: the same model shows a different safety profile depending on whether it is evaluated through structured QA or free-form chat. Furthermore, models exhibit a dangerous calibration error in ambiguous scenarios: they are significantly more confident than expert physicians, concentrating predictions where clinical judgment admits uncertainty.

| Evaluation Format | Over-Triage Rate | Under-Triage Rate | Uncertainty Alignment |
| --- | --- | --- | --- |
| Structured QA | High | Low | Poor (Over-confident) |
| Free-Form Chat | Low | High | Poor (Over-confident) |
| Physician Baseline | Moderate | Low | High (Calibrated Uncertainty) |

Why this matters:

  1. The Tradeoff: Structured QA formats force models to select a single acuity level, which suppresses under-triage but inflates over-triage. Conversational responses allow models to hedge or provide nuanced advice, reducing over-triage but increasing the risk of under-triage, particularly in high-acuity cases. Relying on a single format gives an incomplete safety picture.
  2. Uncertainty Mismatch: In the 217 ambiguous cases, no model matched the distribution of physician judgments. Physicians often express uncertainty or recommend further evaluation when cases are borderline. Models, however, produce concentrated predictions, effectively "guessing" with high confidence. This over-confidence in ambiguous cases is a primary driver of safety failures.
  3. Error Direction Variance: Across the 12 models tested, error patterns varied widely. Some models consistently over-triage, while others under-triage. This variance means safety cannot be assumed based on model size or general performance; it must be measured per model and per use case.
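The uncertainty mismatch in point 2 can be made measurable by comparing two distributions over the four acuity levels: the spread of physician labels for a case and the spread of a model's repeated predictions for the same case. The sketch below uses total variation distance as the comparison. That metric is an assumption chosen for simplicity, not necessarily the one AcuityBench reports, and the function names and the 1-to-4 level encoding are hypothetical.

```typescript
// Assumption: acuity levels are encoded 1..4; this encoding is illustrative.
type AcuityLevel = 1 | 2 | 3 | 4;

// Convert a list of labels into a probability distribution over levels 1..4.
function toDistribution(labels: AcuityLevel[]): number[] {
  const counts = [0, 0, 0, 0];
  for (const level of labels) counts[level - 1]++;
  const total = labels.length;
  return counts.map((c) => (total > 0 ? c / total : 0));
}

// Total variation distance: 0 means identical distributions, 1 means disjoint.
// Chosen here as a simple stand-in for an uncertainty-alignment score.
function totalVariation(p: number[], q: number[]): number {
  if (p.length !== q.length) throw new Error("distributions must have equal length");
  let sum = 0;
  for (let i = 0; i < p.length; i++) sum += Math.abs(p[i] - q[i]);
  return sum / 2;
}

// Physicians disagree on a borderline case; the model answers the same way every time.
const physicianLabels: AcuityLevel[] = [2, 3, 3, 4];
const modelSamples: AcuityLevel[] = [3, 3, 3, 3];

const mismatch = totalVariation(
  toDistribution(physicianLabels),
  toDistribution(modelSamples),
);
console.log(mismatch); // 0.5
```

In this toy case the model's modal answer agrees with the physicians' most common label, yet the distance is still 0.5, because the model places all of its probability on one level while the physicians spread theirs across three. A plain accuracy score would hide that kind of over-confidence.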

Core Solution

To address these risks, engineering teams must implement a Dual-Format Acuity Evaluation Pipeline that tests both structured classification and conversational responses, while explicitly measuring uncertainty calibration. The following TypeScript implementation demonstrates how to build t
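The source text is cut off at this point, so the original implementation is not available. As a stand-in, here is a minimal sketch of what a dual-format evaluation loop could look like. It is not the author's code: the `TriageModel` interface, the `ReplyGrader` function (which maps a free-form reply to an acuity level, for example via a rubric or human review), and the 1-to-4 level encoding are all hypothetical names and assumptions introduced for illustration.

```typescript
// Assumption: acuity levels are ordinal, 1 = least urgent, 4 = most urgent.
type AcuityLevel = 1 | 2 | 3 | 4;
type Format = "structured" | "freeform";

interface EvalCase {
  id: string;
  text: string;
  truth: AcuityLevel; // consensus label
}

// Hypothetical client: one call forces a single level, the other returns prose.
interface TriageModel {
  classify(text: string): Promise<AcuityLevel>;
  chat(text: string): Promise<string>;
}

// Hypothetical grader that maps a free-form reply onto the same four levels.
type ReplyGrader = (reply: string) => Promise<AcuityLevel>;

interface FormatReport {
  n: number;
  overTriage: number; // share of cases predicted MORE urgent than truth
  underTriage: number; // share of cases predicted LESS urgent than truth
}

// Summarize [truth, predicted] pairs into directional error rates.
function summarize(pairs: Array<[AcuityLevel, AcuityLevel]>): FormatReport {
  let over = 0;
  let under = 0;
  for (const [truth, predicted] of pairs) {
    if (predicted > truth) over++;
    else if (predicted < truth) under++;
  }
  const n = pairs.length;
  return {
    n,
    overTriage: n > 0 ? over / n : 0,
    underTriage: n > 0 ? under / n : 0,
  };
}

// Run every case through both formats and report each format separately.
async function evaluateDualFormat(
  model: TriageModel,
  grade: ReplyGrader,
  cases: EvalCase[],
): Promise<Record<Format, FormatReport>> {
  const structured: Array<[AcuityLevel, AcuityLevel]> = [];
  const freeform: Array<[AcuityLevel, AcuityLevel]> = [];
  for (const c of cases) {
    structured.push([c.truth, await model.classify(c.text)]);
    freeform.push([c.truth, await grade(await model.chat(c.text))]);
  }
  return { structured: summarize(structured), freeform: summarize(freeform) };
}
```

Reporting the two formats side by side, rather than averaging them, keeps the tradeoff described above visible: a model can look safe in one column and risky in the other.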
