I evaluated my self-trained LLM: what 31% accuracy actually means
Benchmarking Small Medical LLMs: Interpreting Accuracy Metrics and Optimizing RAG Architectures
Current Situation Analysis
The medical AI landscape is saturated with projects that prioritize demonstration over rigorous validation. Teams frequently deploy models based on cherry-picked success cases, creating a false sense of reliability. This approach is particularly dangerous in clinical domains where hallucination or error can have severe consequences. The industry often overlooks the necessity of establishing statistical baselines and measuring performance against unseen data distributions.
A realistic evaluation reveals the true state of small-scale model training. When a 1.3 billion parameter model is trained for 1.5 hours on a free GPU using medical multiple-choice questions (MCQs), the resulting accuracy is 31.0%. While this figure may appear low in isolation, it must be contextualized against a random baseline: for a four-option MCQ format, random guessing yields 25.0% accuracy. The model exceeds this baseline by 6 percentage points, a signal that it has learned meaningful patterns rather than guessing at random. (On the 200-question test set used here, a 6-point lift is roughly two standard errors above chance, so the signal is real, though not overwhelming.)
However, this result also highlights the limitations of parameter-constrained models. GPT-4 achieves approximately 90.0% accuracy on the same benchmark, demonstrating a massive performance gap attributable to parameter scale and training data volume. The challenge for engineering teams is bridging this gap without incurring prohibitive compute costs, often requiring architectural interventions like Retrieval-Augmented Generation (RAG) and re-ranking rather than raw model scaling.
WOW Moment: Key Findings
The following comparison illustrates the performance landscape across different strategies. The data underscores that while small models show learning signals, production viability requires architectural enhancements or significant scale increases.
| Strategy | Accuracy | Compute Cost | Clinical Viability | Key Insight |
|---|---|---|---|---|
| Random Baseline | 25.0% | None | No | Statistical floor for 4-option MCQs. |
| 1.3B Fine-Tuned | 31.0% | Low (Free GPU) | Low | Beats random by 6 points; proves learning but insufficient for clinical use. |
| GPT-4 Zero-Shot | ~90.0% | High | High | State-of-the-art; requires massive parameters and data. |
| 1.3B + RAG + Re-rank | ~55-65% (Projected) | Medium | Medium | Retrieval compensates for parameter limits; re-ranking boosts precision. |
| 7B Fine-Tuned + RAG | ~70-80% (Projected) | High | High | Larger base models unlock reasoning capabilities when augmented. |
Why this matters: The 31.0% result is not a failure; it is a diagnostic metric. It indicates that the model architecture and training procedure are functional but constrained by capacity. The delta between 31.0% and the 90.0% ceiling suggests that improvements should focus on external knowledge injection (RAG) and retrieval quality rather than solely increasing training epochs on the small model.
Core Solution
Building a reliable medical AI system requires a modular architecture that separates evaluation, retrieval, and inference. This approach allows teams to iterate on components independently and swap models without disrupting the serving layer.
1. Structured Evaluation Pipeline
Evaluation must be automated and reproducible. The following TypeScript implementation demonstrates a robust evaluation engine that calculates accuracy against a defined baseline and handles batch processing efficiently.
```typescript
interface Question {
  id: string;
  text: string;
  options: string[];
  correctIndex: number;
}

interface EvaluationResult {
  modelId: string;
  totalQuestions: number;
  correctPredictions: number;
  accuracy: number;
  randomBaseline: number;
  delta: number;
  timestamp: string;
}

// Any model wrapper (local weights, API client, ensemble) satisfies this.
interface Model {
  modelId: string;
  predict: (q: Question) => Promise<number>;
}

class EvaluationEngine {
  private randomBaseline: number;

  constructor(numOptions: number) {
    // Statistical floor for an MCQ with numOptions choices (0.25 for 4).
    this.randomBaseline = 1 / numOptions;
  }

  async runBenchmark(testSet: Question[], model: Model): Promise<EvaluationResult> {
    let correctCount = 0;
    for (const question of testSet) {
      const predictedIndex = await model.predict(question);
      if (predictedIndex === question.correctIndex) {
        correctCount++;
      }
    }
    const accuracy = correctCount / testSet.length;
    const delta = accuracy - this.randomBaseline;
    return {
      // Explicit id: model.constructor.name would report "Object" for literals.
      modelId: model.modelId,
      totalQuestions: testSet.length,
      correctPredictions: correctCount,
      accuracy: parseFloat(accuracy.toFixed(4)),
      randomBaseline: parseFloat(this.randomBaseline.toFixed(4)),
      delta: parseFloat(delta.toFixed(4)),
      timestamp: new Date().toISOString(),
    };
  }
}
```
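For orientation, here is a minimal usage sketch. The stub model and the single sample question are hypothetical placeholders, not the article's actual test harness; in practice `predict` would call your inference server and parse the chosen option index.

```typescript
// Hypothetical smoke test for the engine; replace the stub predict()
// with a real inference call against your serving layer.
const sampleSet: Question[] = [
  {
    id: "q1",
    text: "Which vitamin deficiency causes scurvy?",
    options: ["A", "B12", "C", "D"],
    correctIndex: 2,
  },
];

const stubModel = {
  modelId: "1.3b-medical-finetune", // placeholder identifier
  predict: async (_q: Question) => 2, // stub: always picks option index 2
};

new EvaluationEngine(4)
  .runBenchmark(sampleSet, stubModel)
  .then((r) =>
    console.log(`accuracy=${r.accuracy} baseline=${r.randomBaseline} delta=${r.delta}`)
  );
```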
Architecture Decisions:
- Interface-Based Model Injection: The `model` parameter accepts any object satisfying the `Model` interface (a `modelId` plus a `predict` method). This decouples the evaluation logic from the model implementation, enabling seamless swapping between local models, API wrappers, and ensemble strategies.
- Delta Calculation: Explicitly computing the delta against the random baseline provides immediate context for accuracy metrics, preventing misinterpretation of raw scores.
- Type Safety: Strict typing for `Question` and `EvaluationResult` ensures data integrity throughout the pipeline.
2. Retrieval-Augmented Generation with Re-Ranking
To improve accuracy beyond the model's intrinsic capabilities, integrate a RAG pipeline. The source analysis indicates that a 1.3B model can benefit significantly from high-quality external knowledge.
- Knowledge Base Curation: Replace generic text chunks with structured medical facts from authoritative sources like PubMed or clinical guidelines. Clean, domain-specific data improves retrieval relevance.
- Cross-Encoder Re-Ranking: Standard retrieval often relies on bi-encoder cosine similarity, which can miss nuanced relevance. Implement a cross-encoder model to re-rank the top-k retrieved documents. The cross-encoder processes the query and document jointly, capturing deeper semantic interactions and filtering out false positives before prompt injection (a sketch of the flow follows the rationale below).
Rationale: Re-ranking addresses the "noise" in retrieval results. By ensuring only the most relevant context reaches the model, you reduce hallucination and improve the model's ability to select the correct answer, effectively boosting accuracy without retraining the base model.
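To make the retrieve-then-re-rank flow concrete, here is a minimal sketch. The in-memory corpus and the token-overlap scorers are crude stand-ins for a real bi-encoder index and cross-encoder model, so only the orchestration pattern carries over.

```typescript
interface RetrievedDoc {
  id: string;
  text: string;
  score: number;
}

// Tiny in-memory corpus standing in for a real vector index.
const corpus = [
  { id: "d1", text: "Scurvy is caused by vitamin C deficiency." },
  { id: "d2", text: "Vitamin D regulates calcium absorption." },
];

// Stand-in "bi-encoder" retrieval: naive token overlap instead of cosine
// similarity over embeddings. Swap in your real vector search here.
async function biEncoderSearch(query: string, topK: number): Promise<RetrievedDoc[]> {
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  return corpus
    .map((d) => ({
      ...d,
      score: terms.filter((t) => d.text.toLowerCase().includes(t)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Stand-in cross-encoder: a real one scores the query and document jointly.
async function crossEncoderScore(query: string, doc: string): Promise<number> {
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  return terms.filter((t) => doc.toLowerCase().includes(t)).length / (terms.length || 1);
}

// Retrieve broadly with the cheap retriever, then re-rank the candidates
// jointly and keep only the best few for prompt injection.
async function retrieveWithReRank(
  query: string,
  retrieveK = 5, // matches retrieval.top_k in the config template below
  keepK = 3      // matches re_ranker.top_k
): Promise<RetrievedDoc[]> {
  const candidates = await biEncoderSearch(query, retrieveK);
  const scored = await Promise.all(
    candidates.map(async (d) => ({ d, s: await crossEncoderScore(query, d.text) }))
  );
  return scored.sort((a, b) => b.s - a.s).slice(0, keepK).map((x) => x.d);
}
```

The design point is that the expensive joint scoring runs only over the handful of bi-encoder candidates, keeping latency bounded while improving precision in the context window.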
3. Modular Serving Layer
The architecture should support model scaling without infrastructure changes. As noted in the analysis, swapping a 1.3B model for a Mistral 7B or LLaMA 3 8B should not require modifications to the API or UI.
- Abstraction Layer: Define a standard inference interface that all models must implement.
- Configuration-Driven Deployment: Use environment variables or configuration files to specify the active model. This allows A/B testing and gradual rollouts of larger models as compute budgets permit; both ideas are sketched below.
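One way such an abstraction could look (a sketch under assumptions: the `MODEL_ID` environment variable, the registry entries, and the stub adapters are all illustrative, not the article's deployment code):

```typescript
// Standard inference contract every model adapter must satisfy.
interface InferenceModel {
  modelId: string;
  generate(prompt: string): Promise<string>;
}

// Illustrative registry: adapters for a local 1.3B model and a larger
// hosted model, both hidden behind the same interface.
const registry: Record<string, () => InferenceModel> = {
  "local-1.3b": () => ({
    modelId: "local-1.3b",
    generate: async (prompt) => `stub local completion for: ${prompt}`,
  }),
  "mistral-7b": () => ({
    modelId: "mistral-7b",
    generate: async (prompt) => `stub 7B completion for: ${prompt}`,
  }),
};

// Configuration-driven selection: swapping models is a config change,
// not a code change. MODEL_ID is an assumed environment variable name.
function loadActiveModel(): InferenceModel {
  const id = process.env.MODEL_ID ?? "local-1.3b";
  const factory = registry[id];
  if (!factory) throw new Error(`Unknown model id: ${id}`);
  return factory();
}
```

With this pattern, moving from the 1.3B model to Mistral 7B is a one-line configuration change, which is exactly the swap-without-infrastructure-change property this section calls for.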
Pitfall Guide
Data Leakage in Test Sets
- Explanation: Using questions from the training distribution in the evaluation set inflates accuracy metrics and masks overfitting.
- Fix: Maintain a strictly disjoint test set. The source used 200 questions from a pool of 1,273 that were never used in training. Implement hash-based splitting and audit logs to verify separation; a hash-based split is sketched below.
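A deterministic hash-based split might look like this. The 16% test fraction is chosen only to roughly mirror the source's 200-of-1,273 split; everything else is an illustrative sketch using Node's built-in crypto module.

```typescript
import { createHash } from "node:crypto";

// Deterministically assign each question to train or test by hashing its
// stable id: the same question always lands in the same split across
// reruns, so train and test can never overlap.
function splitByHash<T extends { id: string }>(
  items: T[],
  testFraction = 0.16 // ~200 of 1,273, roughly matching the source's split
): { train: T[]; test: T[] } {
  const train: T[] = [];
  const test: T[] = [];
  for (const item of items) {
    const digest = createHash("sha256").update(item.id).digest();
    // Map the first 4 hash bytes to a number in [0, 1).
    const bucket = digest.readUInt32BE(0) / 0xffffffff;
    (bucket < testFraction ? test : train).push(item);
  }
  return { train, test };
}
```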
Ignoring the Random Baseline
- Explanation: Reporting accuracy without a baseline can lead to false conclusions. A 31% accuracy might seem poor, but it is significantly better than the 25% random baseline.
- Fix: Always calculate and report the random baseline for MCQ formats. Use the delta metric to assess true model improvement.
Retrieval Quality Neglect
- Explanation: RAG performance is bottlenecked by retrieval quality. Poor chunking or irrelevant sources degrade model output regardless of model size.
- Fix: Optimize chunking strategies, use metadata filtering, and source data from high-quality repositories like PubMed. Validate retrieval relevance independently; a simple chunking sketch follows below.
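On the chunking side, a minimal sketch; the 512-character window and 64-character overlap are arbitrary starting points to tune against your own retrieval metrics, not values from the source.

```typescript
interface Chunk {
  text: string;
  source: string;    // e.g., "pubmed" or "clinical_guidelines"
  section?: string;  // optional metadata to enable filtering at query time
}

// Fixed-size chunking with overlap so facts spanning a boundary are not
// lost. Attaching source metadata up front enables filtering later.
function chunkDocument(
  text: string,
  source: string,
  chunkSize = 512,
  overlap = 64
): Chunk[] {
  const chunks: Chunk[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push({ text: text.slice(start, start + chunkSize), source });
  }
  return chunks;
}
```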
Skipping Re-Ranking
- Explanation: Relying solely on initial retrieval scores can introduce noise. Bi-encoders may return semantically similar but factually incorrect documents.
- Fix: Integrate a cross-encoder re-ranker to score query-document pairs jointly. This step significantly improves precision in the context window.
Scale Fallacy
- Explanation: Assuming a small model can match large model performance with identical training data ignores parameter limitations.
- Fix: Acknowledge capacity constraints. Use RAG and re-ranking to compensate for smaller models, or upgrade to larger base models like Mistral 7B or LLaMA 3 8B when accuracy requirements demand it.
Cherry-Picked Evaluation
- Explanation: Testing only on easy or familiar examples creates bias and overestimates real-world performance.
- Fix: Evaluate on a diverse, representative sample of unseen questions. Ensure the test set covers the full range of difficulty and topics.
Static Architecture
- Explanation: Hardcoding model references prevents scaling and experimentation.
- Fix: Design a modular architecture where models are swappable via configuration. This supports iterative improvement and cost optimization.
Production Bundle
Action Checklist
- Define a strictly disjoint test set from the training distribution.
- Calculate the random baseline accuracy for the MCQ format.
- Implement an automated evaluation pipeline with delta reporting.
- Audit retrieval sources for quality and relevance.
- Integrate a cross-encoder re-ranker into the RAG pipeline.
- Profile latency and cost for different model sizes.
- Design a modular serving layer for model swapping.
- Document limitations and accuracy metrics transparently.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Prototype / Free Tier | 1.3B Model + Basic RAG | Fast iteration, low compute cost, validates pipeline. | $0 |
| Internal Tool / Low Risk | 1.3B + RAG + Re-rank | Improved accuracy via retrieval; cost-effective. | Low |
| Production Clinical | 7B Model + RAG + Re-rank | Higher accuracy required; re-ranking ensures precision. | Medium |
| Enterprise / High Stakes | GPT-4 / LLaMA 3 70B + RAG | Maximum accuracy and reliability; compliance ready. | High |
Configuration Template
Use this YAML configuration to manage evaluation parameters and model settings.
```yaml
evaluation:
  test_set:
    source: "medical_mcq_pool"
    size: 200
    split_strategy: "disjoint"
  metrics:
    baseline_method: "random_guess"
    num_options: 4
    report_delta: true

model:
  id: "medmind-opt-medical"
  type: "local"
  parameters:
    max_tokens: 128
    temperature: 0.0

retrieval:
  enabled: true
  source: "pubmed_clinical_guidelines"
  top_k: 5
  re_ranker:
    enabled: true
    model: "cross-encoder-medical"
    top_k: 3
```
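At startup, the configuration can be loaded and sanity-checked before anything runs. This sketch assumes the js-yaml package and a config.yaml filename, neither of which the article mandates.

```typescript
import { readFileSync } from "node:fs";
import { load } from "js-yaml";

interface AppConfig {
  evaluation: {
    test_set: { source: string; size: number; split_strategy: string };
    metrics: { baseline_method: string; num_options: number; report_delta: boolean };
  };
  model: {
    id: string;
    type: string;
    parameters: { max_tokens: number; temperature: number };
  };
  retrieval: {
    enabled: boolean;
    source: string;
    top_k: number;
    re_ranker: { enabled: boolean; model: string; top_k: number };
  };
}

// Load the YAML and fail fast on obviously inconsistent settings.
function loadConfig(path = "config.yaml"): AppConfig {
  const cfg = load(readFileSync(path, "utf8")) as AppConfig;
  if (cfg.evaluation.metrics.num_options < 2) {
    throw new Error("num_options must be at least 2 for an MCQ baseline");
  }
  if (cfg.retrieval.re_ranker.top_k > cfg.retrieval.top_k) {
    throw new Error("re_ranker.top_k cannot exceed retrieval.top_k");
  }
  return cfg;
}
```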
Quick Start Guide
1. Setup Environment: Install dependencies and configure the evaluation engine using the provided TypeScript interfaces.
2. Load Test Data: Import the disjoint test set and verify separation from training data.
3. Run Benchmark: Execute the evaluation pipeline to generate accuracy metrics and delta reports.
4. Optimize Pipeline: Integrate RAG and re-ranking based on evaluation results.
5. Deploy: Configure the modular serving layer and swap models as needed via configuration.