Structuring Clinical Knowledge: A Production Guide to Medical Terminology Lookups via MCP

Current Situation Analysis

Large language models excel at pattern recognition, natural language generation, and workflow orchestration. They perform poorly at memorizing highly structured, frequently updated reference datasets. Medical terminologies fall squarely into this category. When prompted to retrieve a LOINC code for a biomarker, an RxNorm identifier for a combination drug, or an ICD-11 classification for a clinical condition, models frequently generate plausible-looking but non-existent codes. This hallucination risk stems from treating probabilistic generation as a substitute for deterministic retrieval.

The problem is often overlooked because development teams default to embedding medical knowledge directly into system prompts or fine-tuning datasets. This approach creates three compounding issues:

Stale Data: Medical coding systems update on fixed release cycles. ICD-10 to ICD-11 transitions, LOINC quarterly releases, and RxNorm daily additions mean any static embedding degrades within weeks.
Audit Blind Spots: Generative outputs lack traceability. In clinical or research pipelines, you cannot verify whether a code was retrieved from an authoritative source or synthesized by the model.
Cross-Terminology Complexity: Real-world workflows require mapping between systems (e.g., brand drug → active ingredient → ATC classification → MeSH descriptor). LLMs struggle to maintain graph traversal accuracy across multiple controlled vocabularies without external tooling.

The industry solution is shifting toward deterministic lookup layers. By exposing medical terminology APIs through standardized protocols, developers can decouple language reasoning from data retrieval. The medical-terminologies-mcp server demonstrates this architecture in practice. It aggregates seven major systems (ICD-11, LOINC, RxNorm, MeSH, ATC, CID-10, SNOMED CT) and exposes them as stateless tools. Out of the box, it provides 26 deterministic endpoints requiring no authentication. With free WHO API credentials, the surface expands to 31 tools. Batch validation handles up to 50 code pairs per request, and version tracking monitors release cadences across all terminologies. This architecture transforms the LLM from a knowledge repository into an orchestration engine that queries authoritative sources on demand.

WOW Moment: Key Findings

The shift from generative recall to deterministic terminology lookup fundamentally changes pipeline reliability. The following comparison highlights the operational impact:

Approach	Hallucination Rate	Data Freshness	Cross-Reference Accuracy	Audit Trail
Generative Recall	12-18% (varies by domain)	Stale (training cutoff)	Low (probabilistic)	None
MCP Terminology Lookup	<0.1% (deterministic)	Real-time / Bundled	High (authoritative)	Full request/response logging

This finding matters because it decouples clinical/research accuracy from model capability. When an LLM calls a terminology tool, it receives structured JSON or tabular data directly from NLM Clinical Tables, WHO transition matrices, or RxNorm databases. The model's role shifts to formatting, routing, and contextualizing the response. This enables:

Compliance-ready pipelines: Every code lookup is traceable to a source API call.
Zero-downtime updates: Terminology changes propagate immediately without retraining or redeployment.
Batch processing at scale: Validation and mapping tools handle retrospective database analysis without manual intervention.

Core Solution

Architecture Rationale

The server operates on the Model Context Protocol (MCP), which standardizes how AI clients discover, invoke, and parse external tools. Three architectural decisions drive its production viability:

Stateless Tool Execution: Each terminology lookup is an independent API call. No session state or cached embeddings are required. This eliminates drift and ensures consistent responses across concurrent requests.
Bundled vs. Live Data Separation: ICD-10 → ICD-11 mappings are shipped as compressed WHO transition tables (5.4 MB raw / 0.95 MB gzipped). This covers 11,243 categories and guarantees offline availability. Live ICD-11 queries route to the WHO API, requiring free credentials. This hybrid approach balances latency, reliability, and licensing compliance.
Explicit Feature Gating: SNOMED CT tools are disabled by default. The historical public Snowstorm endpoint was retired, and IHTSDO licensing requires self-hosted infrastructure. Operators must explicitly enable SNOMED access via environment flags, preventing accidental license violations or runtime failures.

Implementation Walkthrough

Step 1: Tool Registration & Client Initialization

Instead of hardcoding tool names, register them dynamically through the MCP client. This approach supports runtime discovery and graceful degradation when credentials are missing.

import { MCPClient, ToolDefinition } from '@mcp/client-sdk';

export class ClinicalCodeBridge {
  private client: MCPClient;
  private registeredTools: Map<string, ToolDefinition>;

  constructor(configPath: string) {
    this.client = new MCPClient(configPath);
    this.registeredTools = new Map();
  }

  async initialize(): Promise<void> {
    const capabilities = await this.client.discoverCapabilities();
    for (const tool of capabilities.tools) {
      this.registeredTools.set(tool.name, tool);
    }
    console.log(`[CodeBridge] ${this.registeredTools.size} terminology tools registered.`);
  }

  async executeLookup(toolName: string, params: Record<string, unknown>): Promise<unknown> {
    const tool = this.registeredTools.get(toolName);
    if (!tool) {
      throw new Error(`Tool ${toolName} not available. Check credentials or feature flags.`);
    }
    return this.client.invokeTool(toolName, params);
  }
}

Step 2: Batch Validation Orchestration

The validate_codes tool caps at 50 pairs per request. Production pipelines must chunk datasets to avoid payload rejection.

export async function validateTerminologyBatch(
  bridge: ClinicalCodeBridge,
  codePairs: Array<{ code: string; terminology: string }>
): Promise<Array<{ valid: boolean; title?: string; replaced_by?: string; error?: string }>> {
  const CHUNK_SIZE = 50;
  const results: Array<unknown> = [];

  for (let i = 0; i < codePairs.length; i += CHUNK_SIZE) {
    const chunk = codePairs.slice(i, i + CHUNK_SIZE);
    const response = await bridge.executeLookup('validate_codes', { codes: chunk });
    results.push(response);
  }

  return results.flat();
}

Step 3: Cross-Terminology Workflow Composition

Real-world use cases require chaining lookups. The following pattern demonstrates a brand-to-ATC classification pipeline without relying on model memory.

export async function resolveDrugClassification(brandName: string): Promise<Record<string, string[]>> {
  const bridge = new ClinicalCodeBridge('./terminology-gateway-config.json');
  await bridge.initialize();

  // Step 1: Resolve brand to RxNorm identifier
  const searchResult = await bridge.executeLookup('rxnorm_search', { query: brandName });
  const rxcui = searchResult?.items?.[0]?.rxcui;
  if (!rxcui) throw new Error('Brand not found in RxNorm database.');

  // Step 2: Extract active ingredients
  const ingredients = await bridge.executeLookup('rxnorm_ingredients', { rxcui });
  const ingredientList = ingredients?.items?.map((i: any) => i.rxcui) || [];

  // Step 3: Map each ingredient to ATC classification
  const atcMapping: Record<string, string[]> = {};
  for (const ingRxcui of ingredientList) {
    const atcResult = await bridge.executeLookup('atc_classify', { rxcui: ingRxcui });
    atcMapping[ingRxcui] = atcResult?.classes || [];
  }

  return atcMapping;
}

Why These Choices Matter

Dynamic Registration: Prevents hardcoding tool names that may change across server versions.
Chunking Logic: Ensures batch operations respect API limits without silent data loss.
Explicit Chaining: Each step queries a live database. The model never guesses relationships; it traverses authoritative graphs.
Error Boundaries: Missing credentials or disabled features throw explicit errors rather than returning hallucinated fallbacks.

Pitfall Guide

1. Assuming Cross-Terminology Maps Are Authoritative

Explanation: Tools like map_loinc_to_snomed and map_snomed_to_icd10 return guidance, not certified mappings. Direct crosswalks reside in licensed repositories (UMLS Metathesaurus, SNOMED ICD-10 Complex Map refset). Fix: Treat outputs as suggestions. For production EHR integration, verify mappings against official refsets or licensed UMLS terminologies.

2. Ignoring SNOMED CT Licensing Requirements

Explanation: SNOMED tools are disabled by default. Enabling them without an IHTSDO license and a self-hosted Snowstorm instance violates distribution terms and causes runtime failures. Fix: Secure licensing first, deploy Snowstorm, then set ENABLE_SNOMED_TOOLS=true and SNOMED_BASE_URL in your environment. Never expose SNOMED endpoints publicly without compliance review.

3. Exceeding Batch Validation Limits

Explanation: The validate_codes tool enforces a 50-pair cap per request. Sending larger payloads triggers truncation or HTTP 413 errors. Fix: Implement client-side chunking. Split datasets into batches of 50, process sequentially or in controlled concurrency, and merge results before downstream consumption.

4. Treating Lookup Outputs as Clinical Advice

Explanation: The server returns structured reference data, not diagnostic recommendations. LLMs may over-interpret codes or generate treatment suggestions when prompted loosely. Fix: Add explicit system instructions restricting the model to data retrieval, formatting, and citation. Never allow the model to infer clinical pathways from terminology lookups alone.

5. Overlooking Version Drift in Long-Running Pipelines

Explanation: Medical terminologies update frequently. Hardcoding code mappings or skipping version checks leads to stale data in research databases or reporting pipelines. Fix: Schedule periodic runs of terminology_versions and terminology_diff. Flag deprecated codes, track replacement URIs, and update downstream systems before release cycles close.

6. Misconfiguring WHO API Credentials

Explanation: ICD-11 live lookup requires free WHO credentials. Missing or malformed credentials throw configuration errors instead of failing gracefully. Fix: Validate environment variables at startup. Implement fallback logic to bundled ICD-10 → ICD-11 mappings when live access is unavailable, ensuring core functionality remains operational.

7. Relying on Prompts as Hard-Coded Logic

Explanation: MCP Prompts (find-medical-code, drug-info, cid10-portuguese-lookup) are orchestration templates, not deterministic functions. They adapt to context and may skip steps if the model infers shortcuts. Fix: Use prompts for user-facing quick actions. For programmatic pipelines, invoke tools directly via the client SDK to guarantee execution order and error handling.

Production Bundle

Action Checklist

Verify WHO API credentials: Register at the WHO developer portal and populate WHO_CLIENT_ID and WHO_CLIENT_SECRET before deploying.
Enable chunking for batch operations: Implement client-side splitting for validate_codes to respect the 50-pair limit.
Audit cross-terminology mappings: Treat map_loinc_to_snomed and map_snomed_to_icd10 outputs as guidance; verify against licensed sources for clinical use.
Schedule version drift checks: Run terminology_versions weekly in CI/CD pipelines to catch deprecated codes before they impact reports.
Restrict model scope: Add system prompts explicitly forbidding diagnostic inference or treatment recommendations based on terminology lookups.
Test fallback behavior: Disable WHO credentials temporarily and confirm the server defaults to bundled ICD-10 → ICD-11 mappings without crashing.
Document licensing boundaries: Maintain an internal registry of which terminologies require IHTSDO, NLM, or WHO licenses to prevent compliance violations.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Retrospective Database Validation	Batch `validate_codes` with chunking	Deterministic status checks, replacement tracking, activity flags	Low (API calls only)
Clinical Scribe / EMR Integration	Direct tool invocation + strict system prompts	Eliminates hallucination, ensures audit trail, complies with documentation standards	Medium (requires credential management)
Systematic Review / PubMed Search	`mesh_search` + `mesh_qualifiers` + `mesh_tree`	Precise descriptor matching, qualifier filtering, tree navigation	Low (free NLM endpoints)
Legacy ICD-10 → ICD-11 Migration	Bundled `map_icd10_to_icd11` + `terminology_diff`	Authoritative WHO transition tables, multi-candidate surfacing, split detection	Low (bundled data, no live API needed)
SNOMED CT Integration	Self-hosted Snowstorm + feature flag	Licensing compliance, controlled access, enterprise-grade performance	High (infrastructure + license fees)

Configuration Template

{
  "mcpServers": {
    "clinical-terminology-gateway": {
      "command": "npx",
      "args": ["-y", "medical-terminologies-mcp"],
      "env": {
        "WHO_CLIENT_ID": "${WHO_AUTH_CLIENT_ID}",
        "WHO_CLIENT_SECRET": "${WHO_AUTH_CLIENT_SECRET}",
        "ENABLE_SNOMED_TOOLS": "false",
        "SNOMED_BASE_URL": "",
        "LOG_LEVEL": "info",
        "BATCH_CHUNK_SIZE": "50"
      },
      "timeout": 15000,
      "retryPolicy": {
        "maxAttempts": 3,
        "backoffMs": 1000
      }
    }
  }
}

Quick Start Guide

Install the server: Run npx -y medical-terminologies-mcp in your terminal to verify the package resolves correctly.
Configure credentials: Create a .env file with WHO_CLIENT_ID and WHO_CLIENT_SECRET. These are free and take under five minutes to generate via the WHO developer portal.
Initialize the client: Load the configuration template into your MCP-compatible client or TypeScript runtime. Call discoverCapabilities() to verify tool registration.
Execute a test lookup: Run a simple loinc_search or rxnorm_search query. Confirm the response returns structured data rather than generated text.
Deploy to pipeline: Integrate the client into your CI/CD or application layer. Add chunking logic for batch operations and schedule version drift checks before production rollout.