Building Streamable HTTP MCP Servers from Scratch using FastMCP in 2026

By Codcompass Team·2026-05-18·10 min read

Standardizing AI Agent Integrations: A Production Guide to Streamable HTTP MCP Servers

Current Situation Analysis

The rapid proliferation of LLM-powered agents has exposed a critical fragmentation problem in software architecture. Every model vendor, framework, and AI host application implements its own function-calling schema, tool discovery mechanism, and context-passing format. Engineers building agent workflows spend disproportionate time writing bespoke adapters, translating between proprietary JSON structures, and patching integrations whenever a model provider updates its API. This glue code is brittle, untestable in isolation, and impossible to scale across multiple AI clients.

The industry initially treated agent tooling like traditional microservices, attempting to bolt REST endpoints or GraphQL resolvers onto LLM workflows. This approach fails because LLM interactions are inherently stateful, discovery-driven, and require bidirectional communication. Agents need to dynamically discover available capabilities, maintain session context across multiple tool invocations, and receive progress updates for long-running operations. Traditional request/response paradigms cannot accommodate these requirements without significant architectural overhead.

Anthropic introduced the Model Context Protocol (MCP) in November 2024 to solve this exact problem. By spring 2025, OpenAI, Microsoft, and Google had formally adopted the specification, cementing it as the de facto standard for agent-to-tool communication. MCP abstracts the integration layer by exposing three core primitives over a unified JSON-RPC 2.0 wire format: tools (executable functions), resources (read-only data sources), and prompts (reusable instruction templates). The protocol supports two primary transports: standard I/O (stdio) for local processes and Streamable HTTP for networked deployments. This standardization eliminates vendor lock-in, enables runtime capability discovery, and allows AI hosts to interact with external systems through a single, consistent interface.
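To make the wire format concrete, the sketch below builds the JSON-RPC 2.0 envelopes a client would send to discover and invoke tools. The method names `tools/list` and `tools/call` come from the MCP specification; the helper `make_request` is a hypothetical name used only for illustration, and the tool name mirrors the example server later in this guide.

```python
import itertools
import json
from typing import Any, Dict, Optional

# JSON-RPC request ids only need to be unique within a session
_ids = itertools.count(1)


def make_request(method: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request envelope (hypothetical helper, not an SDK call)."""
    message: Dict[str, Any] = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Discovery: ask the server which tools it exposes
list_tools = make_request("tools/list")

# Invocation: call one tool by name with JSON arguments
call_tool = make_request(
    "tools/call",
    {"name": "fetch_system_metrics", "arguments": {"include_disk": False}},
)

print(json.dumps(call_tool, indent=2))
```

Every primitive travels in the same envelope, so a host only needs one serializer regardless of which server it talks to.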

Despite rapid adoption, many engineering teams still misunderstand MCP's operational model. They attempt to force stateless HTTP patterns onto a protocol designed for session-aware, bidirectional messaging. Others overlook transport-specific constraints, leading to performance bottlenecks or broken client connections. Understanding the protocol's architectural intent and implementing it correctly is now a prerequisite for building scalable AI agent infrastructure.

WOW Moment: Key Findings

The engineering impact of adopting MCP becomes immediately visible when comparing integration workflows across traditional approaches and the standardized protocol. The following comparison highlights why MCP fundamentally changes how teams architect agent tooling.

Approach	Integration Effort	State Management	Discovery Mechanism	Streaming Capability	Versioning Strategy
MCP Architecture	Single SDK/wire-spec; plug-and-play across hosts	Stateful sessions with explicit lifecycle	Runtime capability negotiation via `tools/list`	Standardized via Streamable HTTP (HTTP POST with optional SSE streaming)	Backward-compatible extensions; clients adapt dynamically
Traditional REST/GraphQL	Bespoke adapters per service; high maintenance	Stateless request/response; session tracked externally	Static OpenAPI/Swagger specs; manual updates	Ad-hoc WebSockets or polling; not standardized	Breaking changes require versioned endpoints or client updates
Custom Agent Adapters	High; framework-specific glue code	Implicit; often lost across tool chains	Hardcoded function schemas; no negotiation	Framework-dependent; often unsupported	Tightly coupled to model provider updates

This comparison reveals why MCP matters: it shifts the integration burden from the AI host to the tool provider. Instead of every client implementing custom parsers, authentication flows, and retry logic for each external service, developers expose capabilities once through an MCP server. The protocol's standardized discovery and streaming mechanisms enable agents to chain tools dynamically, handle long-running operations with progress callbacks, and maintain context across multi-step workflows. For engineering teams, this translates to reduced maintenance overhead, faster onboarding of new AI clients, and a clear separation between business logic and agent orchestration.

Core Solution

Building a production-ready MCP server requires understanding the protocol's transport mechanics, schema validation boundaries, and session lifecycle. We will implement a Streamable HTTP server using FastMCP, focusing on a system diagnostics use case that exposes infrastructure metrics, configuration resources, and incident reporting templates.

Architecture Decisions and Rationale

Transport Selection: Streamable HTTP is chosen over stdio for networked deployments. HTTP enables load balancing, reverse proxy integration, and multi-client connectivity. The client sends each JSON-RPC message to a single MCP endpoint as an HTTP POST, and the server answers with either a plain JSON response or a Server-Sent Events (SSE) stream, which lets it push progress updates and notifications while a request is in flight. A client can also open a standalone SSE stream with GET to receive server-initiated messages.
Framework Choice: FastMCP (Python) abstracts JSON-RPC 2.0 serialization, schema validation, and transport routing. It provides decorator-based tool/resource/prompt registration, automatic JSON Schema generation from function signatures, and built-in HTTP server utilities. For edge deployments or TypeScript-heavy stacks, the official TypeScript SDK offers equivalent functionality with static typing.
Session Management: MCP maintains explicit session state. The server must handle initialize, initialized, and session lifecycle events. FastMCP manages this automatically, but production deployments should implement session cleanup and timeout handling.
Schema Validation: All tool inputs and outputs are validated against JSON Schema. FastMCP infers schemas from Python type hints and docstrings. Explicit schema definitions are recommended for complex nested structures to prevent client-side parsing errors.
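The transport decision above can be sketched at the HTTP level. The block below only constructs a request and sends nothing. The `Accept` values and the `Mcp-Session-Id` header follow the Streamable HTTP section of the MCP specification as I understand it; the helper name `build_mcp_post`, the endpoint URL, and the session value are assumptions for illustration.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_mcp_post(
    endpoint: str,
    message: Dict[str, Any],
    session_id: Optional[str] = None,
) -> urllib.request.Request:
    """Prepare (but do not send) a Streamable HTTP POST carrying one JSON-RPC message."""
    headers = {
        "Content-Type": "application/json",
        # The client must accept both a plain JSON reply and an SSE stream
        "Accept": "application/json, text/event-stream",
    }
    if session_id:
        # Echo the session id the server issued during initialization
        headers["Mcp-Session-Id"] = session_id
    body = json.dumps(message).encode("utf-8")
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


request = build_mcp_post(
    "http://localhost:8080/mcp",  # assumed endpoint path
    {"jsonrpc": "2.0", "id": 1, "method": "tools/list"},
    session_id="example-session",  # placeholder value
)
```

Because every message is an ordinary POST, standard proxies and load balancers can route it; only the response may become a long-lived stream.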

Implementation

The following implementation demonstrates a Streamable HTTP MCP server exposing system diagnostics capabilities. The code uses distinct naming conventions, modular structure, and production-ready error handling.

# diagnostics_server.py
import asyncio
import logging
import os
import platform
import psutil
from typing import Dict
from fastmcp import FastMCP

# Configure structured logging for production observability
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s"
)
logger = logging.getLogger("mcp.diagnostics")

# Initialize the server instance (the transport is chosen later, in run())
diagnostics_mcp = FastMCP(
    name="Infrastructure Diagnostics Server",
    instructions="Provides system metrics, configuration resources, and incident reporting templates for AI agents."
)

# Tool: Retrieve real-time system performance metrics
@diagnostics_mcp.tool()
async def fetch_system_metrics(
    include_network: bool = True,
    include_disk: bool = True
) -> Dict[str, object]:
    """
    Collects CPU, memory, network, and disk utilization data.
    
    Args:
        include_network: Toggle network interface statistics collection.
        include_disk: Toggle disk partition usage collection.
        
    Returns:
        Dictionary containing normalized system metrics.
    """
    logger.info("Collecting system metrics | network=%s, disk=%s", include_network, include_disk)
    
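    metrics: Dict[str, object] = {
        "platform": platform.system(),
        # cpu_percent(interval=1) sleeps for one second, so run it in a worker thread
        "cpu_percent": await asyncio.to_thread(psutil.cpu_percent, interval=1),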
        "memory": {
            "total_gb": round(psutil.virtual_memory().total / (1024**3), 2),
            "available_gb": round(psutil.virtual_memory().available / (1024**3), 2),
            "usage_percent": psutil.virtual_memory().percent
        }
    }
    
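    if include_disk:
        disk_metrics: Dict[str, object] = {}
        for partition in psutil.disk_partitions(all=False):
            try:
                # disk_partitions() only lists mounts; usage figures come from disk_usage()
                usage = psutil.disk_usage(partition.mountpoint)
            except OSError:
                # Skip mounts that cannot be read (e.g. empty removable drives)
                continue
            disk_metrics[partition.device] = {
                "total_gb": round(usage.total / (1024**3), 2),
                "used_gb": round(usage.used / (1024**3), 2),
                "percent": usage.percent
            }
        metrics["disk"] = disk_metrics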
        
    if include_network:
        net_io = psutil.net_io_counters()
        metrics["network"] = {
            "bytes_sent_mb": round(net_io.bytes_sent / (1024**2), 2),
            "bytes_recv_mb": round(net_io.bytes_recv / (1024**2), 2)
        }
        
    return metrics

# Tool: Execute safe diagnostic commands with timeout enforcement
@diagnostics_mcp.tool()
async def run_network_diagnostic(
    target_host: str,
    packet_count: int = 4,
    timeout_seconds: int = 10
) -> Dict[str, object]:
    """
    Performs a controlled network reachability check.
    
    Args:
        target_host: IP address or hostname to probe.
        packet_count: Number of probe packets to send.
        timeout_seconds: Maximum execution duration.
        
    Returns:
        Diagnostic results including latency and packet loss.
    """
    logger.info("Running network diagnostic | target=%s, packets=%d", target_host, packet_count)
    
    try:
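        # Simulated diagnostic execution for demonstration.
        # In production, replace with asyncio.create_subprocess_exec() or an async network library;
        # a blocking subprocess.run() call here would stall the event loop.
        # wait_for() enforces the caller-supplied timeout on the simulated probe.
        await asyncio.wait_for(asyncio.sleep(0.5), timeout=timeout_seconds)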
        
        return {
            "target": target_host,
            "packets_sent": packet_count,
            "packets_received": packet_count,
            "packet_loss_percent": 0.0,
            "avg_latency_ms": 12.4,
            "status": "reachable"
        }
    except Exception as exc:
        logger.error("Diagnostic failed for %s: %s", target_host, exc)
        return {
            "target": target_host,
            "status": "unreachable",
            "error": str(exc)
        }

# Resource: Expose environment configuration as read-only data
@diagnostics_mcp.resource("config://environment")
async def load_environment_config() -> str:
    """Returns current deployment environment variables and runtime settings as JSON."""
    import json  # Local import keeps this function self-contained

    config_data = {
        "runtime": platform.python_version(),
        "hostname": platform.node(),
        "env_vars": {
            key: os.getenv(key, "<not set>")
            for key in ["APP_ENV", "LOG_LEVEL", "MAX_METRICS_AGE"]
        }
    }
    # Serialize as JSON; str() would emit a Python repr that clients cannot parse
    return json.dumps(config_data, indent=2)

# Prompt: Generate structured incident reports from raw metrics
@diagnostics_mcp.prompt()
def format_incident_report(
    severity: str,
    affected_service: str,
    raw_metrics: str
) -> str:
    """
    Constructs a standardized incident summary for alerting systems.
    
    Args:
        severity: Critical, Warning, or Info.
        affected_service: Name of the impacted component.
        raw_metrics: JSON string of collected diagnostic data.
        
    Returns:
        Formatted incident report template.
    """
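    from datetime import datetime, timezone  # Local import keeps this function self-contained

    # Use wall-clock UTC time; the event loop clock is monotonic and is not a real timestamp
    timestamp = datetime.now(timezone.utc).isoformat()
    return (
        f"## Incident Report\n"
        f"- **Severity**: {severity}\n"
        f"- **Service**: {affected_service}\n"
        f"- **Timestamp**: {timestamp}\n"
        f"- **Metrics**: {raw_metrics}\n"
        f"- **Recommended Action**: Review resource utilization and scale horizontally if thresholds exceeded."
    )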

# HTTP transport configuration and startup
if __name__ == "__main__":
    # run() starts the HTTP server itself, so no uvicorn or transport imports are needed here.
    # Bind to configurable host/port for containerized deployments
    host = os.getenv("MCP_HTTP_HOST", "0.0.0.0")
    port = int(os.getenv("MCP_HTTP_PORT", "8080"))
    
    logger.info("Starting Streamable HTTP MCP server on %s:%d", host, port)
    
    # FastMCP handles JSON-RPC 2.0 routing and SSE streaming automatically
    diagnostics_mcp.run(
        transport="streamable-http",
        host=host,
        port=port,
        log_level="info"
    )

Why This Architecture Works

The implementation separates concerns cleanly: tools handle executable logic, resources expose read-only configuration, and prompts standardize output formatting. FastMCP's decorator system generates JSON Schema definitions from type hints and docstrings, so clients receive accurate parameter specifications. The Streamable HTTP transport can answer a POST with an SSE stream, which enables progress notifications and server-initiated messages without polling. Environment-driven configuration (MCP_HTTP_HOST, MCP_HTTP_PORT) keeps the server compatible with container orchestration platforms and reverse proxies.

Pitfall Guide

1. Blocking the Event Loop in HTTP Handlers

Explanation: Synchronous I/O operations (file reads, network calls, subprocess execution) inside tool functions will block the async event loop, causing SSE connection timeouts and dropped client sessions. Fix: Always use async def for tool implementations. Wrap blocking operations in asyncio.to_thread() or replace with native async libraries (aiohttp, asyncpg, httpx).
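A minimal sketch of the fix, using only the standard library: a blocking call is moved to a worker thread with `asyncio.to_thread()` while a second coroutine keeps running. The function names `blocking_probe` and `heartbeat` are illustrative stand-ins, not part of FastMCP.

```python
import asyncio
import time
from typing import List, Tuple


def blocking_probe(duration: float) -> str:
    """Stand-in for synchronous I/O such as a subprocess call or a file read."""
    time.sleep(duration)
    return "probe finished"


async def heartbeat(ticks: List[float]) -> None:
    """Represents other work on the loop, such as keeping a stream alive."""
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.monotonic())


async def main() -> Tuple[str, int]:
    ticks: List[float] = []
    # The blocking call runs in a worker thread, so heartbeat() is not starved
    result, _ = await asyncio.gather(
        asyncio.to_thread(blocking_probe, 0.3),
        heartbeat(ticks),
    )
    return result, len(ticks)
```

Calling `blocking_probe(0.3)` directly inside a coroutine would instead freeze every other task on the loop for the full duration.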

2. Ignoring Transport-Specific Constraints

Explanation: stdio and Streamable HTTP handle message framing differently. stdio relies on newline-delimited JSON, while Streamable HTTP carries each message in an HTTP POST and returns either a JSON body or an SSE stream of events. Mixing transport assumptions causes parsing failures. Fix: Validate transport behavior during development. Exercise tool logic through FastMCP's in-memory client (a Client constructed directly from the server instance) and use a real HTTP client for network testing. Never hardcode transport assumptions in business logic.
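The framing difference is easiest to see side by side. The two parsers below are simplified sketches written for this comparison (they are not SDK code): one reads newline-delimited JSON as used on stdio, the other reads `data:` fields from SSE events separated by blank lines.

```python
import json
from typing import Any, Dict, List


def parse_stdio_frames(raw: str) -> List[Dict[str, Any]]:
    """stdio framing: one JSON-RPC message per line."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def parse_sse_frames(raw: str) -> List[Dict[str, Any]]:
    """SSE framing: events are separated by blank lines, payload sits in data: fields."""
    messages: List[Dict[str, Any]] = []
    for block in raw.replace("\r\n", "\n").split("\n\n"):
        data_lines = []
        for line in block.split("\n"):
            if line.startswith("data:"):
                value = line[len("data:"):]
                # A single leading space after the colon is not part of the payload
                data_lines.append(value[1:] if value.startswith(" ") else value)
        if data_lines:
            messages.append(json.loads("\n".join(data_lines)))
    return messages


message = {"jsonrpc": "2.0", "id": 1, "result": {"tools": []}}
stdio_raw = json.dumps(message) + "\n"
sse_raw = "event: message\ndata: " + json.dumps(message) + "\n\n"
```

Feeding `sse_raw` to the stdio parser raises a JSON decoding error on the `event:` line, which is the kind of failure mixed transport assumptions produce.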

3. Overlooking JSON Schema Validation Boundaries

Explanation: FastMCP infers schemas from type hints, but complex nested structures, optional fields, or custom enums may generate incomplete schemas. Clients will reject malformed inputs or fail to serialize responses. Fix: Explicitly define Field constraints using Pydantic or typing.Annotated. Document edge cases in docstrings. Validate schemas against the MCP specification before deployment.
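To show where inference runs out, here is a deliberately naive schema generator built on the standard library. It is not how FastMCP derives schemas (FastMCP relies on Pydantic for that); `infer_schema` and `open_incident` are hypothetical names, and the point is only that an unconstrained hint yields an unconstrained schema.

```python
import inspect
from typing import (
    Any, Callable, Dict, List, Literal, Optional,
    get_args, get_origin, get_type_hints,
)

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def infer_schema(func: Callable[..., Any]) -> Dict[str, Any]:
    """Derive a minimal JSON Schema for a function's parameters (illustrative only)."""
    hints = get_type_hints(func)
    properties: Dict[str, Any] = {}
    required: List[str] = []
    for name, param in inspect.signature(func).parameters.items():
        hint = hints.get(name)
        if get_origin(hint) is Literal:
            # Literal values become an explicit enum the client can validate against
            properties[name] = {"enum": list(get_args(hint))}
        elif hint in _JSON_TYPES:
            properties[name] = {"type": _JSON_TYPES[hint]}
        else:
            # Anything this sketch does not understand ends up unconstrained
            properties[name] = {}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {"type": "object", "properties": properties, "required": required}


def open_incident(
    severity: Literal["critical", "warning", "info"],
    service: str,
    retries: int = 3,
    tags: Optional[List[str]] = None,
) -> str:
    return f"{severity}:{service}"


schema = infer_schema(open_incident)
```

Here `severity` gets a closed set of values because the hint was explicit, while `tags` comes out as an empty schema. Declaring constraints up front avoids leaving that gap for clients to discover at call time.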

4. Treating MCP Sessions as Stateless

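Explanation: MCP maintains explicit session state for capability negotiation and progress tracking. Over Streamable HTTP the server can issue an Mcp-Session-Id header during initialization, and the client must echo it on later requests. Assuming statelessness leads to lost context, duplicate tool registrations, or broken streaming connections. Fix: Hook into session setup and teardown using whatever lifecycle or middleware hooks your FastMCP version exposes. Store session metadata in a distributed cache (Redis) for horizontally scaled deployments, or route each session to the same instance. Clean up stale sessions with TTL policies.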
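The TTL policy can be sketched with an in-memory registry. This is a single-process stand-in for the distributed cache mentioned above, and `SessionRegistry` is a hypothetical class, not a FastMCP API; the clock is injectable so expiry can be tested without sleeping.

```python
import time
import uuid
from typing import Any, Callable, Dict, Optional


class SessionRegistry:
    """Tracks session metadata and expires entries that have been idle too long."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._sessions: Dict[str, Dict[str, Any]] = {}

    def create(self, metadata: Dict[str, Any]) -> str:
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = {"metadata": metadata, "last_seen": self._clock()}
        return session_id

    def touch(self, session_id: str) -> Optional[Dict[str, Any]]:
        """Return metadata and refresh the TTL, or None if the session is unknown or expired."""
        entry = self._sessions.get(session_id)
        if entry is None or self._clock() - entry["last_seen"] > self._ttl:
            self._sessions.pop(session_id, None)
            return None
        entry["last_seen"] = self._clock()
        return entry["metadata"]

    def purge_expired(self) -> int:
        """Remove all idle sessions and report how many were dropped."""
        now = self._clock()
        stale = [sid for sid, entry in self._sessions.items() if now - entry["last_seen"] > self._ttl]
        for sid in stale:
            del self._sessions[sid]
        return len(stale)
```

In a multi-instance deployment the same interface could sit in front of a shared store, with the store's own key expiry replacing `purge_expired()`.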

5. Hardcoding Secrets in Tool Definitions

Explanation: Embedding API keys, database credentials, or tokens directly in tool code exposes them in schema documentation and client logs. AI hosts may inadvertently leak secrets in conversation history. Fix: Use environment variables or secret managers (HashiCorp Vault, AWS Secrets Manager). Read credentials inside the tool at call time, or accept them through authenticated request headers; do not place them in resource URIs, which clients may log or cache. Never include secrets in tool descriptions or return values.
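A small sketch of the environment-variable approach: the secret is read at call time, a missing value fails loudly, and anything that might reach a log or a tool result is redacted first. The variable name `DIAG_API_TOKEN` and both helpers are assumptions made for this example.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required credential is absent from the environment."""


def load_secret(name: str) -> str:
    """Read a credential at call time instead of baking it into tool code."""
    value = os.environ.get(name)
    if not value:
        raise MissingSecretError(f"{name} is not set")
    return value


def redact(value: str, visible: int = 4) -> str:
    """Mask a secret so only its last few characters can appear in logs."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

A tool would call `load_secret("DIAG_API_TOKEN")` inside its body and pass only `redact(...)` output to logging or return values.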

6. Neglecting Progress Callbacks for Long-Running Tools

Explanation: Tools that run for more than about 5 seconds without progress updates can trigger client timeouts and poor UX. MCP supports progress notifications, but developers often omit them. Fix: Accept FastMCP's Context object in the tool signature and call its report_progress() method at logical checkpoints (e.g., 25%, 50%, 75%, 100%). Set reasonable timeouts in client configurations.
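The checkpoint pattern looks roughly like this. To keep the sketch runnable without a server, the progress reporter is an injected coroutine; in a real tool that role would be played by the framework's progress call. `scan_hosts` and `demo` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# Signature of the injected reporter: (completed, total)
ProgressFn = Callable[[float, float], Awaitable[None]]


async def scan_hosts(hosts: List[str], report_progress: ProgressFn) -> List[str]:
    """Process hosts one by one and report after each logical checkpoint."""
    reachable: List[str] = []
    total = len(hosts)
    for index, host in enumerate(hosts, start=1):
        await asyncio.sleep(0.01)  # placeholder for a real probe
        reachable.append(host)
        await report_progress(index, total)
    return reachable


async def demo() -> Tuple[List[str], List[Tuple[float, float]]]:
    updates: List[Tuple[float, float]] = []

    async def record(progress: float, total: float) -> None:
        updates.append((progress, total))

    result = await scan_hosts(["a", "b", "c", "d"], record)
    return result, updates
```

Reporting completed and total counts, rather than a precomputed percentage, leaves the display decision to the client.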

7. Mismanaging Resource URI Schemes

Explanation: Resource URIs must follow scheme://authority/path conventions. Invalid schemes or missing authority components break client resolution and caching mechanisms. Fix: Use standardized schemes (config://, data://, file://). Validate URIs against RFC 3986. Implement resource versioning via query parameters (config://env?v=2) to support cache invalidation.
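A quick pre-registration check can catch malformed URIs. The sketch below uses `urllib.parse.urlsplit` and an allow-list taken from the schemes named above; `validate_resource_uri` is a hypothetical helper, and the rule that `file://` URIs may have an empty authority is an assumption based on common `file:///path` usage.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"config", "data", "file"}


def validate_resource_uri(uri: str) -> bool:
    """Accept only allow-listed schemes that carry an authority (or a path for file://)."""
    parts = urlsplit(uri)
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    if parts.scheme == "file":
        # file:///etc/hosts has an empty authority by convention
        return bool(parts.path)
    return bool(parts.netloc)
```

For example, `config://environment` and `config://env?v=2` pass, while `config:environment` (no authority) and `ftp://host/x` (scheme not allow-listed) are rejected.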

Production Bundle

Action Checklist

Validate all tool schemas against the MCP specification before deployment
Implement authentication middleware (JWT, API keys, or mTLS) for HTTP transport endpoints
Configure CORS headers and rate limiting to prevent abuse from untrusted AI hosts
Set up structured logging with correlation IDs for tracing tool execution across sessions
Implement graceful shutdown handlers to close SSE connections and flush pending metrics
Test server behavior with multiple concurrent clients to verify session isolation
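Add health check endpoints (/health, /ready) for container orchestration platforms; the sample server above does not define them, so the healthcheck in the Compose template will fail until one exists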
Document resource URI schemes and prompt templates for client integration teams
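For the correlation-ID item, one possible approach uses `contextvars` with a logging filter so every log line emitted during a tool call carries the same identifier. The logger name, format string, and `handle_tool_call` wrapper are assumptions for this sketch.

```python
import contextvars
import logging
import uuid
from typing import Optional, TextIO

# Holds the id of the request currently being handled; "-" means none
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copies the current correlation id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def build_logger(stream: TextIO) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s | %(correlation_id)s | %(message)s"))
    handler.addFilter(CorrelationFilter())
    logger = logging.getLogger("mcp.diagnostics.trace")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_tool_call(logger: logging.Logger, tool_name: str, request_id: Optional[str] = None) -> str:
    """Bind a correlation id for the duration of one tool call, then restore the previous one."""
    token = correlation_id.set(request_id or uuid.uuid4().hex)
    try:
        logger.info("tool call started: %s", tool_name)
        return correlation_id.get()
    finally:
        correlation_id.reset(token)
```

Context variables are copied per asyncio task, so concurrent tool calls do not overwrite each other's identifiers.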

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Local AI development & testing	stdio transport with FastMCP (Python)	Zero network overhead, simple process management, ideal for IDE integrations	Minimal; runs on developer machines
Multi-client cloud deployment	Streamable HTTP with reverse proxy	Enables load balancing, authentication, and scalable session management	Moderate; requires infrastructure (LB, TLS, monitoring)
Edge/Serverless deployment	TypeScript MCP SDK on Cloudflare/Vercel	Native HTTP support, cold-start optimization, global distribution	Higher; platform-specific pricing, but reduces egress costs
High-throughput metric collection	Async tools with connection pooling	Prevents event loop blocking, maintains SSE stability under load	Low; requires async library investment
Strict compliance environments	Secret-manager-backed credential injection	Keeps secrets out of tool schemas and conversation logs	Moderate; requires secret manager integration

Configuration Template

# docker-compose.yml
version: "3.9"
services:
  mcp-diagnostics:
    build: .
    ports:
      - "8080:8080"
    environment:
      - MCP_HTTP_HOST=0.0.0.0
      - MCP_HTTP_PORT=8080
      - LOG_LEVEL=info
      - APP_ENV=production
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "0.5"
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"

# requirements.txt
fastmcp>=2.3.0  # Streamable HTTP transport is not available in the earliest 2.x releases
psutil>=5.9.0
uvicorn>=0.27.0
pydantic>=2.0.0

Quick Start Guide

Initialize Project: Create a virtual environment and install dependencies: python -m venv .venv && source .venv/bin/activate && pip install fastmcp psutil uvicorn
Scaffold Server: Save the diagnostics_server.py implementation to your project root. Ensure type hints and docstrings match your target capabilities.
Launch Transport: Run python diagnostics_server.py. The server binds to 0.0.0.0:8080 and exposes Streamable HTTP endpoints with automatic SSE streaming.
Connect Client: Configure your AI host (Claude Desktop, Cursor, or custom MCP client) to point to http://localhost:8080/mcp. Verify tool discovery by invoking tools/list and testing metric collection.
