← Back to Blog
AI/ML2026-05-12·69 min read

LLaMA 3.3 AI Assistant to My Spring Boot WebSocket App

By Hassan Yosuf

Asynchronous LLM Routing in Real-Time WebSocket Pipelines: A Spring Boot Implementation Guide

Current Situation Analysis

Real-time messaging systems operate on a strict latency contract. When a user sends a message, the expectation is sub-100ms propagation across all connected clients. Large Language Models (LLMs), by contrast, are inherently asynchronous and computationally heavy. Inference times typically range from 500ms to 2s depending on model size, context length, and provider infrastructure.

The industry pain point emerges when developers attempt to merge these two paradigms naively. Many teams either block the primary WebSocket dispatch thread while waiting for LLM responses, or they isolate AI interactions into separate channels entirely. The former degrades human-to-human chat performance, causing UI freezes and timeout errors. The latter fractures the user experience, forcing participants to switch contexts to interact with the assistant.

This problem is frequently misunderstood because WebSocket/STOMP handlers in frameworks like Spring Boot execute synchronously by default. Developers often treat an LLM call as just another service invocation, unaware that external API latency directly competes with the real-time message bus. The architectural constraint is non-negotiable: AI inference must never block the core messaging pipeline. Achieving this requires strict decoupling, asynchronous thread isolation, and event-driven routing that treats the LLM as a background sidecar rather than a synchronous dependency.

WOW Moment: Key Findings

The critical insight lies in how message routing impacts system throughput and user experience. When AI processing is decoupled from the primary WebSocket stream, human messaging latency remains stable regardless of LLM inference duration.

Approach Human Message Latency AI Inference Overhead Thread Pool Saturation Risk UI Responsiveness
Synchronous Blocking 180ms+ (degrades with load) 800ms (blocks sender) High (STOMP threads exhaust) Frozen during inference
Decoupled Async Routing 45ms (stable) 800ms (background) Low (dedicated pool) Smooth, streaming-ready

This finding matters because it proves that real-time chat and generative AI can coexist in the same room without compromising the core messaging contract. By isolating AI routing into a dedicated service layer and leveraging asynchronous execution, the system maintains high-performance state synchronization for human participants while delivering powerful LLM capabilities on demand. The architecture transforms the AI from a blocking dependency into an event-driven extension.

Core Solution

The implementation relies on four coordinated components: a WebSocket/STOMP endpoint, a message interceptor, an asynchronous inference service, and a broadcast mechanism. The design prioritizes thread isolation, provider abstraction, and non-blocking I/O.

Step 1: WebSocket/STOMP Configuration

Spring Boot's WebSocket support requires explicit configuration to enable STOMP messaging and custom message conversion. We define a configuration class that registers endpoints and enables message broker functionality.

@Configuration
@EnableWebSocketMessageBroker
public class StreamBrokerConfig implements WebSocketMessageBrokerConfigurer {

    @Override
    public void configureMessageBroker(MessageBrokerRegistry registry) {
        registry.enableSimpleBroker("/topic", "/queue");
        registry.setApplicationDestinationPrefixes("/app");
        registry.setUserDestinationPrefix("/user");
    }

    @Override
    public void registerStompEndpoints(StompEndpointRegistry registry) {
        registry.addEndpoint("/ws/chat")
                .setAllowedOriginPatterns("*")
                .withSockJS();
    }
}

Rationale: SockJS fallback ensures cross-browser compatibility and handles WebSocket connection drops gracefully. The /app prefix routes messages to @MessageMapping handlers, while /topic enables pub/sub broadcasting.

Step 2: Message Interception & Routing Logic

Instead of modifying the core chat handler, we introduce a dedicated endpoint that inspects incoming payloads. Messages prefixed with @ai are extracted and forwarded to the inference layer.

@Controller
public class ChatStreamController {

    private final SimpMessagingTemplate brokerTemplate;
    private final InferenceRouter inferenceRouter;

    public ChatStreamController(SimpMessagingTemplate brokerTemplate, InferenceRouter inferenceRouter) {
        this.brokerTemplate = brokerTemplate;
        this.inferenceRouter = inferenceRouter;
    }

    @MessageMapping("/chat.send")
    @SendTo("/topic/room.{roomId}")
    public ChatEnvelope dispatchMessage(@DestinationVariable String roomId, ChatEnvelope payload) {
        if (payload.getContent().startsWith("@ai")) {
            String cleanPrompt = payload.getContent().substring(3).trim();
            inferenceRouter.routeToInference(roomId, cleanPrompt, payload.getSenderId());
            return new ChatEnvelope(payload.getSenderId(), "AI request queued...", System.currentTimeMillis());
        }
        return payload;
    }
}

Rationale: The controller remains lightweight. It performs a fast string check and immediately returns a placeholder to the human stream. The actual LLM call is delegated, preserving the synchronous contract for standard messages.

Step 3: Async Inference Service Layer

The inference router acts as a facade. It delegates to a provider-agnostic interface, allowing seamless swapping of LLM backends. Execution is marked @Async to run on a dedicated thread pool.

@Service
public class InferenceRouter {

    private final LlmProvider provider;
    private final SimpMessagingTemplate brokerTemplate;

    public InferenceRouter(LlmProvider provider, SimpMessagingTemplate brokerTemplate) {
        this.provider = provider;
        this.brokerTemplate = brokerTemplate;
    }

    @Async("inferenceExecutor")
    public void routeToInference(String roomId, String prompt, String originalSender) {
        try {
            String response = provider.generateCompletion(prompt);
            ChatEnvelope aiReply = new ChatEnvelope("system-ai", response, System.currentTimeMillis());
            brokerTemplate.convertAndSend("/topic/room." + roomId, aiReply);
        } catch (Exception e) {
            brokerTemplate.convertAndSend("/topic/room." + roomId, 
                new ChatEnvelope("system-ai", "Inference failed: " + e.getMessage(), System.currentTimeMillis()));
        }
    }
}

Rationale: @Async ensures the STOMP dispatch thread is never held hostage. The dedicated inferenceExecutor pool isolates AI workloads from user-facing operations. Exception handling guarantees the chat room receives a fallback message instead of hanging.

Step 4: Provider Implementation (Groq + LLaMA 3.3)

The concrete provider uses WebClient for non-blocking HTTP calls to the Groq API.

@Service
public class GroqInferenceClient implements LlmProvider {

    private final WebClient groqClient;
    private final String modelId;

    public GroqInferenceClient(@Value("${llm.groq.api-key}") String apiKey,
                               @Value("${llm.groq.model}") String modelId) {
        this.modelId = modelId;
        this.groqClient = WebClient.builder()
                .baseUrl("https://api.groq.com/openai/v1")
                .defaultHeader("Authorization", "Bearer " + apiKey)
                .build();
    }

    @Override
    public String generateCompletion(String prompt) {
        GroqRequest request = new GroqRequest(modelId, List.of(new Message("user", prompt)));
        
        return groqClient.post()
                .uri("/chat/completions")
                .bodyValue(request)
                .retrieve()
                .bodyToMono(GroqResponse.class)
                .map(resp -> resp.choices().get(0).message().content())
                .block(Duration.ofSeconds(15));
    }
}

Rationale: WebClient provides reactive I/O without blocking the async thread. A 15-second timeout prevents runaway requests. The provider interface (LlmProvider) abstracts the HTTP layer, making it trivial to swap in Claude, GPT-4o-mini, or a local Ollama instance by implementing the same contract.

Pitfall Guide

1. Blocking the STOMP Dispatch Thread

Explanation: Executing LLM calls directly inside @MessageMapping methods ties up the WebSocket thread pool. Under load, human messages queue behind AI requests, causing timeout errors and UI freezes. Fix: Always route AI requests to an @Async method backed by a dedicated ThreadPoolTaskExecutor. Keep the message handler strictly synchronous and lightweight.

2. Context Window Bleed

Explanation: Sending full chat history to the LLM on every @ai trigger rapidly consumes context limits and increases latency/cost. Fix: Implement a sliding window strategy. Only forward the last N messages or use a summarization pipeline before inference. Pass context explicitly rather than relying on the provider to remember state.

3. Silent API Failures

Explanation: Network timeouts, rate limits, or malformed payloads can cause the inference thread to fail silently, leaving users waiting indefinitely. Fix: Wrap provider calls in try-catch blocks and broadcast explicit error envelopes to the room. Implement circuit breaker patterns (e.g., Resilience4j) to fail fast when the LLM provider degrades.

4. Rate Limit Exhaustion

Explanation: Groq, OpenAI, and Anthropic enforce strict RPM/TPM limits. Unthrottled concurrent @ai triggers can trigger 429 errors and temporary account restrictions. Fix: Implement token bucket rate limiting at the service layer. Queue excess requests or reject them with a polite "rate limit reached" message. Monitor provider dashboard metrics closely.

5. UI State Drift

Explanation: Frontend clients may render AI messages out of order if network latency causes the async response to arrive after subsequent human messages. Fix: Include monotonic timestamps or sequence IDs in all envelopes. The client should sort messages by sequence rather than arrival order. Consider optimistic UI updates with rollback on failure.

6. Provider Lock-in

Explanation: Hardcoding Groq-specific request/response structures into the routing layer makes future migrations painful and increases technical debt. Fix: Define a LlmProvider interface with a single generateCompletion(String) method. Inject the concrete implementation via Spring's @Qualifier or configuration properties. Keep HTTP serialization isolated to the provider class.

7. Unmonitored Token Costs

Explanation: Real-time chat apps can generate thousands of @ai triggers daily. Without monitoring, token consumption scales unpredictably, leading to unexpected billing spikes. Fix: Instrument the inference service with metrics (Micrometer/Prometheus). Track tokens per request, average latency, and error rates. Set up alerts for abnormal usage patterns and implement per-user or per-room quotas if necessary.

Production Bundle

Action Checklist

  • Configure dedicated async thread pool for inference workloads
  • Implement prefix-based routing (@ai) with fast string validation
  • Abstract LLM calls behind a provider interface (LlmProvider)
  • Add explicit timeout and error fallback broadcasting
  • Instrument token usage, latency, and error rate metrics
  • Implement client-side message sequencing to prevent UI drift
  • Set up rate limiting or request queuing for high-traffic scenarios
  • Validate and sanitize all user prompts before forwarding to LLM

Decision Matrix

Scenario Recommended Approach Why Cost Impact
Low-latency real-time chat Groq + LLaMA 3.3 Optimized for fast inference, sub-500ms response times Moderate ($0.24/M input tokens)
Complex reasoning / code generation Anthropic Claude API Superior context handling and structured output capabilities Higher ($3.00/M input tokens)
High-volume general purpose OpenAI GPT-4o-mini Reliable, well-documented, cost-optimized for chat Low ($0.15/M input tokens)
Strict data privacy / on-premise Local Ollama Instance Zero external API calls, full data control, no per-token cost High infrastructure (GPU/CPU)

Configuration Template

# application.yml
spring:
  websocket:
    allowed-origins: "*"
    sockjs:
      enabled: true
      client-library: /js/sockjs.min.js

llm:
  groq:
    api-key: ${GROQ_API_KEY}
    model: llama-3.3-70b-versatile
    timeout-seconds: 15
    max-concurrent: 20

async:
  inference:
    core-pool-size: 4
    max-pool-size: 16
    queue-capacity: 100
    thread-name-prefix: inference-

Quick Start Guide

  1. Initialize Spring Boot Project: Generate a Spring Boot 3.x project with spring-boot-starter-websocket, spring-boot-starter-web, and spring-boot-starter-validation.
  2. Configure WebSocket Broker: Add StreamBrokerConfig to enable STOMP messaging and register the /ws/chat endpoint with SockJS fallback.
  3. Wire Inference Layer: Create the LlmProvider interface, implement GroqInferenceClient, and configure the inferenceExecutor thread pool in a @Configuration class.
  4. Deploy & Test: Containerize with Docker, set GROQ_API_KEY in environment variables, and connect via SockJS client. Send @ai Hello to verify async routing and broadcast behavior.
LLaMA 3.3 AI Assistant to My Spring Boot WebSocket App | Codcompass