deserialization to fail fast on malformed responses.
- Connection Pooling: Ollama supports concurrent requests. Configure the HTTP client with a connection pool sized to your GPU's batch processing capacity.
Step 3: Implementation
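The following example demonstrates an InferenceGateway service. This class abstracts the Ollama API, validates configuration, and converts transport and parsing failures into a single exception type. It does not retry on its own; retries belong in the caller or a wrapper.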
package com.codcompass.ai.inference;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
public class InferenceGateway {

    private final HttpClient httpClient;
    private final ObjectMapper jsonMapper;
    private final String baseUrl;
    private final String defaultModel;

    public InferenceGateway(String baseUrl, String defaultModel) {
        this.baseUrl = baseUrl;
        this.defaultModel = defaultModel;
        this.jsonMapper = new ObjectMapper();
        this.httpClient = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();
    }

    /**
     * Executes a synchronous completion request.
     */
    public CompletionResponse generate(String prompt, InferenceConfig config) {
        try {
            String payload = jsonMapper.writeValueAsString(Map.of(
                "model", config.modelName().orElse(defaultModel),
                "prompt", prompt,
                "stream", false,
                "options", Map.of(
                    "num_ctx", config.contextSize(),
                    "temperature", config.temperature(),
                    "num_gpu", config.gpuLayers()
                )
            ));

            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

            HttpResponse<String> response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                throw new InferenceException("Ollama API error: " + response.statusCode());
            }
            return jsonMapper.readValue(response.body(), CompletionResponse.class);
        } catch (InterruptedException e) {
            // Restore the interrupt flag only when the thread was actually interrupted.
            Thread.currentThread().interrupt();
            throw new InferenceException("Inference request interrupted", e);
        } catch (IOException e) {
            // Covers network failures and JSON (de)serialization errors.
            throw new InferenceException("Inference request failed", e);
        }
    }
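
    /**
     * Executes an asynchronous completion request for non-blocking workflows.
     * Without an explicit executor, supplyAsync runs the blocking call on
     * ForkJoinPool.commonPool(); consider a dedicated executor under heavy load.
     */
    public CompletableFuture<CompletionResponse> generateAsync(String prompt, InferenceConfig config) {
        return CompletableFuture.supplyAsync(() -> generate(prompt, config));
    }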

    // Record definitions for type safety
    public record InferenceConfig(
        String model,        // nullable: null falls back to the gateway's default model
        int contextSize,
        double temperature,
        int gpuLayers
    ) {
        public InferenceConfig {
            if (contextSize <= 0) throw new IllegalArgumentException("Context size must be positive");
            if (temperature < 0.0 || temperature > 2.0) throw new IllegalArgumentException("Temperature out of range");
        }

        // A record accessor must return the component's declared type, so the
        // Optional view is exposed under a separate name instead of overriding model().
        public java.util.Optional<String> modelName() {
            return java.util.Optional.ofNullable(model);
        }
    }
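
    // Ollama's /api/generate response carries more fields than are mapped here
    // (for example created_at and eval_count); ignore them instead of failing.
    @com.fasterxml.jackson.annotation.JsonIgnoreProperties(ignoreUnknown = true)
    public record CompletionResponse(
        String model,
        String response,
        boolean done,
        @JsonProperty("total_duration") long totalDuration,
        @JsonProperty("load_duration") long loadDuration
    ) {}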

    public static class InferenceException extends RuntimeException {
        public InferenceException(String message, Throwable cause) {
            // The superclass constructor already records the cause; calling
            // initCause() afterwards would throw IllegalStateException.
            super(message, cause);
        }

        public InferenceException(String message) {
            super(message);
        }
    }
}
Rationale for Design Choices:
- Records: Java records enforce immutability and reduce boilerplate for data transfer objects.
- Validation: Constructor validation in InferenceConfig prevents invalid parameters from reaching the model, saving compute cycles.
- Async Support: generateAsync allows the calling service to offload inference without blocking the request thread, essential for high-throughput applications.
- Error Handling: Custom InferenceException wraps network and serialization errors, providing a unified failure mode for upstream handlers.
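Because every failure surfaces as one unchecked exception type, a caller can layer retries on top of the gateway. The following is a minimal sketch of such a wrapper with exponential backoff; the Retry class, the withBackoff method, and its parameters are hypothetical names introduced for illustration and are not part of the gateway above.

```java
import java.util.function.Supplier;

/** Hypothetical helper: retries a call with exponential backoff. */
final class Retry {
    private Retry() {}

    /**
     * Runs the call up to maxAttempts times. After each failed attempt it sleeps
     * initialBackoffMs, then twice that, and so on. Rethrows the last failure.
     */
    static <T> T withBackoff(Supplier<T> call, int maxAttempts, long initialBackoffMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialBackoffMs;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // no sleep after the final attempt
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while backing off", ie);
                }
                delay *= 2;
            }
        }
        throw last;
    }
}
```

A call site might look like Retry.withBackoff(() -> gateway.generate(prompt, config), 3, 500). In practice it is worth retrying only failures that are likely transient (timeouts, connection resets), since a request rejected for a bad parameter will fail the same way every time.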
Step 4: Optimization via Modelfile
For production workloads, default parameters are insufficient. Create a Modelfile to tune LLaMA 3.3 for your specific use case.
FROM llama3.3:70b-instruct-q4_K_M
# Increase context window for long-document analysis
PARAMETER num_ctx 8192
# Optimize for deterministic output in classification tasks
PARAMETER temperature 0.1
# System prompt for consistent behavior
SYSTEM """
You are a precise analysis engine. Provide structured JSON output only.
Do not include conversational filler.
"""
Build and run the custom model:
ollama create my-analysis-model -f Modelfile
ollama run my-analysis-model
Pitfall Guide
1. VRAM Exhaustion and OOM Crashes
- Explanation: Loading a 70B model without quantization or with excessive context windows can exhaust GPU memory, causing Ollama to crash or swap to CPU, degrading performance by orders of magnitude.
- Fix: Always use quantized variants (e.g., q4_K_M). Monitor VRAM usage with nvidia-smi. If OOM occurs, reduce num_ctx or switch to a smaller parameter count.
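The arithmetic behind this pitfall can be sketched in a few lines. The estimate below covers the weights only (parameters times bits per weight), so it ignores the KV cache, activations, and runtime overhead, all of which grow with num_ctx. VramEstimate and weightsGb are hypothetical names, and 4-bit formats such as q4_K_M average somewhat more than 4 bits per weight, so the 4-bit figure is best read as a lower bound.

```java
/** Hypothetical back-of-the-envelope estimator for model weight size. */
final class VramEstimate {
    private VramEstimate() {}

    /**
     * Approximate size of the weights alone, in decimal gigabytes:
     * (billions of parameters) x (bits per weight) / 8 bits per byte.
     */
    static double weightsGb(double paramsBillions, double bitsPerWeight) {
        return paramsBillions * bitsPerWeight / 8.0;
    }
}
```

For a 70B model this gives roughly 140 GB at 16 bits per weight and roughly 35 GB at 4 bits per weight, which is why the unquantized model does not fit on a single consumer or workstation GPU.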
2. Context Window Mismatch
- Explanation: The model's native context window may be larger than the default Ollama setting. If your application sends prompts exceeding the configured num_ctx, the model will truncate input silently, leading to hallucinations or incomplete responses.
- Fix: Explicitly set num_ctx in the Modelfile or API options. Ensure this value aligns with your application's maximum prompt length.
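One way to catch oversized prompts before they are truncated is a pre-flight check in the application. The sketch below uses a rough characters-per-token heuristic, which is an assumption and not the model's tokenizer; ContextBudget and its members are hypothetical names. The prompt_eval_count field in Ollama's response reports the actual prompt token count and can be used to calibrate the ratio.

```java
/** Hypothetical pre-flight check that a prompt is likely to fit in num_ctx. */
final class ContextBudget {
    private ContextBudget() {}

    /** Assumed average for English text; real tokenizers vary by language and content. */
    static final double ASSUMED_CHARS_PER_TOKEN = 4.0;

    /** Rough token estimate, rounded up. */
    static int estimateTokens(String text) {
        return (int) Math.ceil(text.length() / ASSUMED_CHARS_PER_TOKEN);
    }

    /** True when the estimated prompt tokens plus the tokens reserved for output fit in numCtx. */
    static boolean fits(String prompt, int numCtx, int reservedForOutput) {
        return estimateTokens(prompt) + reservedForOutput <= numCtx;
    }
}
```

A service could reject or chunk a request when fits(...) returns false instead of letting the input be cut off.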
3. Blocking the Event Loop
- Explanation: Using synchronous HTTP clients in reactive frameworks (like Spring WebFlux) blocks the event loop, reducing throughput and increasing latency for all concurrent requests.
- Fix: Use non-blocking clients (WebClient in Spring) or offload synchronous calls to a dedicated thread pool using CompletableFuture or virtual threads.
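The thread-pool option can be sketched with the JDK alone. InferenceOffloader is a hypothetical name, and the pool size is an assumption that would normally be matched to how many requests the Ollama host can serve concurrently.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical wrapper that runs blocking inference calls on their own bounded pool. */
final class InferenceOffloader implements AutoCloseable {
    private final ExecutorService pool;

    InferenceOffloader(int threads) {
        AtomicInteger counter = new AtomicInteger();
        // Named daemon threads make the pool easy to spot in thread dumps.
        this.pool = Executors.newFixedThreadPool(threads, runnable -> {
            Thread t = new Thread(runnable, "inference-" + counter.incrementAndGet());
            t.setDaemon(true);
            return t;
        });
    }

    /** Runs the blocking call on the dedicated pool instead of the caller's thread. */
    <T> CompletableFuture<T> submit(Supplier<T> blockingCall) {
        return CompletableFuture.supplyAsync(blockingCall, pool);
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}
```

With the gateway from Step 3, a call would look like offloader.submit(() -> gateway.generate(prompt, config)). On Java 21 or later, Executors.newVirtualThreadPerTaskExecutor() could replace the fixed pool, at the cost of losing the built-in concurrency bound.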
4. Cold Start Latency
- Explanation: Ollama unloads models from memory after a period of inactivity (default 5 minutes). Subsequent requests trigger a reload, causing high latency spikes.
- Fix: Set the OLLAMA_KEEP_ALIVE environment variable to a higher value (e.g., -1 to keep the model loaded indefinitely) or implement a health-check pinger to keep the model warm during business hours.
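A pinger can be sketched as a scheduled task. The example assumes the ping itself is a POST to /api/generate that names the model and omits the prompt, which asks Ollama to load the model, and that it passes the per-request keep_alive field. KeepWarm and its methods are hypothetical names, and the payload is assembled by hand only to keep the sketch dependency-free.

```java
import java.time.Duration;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical keep-warm scheduler for an Ollama model. */
final class KeepWarm {
    private KeepWarm() {}

    /** Request body that loads the model without generating text (no prompt field). */
    static String warmUpPayload(String model, String keepAlive) {
        return "{\"model\":\"" + model + "\",\"keep_alive\":\"" + keepAlive + "\"}";
    }

    /**
     * Runs the ping immediately and then at a fixed rate. Exceptions are swallowed
     * because scheduleAtFixedRate cancels all future runs if one run throws.
     */
    static ScheduledFuture<?> schedule(ScheduledExecutorService scheduler, Runnable ping, Duration period) {
        Runnable safePing = () -> {
            try {
                ping.run();
            } catch (RuntimeException e) {
                // A real service would log this; the schedule must survive a failed ping.
            }
        };
        return scheduler.scheduleAtFixedRate(safePing, 0, period.toMillis(), TimeUnit.MILLISECONDS);
    }
}
```

The ping Runnable would send warmUpPayload(...) with whatever HTTP client the service already uses. The period should be shorter than the keep-alive window so the model never expires between pings.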
5. Ignoring GPU Offloading
- Explanation: Running inference primarily on CPU when a GPU is available results in poor performance. Ollama attempts to offload layers automatically, but misconfiguration can limit this.
- Fix: Verify num_gpu is set appropriately. In the API, you can force offloading by setting num_gpu to the number of layers. Ensure CUDA/Metal drivers are correctly installed.
6. Network Binding Restrictions
- Explanation: By default, Ollama binds to 127.0.0.1. If your application runs in a container or on a separate host, it cannot reach the Ollama service.
- Fix: Set OLLAMA_HOST=0.0.0.0 in the Ollama environment variables to allow external connections. Secure this with firewall rules or reverse proxy authentication.
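A quick way to tell a binding problem from an application bug is a plain TCP connect from the client's side of the network. The sketch below does only that; Reachability and canConnect are hypothetical names, and a successful connect shows that the port is reachable, not that the service behind it is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical startup probe: can this process open a TCP connection to the Ollama port? */
final class Reachability {
    private Reachability() {}

    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host.
            return false;
        }
    }
}
```

Running canConnect("ollama-service", 11434, 2000) from inside the application container, where the host name is whatever your deployment uses, separates "Ollama is only listening on loopback" from later failures in the request path.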
7. Version Drift in Models
- Explanation: Relying on the latest tag can introduce breaking changes if the model is updated upstream, affecting reproducibility.
- Fix: Pin models to explicit tags (e.g., llama3.3:70b-instruct-q4_K_M) rather than latest, and record the model ID that ollama list reports so you can detect when the contents behind a tag change.
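Pinning can also be enforced at configuration time. The check below is a small sketch that rejects model references with no tag (which Ollama resolves to latest) or with an explicit latest tag; ModelRef and isPinned are hypothetical names.

```java
/** Hypothetical guard that rejects unpinned model references in configuration. */
final class ModelRef {
    private ModelRef() {}

    /** True when the reference carries an explicit tag other than "latest". */
    static boolean isPinned(String ref) {
        int colon = ref.lastIndexOf(':');
        if (colon < 0 || colon == ref.length() - 1) {
            return false; // no tag: resolves to "latest"
        }
        String tag = ref.substring(colon + 1);
        if (tag.contains("/")) {
            return false; // the colon belonged to a registry host:port, so there is no tag
        }
        return !tag.equals("latest");
    }
}
```

Failing fast at startup when isPinned(defaultModel) is false keeps an unpinned reference from reaching production unnoticed.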
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-Volume Chatbot | Local Ollama + LLaMA 3.1 8B (LLaMA 3.3 ships only as 70B) | Low latency, high throughput, minimal VRAM needed. | High initial HW cost; near-zero marginal cost. |
| Complex Reasoning Agent | Local Ollama + LLaMA 3.3 70B Q4 | Superior reasoning capabilities; data stays on-prem. | Requires multi-GPU setup; significant capital expense. |
| Prototype / Low Traffic | Cloud API | No hardware management; pay-as-you-go. | Variable cost; scales with usage. |
| Regulated Data Processing | Local Ollama + Custom Modelfile | Strict data residency; full control over inference. | Compliance savings; HW amortization. |
Configuration Template
application.yml (Spring Boot)
ai:
  inference:
    base-url: "http://ollama-service:11434"
    default-model: "my-analysis-model"
    timeout-seconds: 30
    retry:
      max-attempts: 3
      backoff-ms: 500
    pool:
      max-connections: 10
      keep-alive-seconds: 60
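On the Java side, these settings could be carried by a small immutable type that validates them at startup. The sketch below is plain Java with no Spring binding; InferenceProperties, RetrySettings, and PoolSettings are hypothetical names that mirror the YAML keys, and how the values get bound from the file is left to the framework.

```java
import java.time.Duration;

/** Hypothetical typed view of the ai.inference settings. */
record InferenceProperties(
    String baseUrl,
    String defaultModel,
    Duration timeout,
    RetrySettings retry,
    PoolSettings pool
) {
    record RetrySettings(int maxAttempts, Duration backoff) {
        RetrySettings {
            if (maxAttempts < 1) throw new IllegalArgumentException("max-attempts must be >= 1");
            if (backoff.isNegative()) throw new IllegalArgumentException("backoff-ms must be >= 0");
        }
    }

    record PoolSettings(int maxConnections, Duration keepAlive) {
        PoolSettings {
            if (maxConnections < 1) throw new IllegalArgumentException("max-connections must be >= 1");
        }
    }

    InferenceProperties {
        if (baseUrl == null || baseUrl.isBlank()) throw new IllegalArgumentException("base-url is required");
        if (timeout.isZero() || timeout.isNegative()) throw new IllegalArgumentException("timeout-seconds must be > 0");
    }

    /** The values from the application.yml template, expressed in code. */
    static InferenceProperties templateDefaults() {
        return new InferenceProperties(
            "http://ollama-service:11434",
            "my-analysis-model",
            Duration.ofSeconds(30),
            new RetrySettings(3, Duration.ofMillis(500)),
            new PoolSettings(10, Duration.ofSeconds(60))
        );
    }
}
```

Validating in the compact constructors means a bad value fails at application start instead of on the first inference request.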
Modelfile for Classification
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_ctx 4096
PARAMETER temperature 0.0
PARAMETER stop "###"
SYSTEM """
Classify the input text into one of the following categories:
[Finance, Health, Technology, Other].
Output only the category name.
"""
Quick Start Guide
- Install Ollama: Download and install Ollama for your OS. Run ollama serve to start the daemon.
- Pull Model: Execute ollama pull llama3.3:70b-instruct-q4_K_M to download the model weights.
- Verify Service: Test connectivity with curl http://localhost:11434/api/tags. Ensure the model appears in the list.
- Deploy Gateway: Integrate the InferenceGateway class into your application. Configure the base URL and model name via environment variables.
- Run Inference: Call gateway.generate("Analyze this payload...", config) and process the CompletionResponse. Monitor logs for latency and error rates.
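The verification step can also run from inside the application, for example as a startup check before the gateway accepts traffic. The sketch below calls /api/tags with the JDK HTTP client and looks for the model name in the response body. OllamaHealthCheck and isModelListed are hypothetical names, the timeouts are assumed values, and the substring match is deliberately crude; a real check would parse the JSON.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical startup check: is the expected model present in Ollama's model list? */
final class OllamaHealthCheck {
    private OllamaHealthCheck() {}

    static boolean isModelListed(String baseUrl, String model) {
        HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5)) // assumed value
            .build();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/api/tags"))
            .timeout(Duration.ofSeconds(10)) // assumed value
            .GET()
            .build();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            // Crude match on the quoted model name; sufficient for a smoke test.
            return response.statusCode() == 200 && response.body().contains("\"" + model + "\"");
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A call such as isModelListed("http://localhost:11434", "llama3.3:70b-instruct-q4_K_M") returning false points to either an unreachable daemon or a model that has not been pulled yet.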
This guide provides the technical foundation for deploying LLaMA 3.3 locally with Ollama. For specific hardware configurations or advanced quantization strategies, consult the official Ollama documentation and model card specifications.