
Running Local AI (Self-hosted) Coding Assistants in VS Code with Ollama and GitHub Copilot

By Codcompass Team · 7 min read

Zero-Telemetry Coding: Leveraging VS Code BYOK for Local LLM Inference

Current Situation Analysis

Modern development workflows increasingly rely on AI assistance, but this introduces significant friction for teams handling sensitive intellectual property, regulated data, or strict compliance requirements. Sending code snippets to cloud-based LLMs creates data egress risks, potential licensing violations, and dependency on external vendor availability.

Historically, developers faced a binary choice: use cloud-based assistants with full feature parity but zero data control, or run local models with full privacy but fragmented tooling and poor IDE integration. The introduction of Bring Your Own Key (BYOK) support in GitHub Copilot fundamentally shifts this landscape. It allows VS Code to route inference requests to self-hosted endpoints while retaining the native chat interface, agent capabilities, and context awareness of the Copilot extension.

This capability is often misunderstood as a simple proxy feature. In reality, it enables a hybrid architecture where the IDE remains the control plane, but the inference plane is decoupled and localized. However, adoption is hindered by hardware complexity, configuration nuances, and misconceptions about feature parity. Production deployments require careful attention to VRAM allocation, network security, and the specific limitations of local model integration compared to cloud counterparts.

WOW Moment: Key Findings

The integration of BYOK with local inference engines like Ollama creates a distinct trade-off profile. The following comparison highlights the operational differences between standard cloud Copilot and a local BYOK configuration.

Dimension            Cloud Copilot                 Local BYOK (Ollama)
Data Egress          Code sent to external APIs    Zero egress; inference on-prem
Latency Profile      Network-dependent; variable   Hardware-bound; deterministic
Cost Structure       Recurring subscription        Capital expenditure (GPU/RAM)
Inline Autocomplete  Full support                  Limited or unsupported
Agent Mode           Full tool access              Dependent on model tool-calling capability
Compliance           Vendor-managed                Developer-managed

Why this matters: This table reveals that BYOK is not a 1:1 replacement for cloud Copilot. It is a specialized tool for privacy-critical workflows. The loss of inline autocomplete is a critical operational change that teams must account for in their productivity expectations. However, the ability to run models like qwen2.5-coder:32b locally provides coding performance that rivals cloud models for specific tasks, without a single byte of code leaving the infrastructure.

Core Solution

Implementing a local AI coding assistant requires synchronizing three components: the inference engine, the IDE extension, and the network configuration. The architecture flows from VS Code through the Copilot Chat extension to the Ollama API, which manages the loaded model on the host hardware.

Prerequisites and Version Alignment

Version compatibility is strict. Mismatched versions can cause model discovery failures or API incompatibilities.

  • VS Code: Version 1.113 or higher.
  • GitHub Copilot Chat Extension: Version 0.41.0 or higher.
  • Ollama: Version 0.18.3 or higher.
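
A quick way to sanity-check installed versions against these minimums is a small shell helper; this is a sketch, using `sort -V` for the semantic comparison (the sample values below are placeholders, not detected versions):

version_ge() {
  # Succeeds when $1 >= $2, comparing as semantic versions via sort -V.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# In practice, feed it real values from:
#   code --version | head -n1    (VS Code)
#   ollama --version             (Ollama)
version_ge "1.113.0" "1.113" && echo "VS Code OK"
version_ge "0.18.3"  "0.18.3" && echo "Ollama OK"

If either check fails silently, update the corresponding component before proceeding; mismatches typically surface later as model discovery failures.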

Step 1: Inference Engine Deployment

Install Ollama using the official distribution method. For production environments, configure the service to manage resources correctly.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Enable and start the service
sudo systemctl enable --now ollama

Production Insight: By default, Ollama binds to 127.0.0.1. If you plan to host the inference server on a separate node or require remote access within a trusted network, you must override the service environment.

Create a systemd drop-in to configure the bind address:

sudo mkdir -p /etc/systemd/system/ollama.service.d
cat <<EOF | sudo tee /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama
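
After the restart, confirm the daemon is reachable on the configured interface. Ollama's root endpoint answers with a plain status string, and `ss` shows the actual bind address:

# The root endpoint returns "Ollama is running" when the daemon is up.
curl -s http://localhost:11434/

# Confirm the listener is bound to the expected interface (0.0.0.0 after the override).
ss -ltn 'sport = :11434'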

Step 2: Model Selection and Hardware Mapping

Model selection must align with available VRAM. Running a model that exceeds VRAM causes swapping to system RAM, resulting in unusable latency.

Model Variant      Parameter Size  VRAM Requirement  Hardware Profile  Use Case
qwen2.5-coder:14b  14B             10–12 GB          RTX 3090 / 4090   Balanced coding assistant
qwen2.5-coder:32b  32B             24 GB+            RTX 4090 / A5000  High-fidelity reasoning
deepseek-coder-v2  Varies          24 GB+            RTX 4090 / A5000  Complex logic generation
phi4               Optimized       CPU/RAM           Modern x86 CPU    Low-resource environments
phi4-mini          Optimized       ~3 GB RAM         Modern x86 CPU    Lightweight tasks

Note: Quantization levels significantly impact memory usage. The values above assume standard quantization. Always verify VRAM availability before loading large models.
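
The table values can be approximated from first principles: memory ≈ parameter count × bits per weight / 8, plus roughly 20% overhead for the KV cache and runtime buffers. A rough sketch; the ~4.85 bits/weight figure approximates a Q4_K_M quantization and is an assumption, not an Ollama-reported value:

# Rough VRAM estimate in GB: params(B) * bits/weight / 8, plus ~20% overhead.
# Treat the result as a floor, not a guarantee.
estimate_vram_gb() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", p * bits / 8 * 1.2 }'
}

estimate_vram_gb 14 4.85   # qwen2.5-coder:14b at ~Q4_K_M -> ~10.2 GB
estimate_vram_gb 32 4.85   # qwen2.5-coder:32b at ~Q4_K_M -> ~23.3 GB

Both estimates land inside the table's ranges, which is why a 14B model is comfortable on a 12 GB card while a 32B model needs 24 GB or more.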

Pull and verify the model:

# Pull the recommended coding model
ollama pull qwen2.5-coder:14b

# Verify the model is available
ollama list | grep qwen2.5-coder

Step 3: VS Code Integration

The integration relies on the Chat: Manage Language Models command. This step registers the local endpoint with the Copilot extension.

  1. Open the Command Palette (Ctrl+Shift+P or Cmd+Shift+P).
  2. Execute Chat: Manage Language Models.
  3. Select Add Models and choose Ollama.
  4. VS Code auto-detects http://localhost:11434. If using a remote endpoint, input the secure URL (e.g., https://ollama.internal.domain).

Critical Configuration: Ensure the model entry indicates Tools support. Without tool capability, Copilot cannot execute agent actions or interact with the workspace effectively.

Step 4: Session Targeting

Once configured, you must explicitly direct Copilot to use the local model.

  1. Open Copilot Chat.
  2. Set the session target to Local.
  3. Use the model picker to select your Ollama model (e.g., qwen2.5-coder:14b).

Prompts sent in this session will route to the local API. Verify traffic by monitoring Ollama logs:

journalctl -u ollama -f | grep "/v1/chat/completions"
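
You can also exercise the same OpenAI-compatible endpoint Copilot routes to directly with curl; this assumes the qwen2.5-coder:14b model from Step 2 is already pulled:

# Minimal chat completion against the endpoint BYOK uses.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:14b",
    "messages": [{"role": "user", "content": "Explain what a mutex is in one sentence."}]
  }'

A JSON response here, paired with a matching line in the journal, confirms the local path end to end.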

Step 5: Secure Remote Access (Optional)

For homelab or team setups, expose Ollama via a reverse proxy with authentication. Direct exposure of the API is a security risk.

Nginx Configuration with Basic Auth:

upstream ollama_backend {
    server 127.0.0.1:11434;
}

server {
    listen 443 ssl;
    server_name ai-coding.internal;

    ssl_certificate /etc/ssl/certs/ai-coding.crt;
    ssl_certificate_key /etc/ssl/private/ai-coding.key;

    # Enforce authentication
    auth_basic "Restricted AI Access";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://ollama_backend;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        
        # Timeout adjustments for long inference
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}

Generate the password file:

sudo htpasswd -c /etc/nginx/.htpasswd aiuser

Pitfall Guide

1. Unbound API Exposure

Explanation: Configuring Ollama to bind to 0.0.0.0 without firewall rules or authentication exposes the inference engine to the network. Attackers can exploit this to run unauthorized inference or access internal data. Fix: Always use firewall ACLs, Tailscale/WireGuard tunnels, or reverse proxy authentication. Never expose port 11434 to the public internet.
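
As a sketch of the firewall approach, ufw can admit only a trusted subnet before a blanket deny; 10.0.0.0/24 below is a placeholder for your actual trusted range:

# Allow the inference port only from a trusted subnet, deny everyone else.
# ufw evaluates rules in order, so the allow must come before the deny.
sudo ufw allow from 10.0.0.0/24 to any port 11434 proto tcp
sudo ufw deny 11434/tcp
sudo ufw status numbered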

2. VRAM Miscalculation

Explanation: Attempting to load a 32B model on a GPU with 12GB VRAM results in out-of-memory errors or severe performance degradation due to CPU offloading. Fix: Consult VRAM requirements before pulling models. Use nvidia-smi to monitor memory usage. If VRAM is insufficient, switch to smaller models like phi4-mini or higher quantization levels.

3. Inline Autocomplete Expectation

Explanation: BYOK with local models does not fully support inline autocomplete. Developers may expect ghost text suggestions similar to cloud Copilot. Fix: Manage expectations. Local BYOK is optimized for Chat and Agent mode. Inline autocomplete remains a cloud-centric feature in the current implementation.

4. Session Target Drift

Explanation: VS Code may default to cloud models even after local configuration, causing confusion about where prompts are routed. Fix: Always verify the session target is set to Local in the Copilot Chat interface. Check Ollama logs to confirm requests are arriving locally.

5. Tool Calling Failure

Explanation: The model must support tool calling for agent mode to function. If the model entry in VS Code does not show "Tools", agent capabilities will be disabled. Fix: Ensure the selected model supports tool calling. Verify the model configuration in Chat: Manage Language Models indicates tool support.

6. Quantization Ignorance

Explanation: Different quantization levels (e.g., Q4_K_M vs Q8_0) affect model size and quality. Users may pull a model variant that doesn't match their hardware constraints. Fix: Specify quantization when pulling models if needed. Understand that lower quantization reduces VRAM usage but may impact code generation quality.
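
In the Ollama library, quantization is usually selected via the model tag. The tags below follow the common naming pattern but are illustrative; verify exact tag names on the model's library page, as they differ between models:

# Hypothetical tags following the <size>-<variant>-<quant> pattern.
ollama pull qwen2.5-coder:14b-instruct-q4_K_M   # smaller footprint, lower fidelity
ollama pull qwen2.5-coder:14b-instruct-q8_0     # larger footprint, higher fidelity

# Inspect what a local tag actually contains (quantization, parameter count).
ollama show qwen2.5-coder:14b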

7. Nginx Timeout Errors

Explanation: Local inference can be slower than cloud APIs, especially on CPU or lower-end GPUs. Default Nginx timeouts may cut off long-running requests. Fix: Increase proxy_read_timeout and proxy_send_timeout in the Nginx configuration to accommodate inference latency.

Production Bundle

Action Checklist

  • Verify VS Code (1.113+), Copilot Extension (0.41.0+), and Ollama (0.18.3+) versions.
  • Install Ollama and enable the systemd service.
  • Assess hardware VRAM and select an appropriate model variant.
  • Pull the model and verify availability via ollama list.
  • Configure VS Code via Chat: Manage Language Models to add Ollama.
  • Confirm the model entry shows "Tools" support for agent capabilities.
  • Set Copilot Chat session target to Local and select the model.
  • Test inference with a code explanation prompt and monitor Ollama logs.

Decision Matrix

Scenario                  Recommended Approach                   Why                                                              Cost Impact
Enterprise Compliance     Local BYOK                             Zero data egress; full control over inference.                   High hardware CapEx.
Solo Developer, Low Spec  Cloud Copilot                          No hardware requirements; full feature set.                      Monthly subscription.
Hybrid Workflow           BYOK for Chat, Cloud for Autocomplete  Balances privacy for sensitive tasks with productivity features.  Mixed costs.
Team Homelab              Remote Ollama + Nginx Proxy            Centralized inference; shared GPU resources.                     Hardware + network setup.

Configuration Template

Ollama Systemd Override for Remote Binding:

# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"

Secure Nginx Reverse Proxy:

upstream ollama_backend {
    server 127.0.0.1:11434;
}

server {
    listen 443 ssl;
    server_name ai-coding.internal;

    ssl_certificate /etc/ssl/certs/ai-coding.crt;
    ssl_certificate_key /etc/ssl/private/ai-coding.key;

    auth_basic "Restricted AI Access";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://ollama_backend;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}

Quick Start Guide

  1. Install Ollama: Run curl -fsSL https://ollama.com/install.sh | sh and start the service.
  2. Pull Model: Execute ollama pull qwen2.5-coder:14b to load the coding model.
  3. Configure VS Code: Use Chat: Manage Language Models to add Ollama and verify tool support.
  4. Activate Local Session: Set Copilot Chat target to Local and select the model.
  5. Verify: Send a test prompt and confirm responses originate from the local model via Ollama logs.