Running Local AI (Self-hosted) Coding Assistants in VS Code with Ollama and GitHub Copilot
Zero-Telemetry Coding: Leveraging VS Code BYOK for Local LLM Inference
Current Situation Analysis
Modern development workflows increasingly rely on AI assistance, but this introduces significant friction for teams handling sensitive intellectual property, regulated data, or strict compliance requirements. Sending code snippets to cloud-based LLMs creates data egress risks, potential licensing violations, and dependency on external vendor availability.
Historically, developers faced a binary choice: use cloud-based assistants with full feature parity but zero data control, or run local models with full privacy but fragmented tooling and poor IDE integration. The introduction of Bring Your Own Key (BYOK) support in GitHub Copilot fundamentally shifts this landscape. It allows VS Code to route inference requests to self-hosted endpoints while retaining the native chat interface, agent capabilities, and context awareness of the Copilot extension.
This capability is often misunderstood as a simple proxy feature. In reality, it enables a hybrid architecture where the IDE remains the control plane, but the inference plane is decoupled and localized. However, adoption is hindered by hardware complexity, configuration nuances, and misconceptions about feature parity. Production deployments require careful attention to VRAM allocation, network security, and the specific limitations of local model integration compared to cloud counterparts.
WOW Moment: Key Findings
The integration of BYOK with local inference engines like Ollama creates a distinct trade-off profile. The following comparison highlights the operational differences between standard cloud Copilot and a local BYOK configuration.
| Dimension | Cloud Copilot | Local BYOK (Ollama) |
|---|---|---|
| Data Egress | Code sent to external APIs | Zero egress; inference on-prem |
| Latency Profile | Network-dependent; variable | Hardware-bound; deterministic |
| Cost Structure | Recurring subscription | Capital expenditure (GPU/RAM) |
| Inline Autocomplete | Full support | Limited or unsupported |
| Agent Mode | Full tool access | Dependent on model tool-calling capability |
| Compliance | Vendor-managed | Developer-managed |
Why this matters: This table reveals that BYOK is not a 1:1 replacement for cloud Copilot. It is a specialized tool for privacy-critical workflows. The loss of inline autocomplete is a critical operational change that teams must account for in their productivity expectations. However, the ability to run models like qwen2.5-coder:32b locally provides coding performance that rivals cloud models for specific tasks, without a single byte of code leaving the infrastructure.
Core Solution
Implementing a local AI coding assistant requires synchronizing three components: the inference engine, the IDE extension, and the network configuration. The architecture flows from VS Code through the Copilot Chat extension to the Ollama API, which manages the loaded model on the host hardware.
Prerequisites and Version Alignment
Version compatibility is strict. Mismatched versions can cause model discovery failures or API incompatibilities.
- VS Code: Version 1.113 or higher.
- GitHub Copilot Chat Extension: Version 0.41.0 or higher.
- Ollama: Version 0.18.3 or higher.
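A quick way to confirm what is installed before proceeding is to query each component from the command line. The grep pattern for the Copilot Chat extension is illustrative; the exact extension identifier may differ on your installation.
# Check the VS Code version
code --version
# List installed Copilot-related extensions with their versions
code --list-extensions --show-versions | grep -i copilot
# Check the Ollama version
ollama --version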
Step 1: Inference Engine Deployment
Install Ollama using the official distribution method. For production environments, configure the service to manage resources correctly.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Enable and start the service
sudo systemctl enable --now ollama
Production Insight: By default, Ollama binds to 127.0.0.1. If you plan to host the inference server on a separate node or require remote access within a trusted network, you must override the service environment.
Create a systemd drop-in to configure the bind address:
sudo mkdir -p /etc/systemd/system/ollama.service.d
cat <<EOF | sudo tee /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
Step 2: Model Selection and Hardware Mapping
Model selection must align with available VRAM. Running a model that exceeds VRAM causes swapping to system RAM, resulting in unusable latency.
| Model Variant | Parameter Size | VRAM Requirement | Hardware Profile | Use Case |
|---|---|---|---|---|
| qwen2.5-coder:14b | 14B | 10–12 GB | RTX 3090 / 4090 | Balanced coding assistant |
| qwen2.5-coder:32b | 32B | 24 GB+ | RTX 4090 / A5000 | High-fidelity reasoning |
| deepseek-coder-v2 | Varies | 24 GB+ | RTX 4090 / A5000 | Complex logic generation |
| phi4 | 14B | CPU/RAM | Modern x86 CPU | Low-resource environments |
| phi4-mini | 3.8B | ~3 GB RAM | Modern x86 CPU | Lightweight tasks |
Note: Quantization levels significantly impact memory usage. The values above assume standard quantization. Always verify VRAM availability before loading large models.
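On NVIDIA hardware, available VRAM can be checked directly before pulling a model. A minimal sketch:
# Report GPU name, total memory, and memory currently in use (NVIDIA only)
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv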
Pull and verify the model:
# Pull the recommended coding model
ollama pull qwen2.5-coder:14b
# Verify the model is available
ollama list | grep qwen2.5-coder
Step 3: VS Code Integration
The integration relies on the Chat: Manage Language Models command. This step registers the local endpoint with the Copilot extension.
- Open the Command Palette (Ctrl+Shift+P or Cmd+Shift+P).
- Execute Chat: Manage Language Models.
- Select Add Models and choose Ollama.
- VS Code auto-detects http://localhost:11434. If using a remote endpoint, input the secure URL (e.g., https://ollama.internal.domain).
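Before adding the endpoint, it is worth confirming that the Ollama API is reachable from the workstation running VS Code. A minimal check, assuming the default local endpoint:
# List models exposed by the Ollama API; a JSON "models" array confirms the endpoint is reachable
curl -s http://localhost:11434/api/tags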
Critical Configuration: Ensure the model entry indicates Tools support. Without tool capability, Copilot cannot execute agent actions or interact with the workspace effectively.
Step 4: Session Targeting
Once configured, you must explicitly direct Copilot to use the local model.
- Open Copilot Chat.
- Set the session target to Local.
- Use the model picker to select your Ollama model (e.g., qwen2.5-coder:14b).
Prompts sent in this session will route to the local API. Verify traffic by monitoring Ollama logs:
journalctl -u ollama -f | grep "POST /v1/chat/completions"
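Inference can also be confirmed outside the IDE by exercising Ollama's OpenAI-compatible endpoint directly. A minimal sketch, assuming qwen2.5-coder:14b has been pulled:
# Send a minimal chat completion request to Ollama's OpenAI-compatible API
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:14b",
    "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}]
  }'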
Step 5: Secure Remote Access (Optional)
For homelab or team setups, expose Ollama via a reverse proxy with authentication. Direct exposure of the API is a security risk.
Nginx Configuration with Basic Auth:
upstream ollama_backend {
server 127.0.0.1:11434;
}
server {
listen 443 ssl;
server_name ai-coding.internal;
ssl_certificate /etc/ssl/certs/ai-coding.crt;
ssl_certificate_key /etc/ssl/private/ai-coding.key;
# Enforce authentication
auth_basic "Restricted AI Access";
auth_basic_user_file /etc/nginx/.htpasswd;
location / {
proxy_pass http://ollama_backend;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Timeout adjustments for long inference
proxy_read_timeout 300s;
proxy_send_timeout 300s;
}
}
Generate the password file:
sudo htpasswd -c /etc/nginx/.htpasswd aiuser
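To validate the proxy end to end, send a request through it using the credentials created above. The hostname and username follow the example values; adjust them to your environment, and add --cacert (or -k for testing only) if the internal certificate is not trusted by the client.
# Query the model list through the authenticated reverse proxy
curl -s -u aiuser:YOUR_PASSWORD https://ai-coding.internal/api/tags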
Pitfall Guide
1. Unbound API Exposure
Explanation: Configuring Ollama to bind to 0.0.0.0 without firewall rules or authentication exposes the inference engine to the network. Attackers can exploit this to run unauthorized inference or access internal data.
Fix: Always use firewall ACLs, Tailscale/WireGuard tunnels, or reverse proxy authentication. Never expose port 11434 to the public internet.
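As an illustration, a host firewall rule restricting port 11434 to a trusted subnet might look like the following (ufw shown; the subnet is a placeholder for your internal network):
# Allow Ollama access only from the trusted LAN, deny everything else
sudo ufw allow from 10.0.0.0/24 to any port 11434 proto tcp
sudo ufw deny 11434/tcp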
2. VRAM Miscalculation
Explanation: Attempting to load a 32B model on a GPU with 12GB VRAM results in out-of-memory errors or severe performance degradation due to CPU offloading.
Fix: Consult VRAM requirements before pulling models. Use nvidia-smi to monitor memory usage. If VRAM is insufficient, switch to a smaller model such as phi4-mini or to a more aggressively quantized variant.
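Ollama can also report how a loaded model is split between GPU and CPU memory, which makes offloading visible; the exact column layout may vary by version.
# Show loaded models and their GPU/CPU memory placement
ollama ps
# Watch GPU memory in real time while sending a prompt (NVIDIA only)
watch -n 1 nvidia-smi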
3. Inline Autocomplete Expectation
Explanation: BYOK with local models does not fully support inline autocomplete. Developers may expect ghost text suggestions similar to cloud Copilot.
Fix: Manage expectations. Local BYOK is optimized for Chat and Agent mode. Inline autocomplete remains a cloud-centric feature in the current implementation.
4. Session Target Drift
Explanation: VS Code may default to cloud models even after local configuration, causing confusion about where prompts are routed.
Fix: Always verify the session target is set to Local in the Copilot Chat interface. Check Ollama logs to confirm requests are arriving locally.
5. Tool Calling Failure
Explanation: The model must support tool calling for agent mode to function. If the model entry in VS Code does not show "Tools", agent capabilities will be disabled.
Fix: Ensure the selected model supports tool calling. Verify the model configuration in Chat: Manage Language Models indicates tool support.
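Recent Ollama releases list a model's capabilities in ollama show, which makes for a reasonable sanity check before wiring the model into agent mode; the output format may vary by version.
# Inspect model metadata; look for "tools" under the Capabilities section
ollama show qwen2.5-coder:14b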
6. Quantization Ignorance
Explanation: Different quantization levels (e.g., Q4_K_M vs Q8_0) affect model size and quality. Users may pull a model variant that does not match their hardware constraints.
Fix: Specify the quantization level when pulling models if needed. Lower-bit quantization reduces VRAM usage but may degrade code generation quality.
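Specific quantization variants are selected via the model tag. The tag below is illustrative only; consult the model's tag list on the Ollama registry for the variants actually published.
# Pull an explicitly quantized variant (tag name illustrative; verify against the registry)
ollama pull qwen2.5-coder:14b-instruct-q4_K_M
# Compare on-disk sizes of pulled variants
ollama list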
7. Nginx Timeout Errors
Explanation: Local inference can be slower than cloud APIs, especially on CPU or lower-end GPUs. Default Nginx timeouts may cut off long-running requests.
Fix: Increase proxy_read_timeout and proxy_send_timeout in the Nginx configuration to accommodate inference latency.
Production Bundle
Action Checklist
- Verify VS Code (1.113+), Copilot Extension (0.41.0+), and Ollama (0.18.3+) versions.
- Install Ollama and enable the systemd service.
- Assess hardware VRAM and select an appropriate model variant.
- Pull the model and verify availability via ollama list.
- Configure VS Code via Chat: Manage Language Models to add Ollama.
- Confirm the model entry shows "Tools" support for agent capabilities.
- Set Copilot Chat session target to Local and select the model.
- Test inference with a code explanation prompt and monitor Ollama logs.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Enterprise Compliance | Local BYOK | Zero data egress; full control over inference. | High hardware CapEx. |
| Solo Developer, Low Spec | Cloud Copilot | No hardware requirements; full feature set. | Monthly subscription. |
| Hybrid Workflow | BYOK for Chat, Cloud for Autocomplete | Balances privacy for sensitive tasks with productivity features. | Mixed costs. |
| Team Homelab | Remote Ollama + Nginx Proxy | Centralized inference; shared GPU resources. | Hardware + Network setup. |
Configuration Template
Ollama Systemd Override for Remote Binding:
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"
Secure Nginx Reverse Proxy:
upstream ollama_backend {
server 127.0.0.1:11434;
}
server {
listen 443 ssl;
server_name ai-coding.internal;
ssl_certificate /etc/ssl/certs/ai-coding.crt;
ssl_certificate_key /etc/ssl/private/ai-coding.key;
auth_basic "Restricted AI Access";
auth_basic_user_file /etc/nginx/.htpasswd;
location / {
proxy_pass http://ollama_backend;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 300s;
proxy_send_timeout 300s;
}
}
Quick Start Guide
- Install Ollama: Run curl -fsSL https://ollama.com/install.sh | sh and start the service.
- Pull Model: Execute ollama pull qwen2.5-coder:14b to load the coding model.
- Configure VS Code: Use Chat: Manage Language Models to add Ollama and verify tool support.
- Activate Local Session: Set Copilot Chat target to Local and select the model.
- Verify: Send a test prompt and confirm responses originate from the local model via Ollama logs.
