Weekend Experiment: Free Qwen as a Personal API. Here Is What Actually Happened.
Current Situation Analysis
The rising cost of commercial LLM APIs has created a structural bottleneck for developers building personal tools, internal utilities, or early-stage prototypes. Token-based pricing models force engineers to either absorb unpredictable monthly bills or artificially throttle feature sets. Meanwhile, cloud notebook platforms like Kaggle offer substantial free compute resources, but they operate under strict outbound-only networking policies. This creates a fundamental mismatch: you have inference capacity, but no way to route external requests to it.
The core misunderstanding lies in treating cloud notebooks as traditional backend servers. They are ephemeral, stateless environments designed for batch processing, not persistent API endpoints. Attempting to expose a local HTTP server via tunneling services violates platform terms and introduces instability. The overlooked reality is that compute nodes don't need to accept inbound connections; they only need to maintain a persistent outbound channel to a stateful relay. By inverting the request flow, developers can transform free cloud GPUs into private inference endpoints without violating platform constraints.
Empirical data supports this approach. Kaggle's free tier allocates 30 GPU hours weekly across two NVIDIA T4 GPUs (15GB VRAM each). Mid-tier commercial APIs typically charge $0.50–$2.00 per million input tokens, with rate limits that throttle concurrent requests. The free relay architecture eliminates token costs entirely but introduces latency variance, session expiration constraints, and manual restart requirements. For personal workflows, internal tooling, or privacy-sensitive data processing, the trade-off heavily favors the zero-cost model.
WOW Moment: Key Findings
The architectural inversion reveals a clear divergence in operational characteristics compared to traditional API consumption. The following comparison isolates the practical implications of routing inference through a free cloud notebook versus commercial providers or self-hosted infrastructure.
| Approach | Monthly Cost | First-Token Latency | Data Sovereignty | Scalability | Operational Overhead |
|---|---|---|---|---|---|
| Commercial API (e.g., OpenAI, Anthropic) | $50–$500+ | 200–800ms | Low (data logged) | High (auto-scaling) | Low (managed) |
| Self-Hosted VPS (A100/H100) | $200–$1,200+ | 100–400ms | High | Medium (manual scaling) | High (infra management) |
| Kaggle + CF DO Relay | $0 | 2,000–15,000ms | High (local compute) | Low (30hr/week limit) | Medium (session monitoring) |
This finding matters because it decouples inference capability from subscription economics. Developers can validate AI-driven features, run private document analysis, or build codebase RAG pipelines without exposing sensitive data to third-party providers. The latency penalty is acceptable for asynchronous or batch-oriented workflows, and the cost elimination enables rapid iteration that would otherwise be financially prohibitive.
Core Solution
The architecture relies on a persistent WebSocket relay hosted on Cloudflare Durable Objects (DO). The DO acts as a stateful bridge: it receives HTTP requests from clients, forwards payloads to the Kaggle notebook via WebSocket, caches responses, and returns results. The notebook operates as an outbound compute node, maintaining a single persistent connection to the DO.
Step 1: Kaggle Compute Node Setup
The notebook must handle model initialization, event loop patching, and WebSocket client logic. Quantization is mandatory to fit within T4 VRAM constraints.
import torch
import nest_asyncio
import asyncio
import websockets
import json
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
# Patch Jupyter's event loop to allow nested async operations
nest_asyncio.apply()
MODEL_REPO = "Qwen/Qwen3-8B"
WS_ENDPOINT = "wss://your-worker.workers.dev/relay"
# Configure 4-bit quantization to reduce VRAM footprint from ~16GB to ~5GB
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO,
    quantization_config=quant_config,
    device_map="auto",
    low_cpu_mem_usage=True
)
async def inference_loop():
    async with websockets.connect(WS_ENDPOINT) as socket:
        print("Compute node connected to relay")
        async for raw_message in socket:
            payload = json.loads(raw_message)
            request_id = payload.get("id")
            prompt = payload.get("prompt")
            # Apply chat template for instruct behavior
            messages = [{"role": "user", "content": prompt}]
            prompt_text = tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
            inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
            outputs = model.generate(
                **inputs, max_new_tokens=512, do_sample=True, temperature=0.7
            )
            # Decode only the newly generated tokens, not the echoed prompt
            new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
            response_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
            await socket.send(json.dumps({
                "id": request_id,
                "status": "completed",
                "result": response_text
            }))
asyncio.create_task(inference_loop())
Architecture Rationale:
device_map="auto"distributes model layers across both T4 GPUs, preventing single-card OOM errors.- 4-bit NF4 quantization reduces memory pressure while preserving instruction-following capability.
nest_asyncio.apply()resolves Jupyter's pre-existing event loop conflict, enabling WebSocket client operations without blocking the notebook kernel.- Chat template injection replaces the need for a separate instruct-weight repository.
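Because the single `websockets.connect` call above exits whenever the relay drops the connection (for example during a Worker redeploy or a transient network error), it is worth wrapping the loop in a reconnect-with-backoff shell. Below is a minimal sketch, assuming the `inference_loop` and `WS_ENDPOINT` defined above; `resilient_loop` is an illustrative name and not part of the original notebook.
import asyncio
import websockets

async def resilient_loop():
    delay = 1
    while True:
        try:
            await inference_loop()  # reconnects from scratch on each iteration
            delay = 1               # reset backoff after a clean session
        except (websockets.ConnectionClosed, OSError) as exc:
            print(f"Relay connection lost ({exc}); retrying in {delay}s")
            await asyncio.sleep(delay)
            delay = min(delay * 2, 60)  # exponential backoff, capped at 60s

# Schedule this instead of inference_loop() directly
asyncio.create_task(resilient_loop())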
Step 2: Cloudflare Durable Object Relay
The DO manages WebSocket lifecycle, request routing, and response caching. Cloudflare's Durable Object Hibernation API is critical here, as standard in-memory listeners are destroyed during idle eviction.
import { DurableObject } from "cloudflare:workers";
interface RequestPayload {
  id: string;
  prompt: string;
  timestamp: number;
}

interface CachedResponse {
  result: string;
  expiresAt: number;
}
export class InferenceBridgeDO extends DurableObject {
  private computeSocket: WebSocket | null = null;
  private pendingRequests: Map<string, (response: string) => void> = new Map();

  async fetch(request: Request): Promise<Response> {
    if (request.headers.get("Upgrade")?.toLowerCase() === "websocket") {
      const [client, server] = Object.values(new WebSocketPair());
      this.computeSocket = server;
      // Hibernation API: accept via the DO context so webSocketMessage/Close/Error
      // keep firing even after the object is evicted from memory
      this.ctx.acceptWebSocket(server);
      return new Response(null, { status: 101, webSocket: client });
    }

    const body: RequestPayload = await request.json();
    const cacheKey = await this.hashRequest(body.prompt);

    // Check durable storage for cached results
    const cached = await this.ctx.storage.get<CachedResponse>(cacheKey);
    if (cached && Date.now() < cached.expiresAt) {
      return new Response(JSON.stringify({ result: cached.result, source: "cache" }), {
        headers: { "Content-Type": "application/json" }
      });
    }

    // Forward to compute node and await response
    const responsePromise = new Promise<string>((resolve) => {
      this.pendingRequests.set(body.id, resolve);
    });

    // Recover a hibernated socket if this instance was re-created after eviction
    const socket = this.computeSocket ?? this.ctx.getWebSockets()[0] ?? null;
    if (socket?.readyState === WebSocket.READY_STATE_OPEN) {
      socket.send(JSON.stringify(body));
    } else {
      return new Response("Compute node offline", { status: 503 });
    }

    const result = await responsePromise;

    // Cache result with 60s TTL
    await this.ctx.storage.put(cacheKey, {
      result,
      expiresAt: Date.now() + 60000
    });

    return new Response(JSON.stringify({ result, source: "live" }), {
      headers: { "Content-Type": "application/json" }
    });
  }
  // Hibernation API: survives DO memory eviction
  webSocketMessage(ws: WebSocket, message: string | ArrayBuffer): void {
    const payload = JSON.parse(message as string);
    const resolver = this.pendingRequests.get(payload.id);
    if (resolver) {
      resolver(payload.result);
      this.pendingRequests.delete(payload.id);
    }
  }

  webSocketClose(ws: WebSocket, code: number, reason: string, wasClean: boolean): void {
    this.computeSocket = null;
    console.warn(`Compute node disconnected: ${reason}`);
  }

  webSocketError(ws: WebSocket, error: unknown): void {
    this.computeSocket = null;
    console.error(`WebSocket error:`, error);
  }

  private async hashRequest(prompt: string): Promise<string> {
    const encoder = new TextEncoder();
    const data = encoder.encode(prompt);
    const hashBuffer = await crypto.subtle.digest("SHA-256", data);
    return Array.from(new Uint8Array(hashBuffer)).map(b => b.toString(16).padStart(2, "0")).join("");
  }
}
Architecture Rationale:
- The Hibernation API (`webSocketMessage`, `webSocketClose`, `webSocketError`) replaces traditional `addEventListener` calls. Cloudflare automatically invokes these methods after memory eviction, preserving message routing.
- `this.ctx.storage` persists cache entries and connection state across evictions, preventing health endpoint drift.
- SHA-256 request hashing enables deterministic caching. Identical prompts return cached results instantly, masking the 5–15 second inference latency.
- The pending request map bridges the asynchronous gap between HTTP client timeouts and notebook generation time.
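To exercise the relay from a client, a plain HTTP POST carrying the `id`/`prompt`/`timestamp` fields of `RequestPayload` is enough. The following minimal Python sketch assumes the `requests` package and uses a placeholder Worker URL; it shows the expected request shape and the `source` field that distinguishes cached from live responses.
import time
import uuid
import requests

RELAY_URL = "https://your-worker.workers.dev"  # placeholder Worker URL

def ask(prompt: str) -> dict:
    payload = {
        "id": str(uuid.uuid4()),        # matches RequestPayload.id
        "prompt": prompt,               # matches RequestPayload.prompt
        "timestamp": int(time.time()),  # matches RequestPayload.timestamp
    }
    # Generation can take tens of seconds, so use a generous timeout
    resp = requests.post(RELAY_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()  # {"result": "...", "source": "live" | "cache"}

answer = ask("Summarize the benefits of 4-bit quantization.")
print(answer["source"], answer["result"][:200])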
Pitfall Guide
1. Incorrect Repository Tagging
Explanation: Developers often search for Qwen3-8B-Instruct on Hugging Face. The instruct variant is not a separate repository; it's activated via the chat template during inference.
Fix: Use Qwen/Qwen3-8B and apply apply_chat_template() with a user/system message structure before tokenization.
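A short sketch of this fix, with an illustrative system prompt (the Step 1 notebook code uses a user-only message; the structure below simply adds a system role):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# system/user structure activates instruct behavior without a separate repo
messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Explain NF4 quantization in two sentences."},
]
prompt_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt_text, return_tensors="pt")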
2. VRAM Fragmentation & OOM on Single T4
Explanation: Loading with device_map="cuda:0" pins the entire fp16 model (~16GB) onto one T4 GPU (15GB VRAM), triggering out-of-memory crashes.
Fix: Set device_map="auto" to enable tensor parallelism across both GPUs, or enforce 4-bit quantization via BitsAndBytesConfig to reduce footprint to ~5GB.
3. Jupyter Event Loop Collision
Explanation: Kaggle notebooks run an active asyncio event loop. Calling asyncio.run() or creating a new loop raises RuntimeError: This event loop is already running.
Fix: Import and apply nest_asyncio.apply() before initializing WebSocket clients or async inference loops.
4. Durable Object Memory Eviction
Explanation: Cloudflare DOs evict from memory after ~30 seconds of inactivity. In-memory WebSocket listeners and state variables reset, causing silent message drops and false health checks.
Fix: Migrate to the Hibernation API. Accept incoming sockets with this.ctx.acceptWebSocket(server) instead of server.accept(), and declare webSocketMessage, webSocketClose, and webSocketError as class methods. Persist timestamps and connection flags to this.ctx.storage.
5. Asynchronous Timeout Mismatch
Explanation: Standard HTTP clients time out after 10–15 seconds. LLM generation for medium-length prompts frequently exceeds 30–80 seconds, resulting in client-side ETIMEDOUT errors.
Fix: Implement deterministic request hashing on the relay. Store responses in durable storage with a short TTL. Subsequent identical requests bypass inference and return cached results instantly.
6. Ephemeral Session Limits
Explanation: Kaggle enforces a 12-hour maximum runtime per session. Notebooks terminate automatically, severing the WebSocket connection and requiring manual restart.
Fix: Implement a lightweight watchdog script that monitors notebook uptime and triggers a session restart via Kaggle API, or design the client to handle 503 Service Unavailable with exponential backoff.
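A sketch of the second option on the client side, reusing the placeholder `RELAY_URL` and payload shape from the earlier client example (the helper name and retry limits are illustrative):
import time
import requests

def post_with_backoff(payload: dict, max_retries: int = 5) -> dict:
    delay = 5
    for _ in range(max_retries):
        resp = requests.post(RELAY_URL, json=payload, timeout=120)
        if resp.status_code != 503:            # 503 = compute node offline
            resp.raise_for_status()
            return resp.json()
        print(f"Compute node offline; retrying in {delay}s")
        time.sleep(delay)
        delay = min(delay * 2, 300)            # exponential backoff, capped at 5 minutes
    raise RuntimeError("Compute node stayed offline across all retries")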
7. WebSocket Frame Fragmentation
Explanation: Large JSON payloads or streaming tokens can split across multiple WebSocket frames, causing deserialization errors on the receiver.
Fix: Enforce JSON-RPC framing or chunked message protocols. Validate message.length and buffer incomplete frames before parsing. Alternatively, limit max_new_tokens and stream via multiple small payloads.
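One lightweight chunking approach, sketched below for the notebook side under the assumption that both ends agree on the framing convention (the seq/total/data field names are illustrative and not part of the code above):
import json

CHUNK_SIZE = 32_000  # stay well below typical message size limits

async def send_chunked(socket, request_id: str, text: str):
    # Split the result into sequence-numbered chunks the relay can reassemble
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)] or [""]
    for seq, chunk in enumerate(chunks):
        await socket.send(json.dumps({
            "id": request_id,
            "seq": seq,            # position of this chunk
            "total": len(chunks),  # receiver parses only after all chunks arrive
            "data": chunk,
        }))
The receiver would concatenate the data fields for a given id once seq values 0 through total - 1 have arrived, and only then resolve the pending request.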
Production Bundle
Action Checklist
- Verify model repository: Use `Qwen/Qwen3-8B` and apply chat templates instead of hunting for non-existent instruct repos.
- Configure quantization: Apply `BitsAndBytesConfig(load_in_4bit=True)` and `device_map="auto"` to prevent T4 VRAM exhaustion.
- Patch event loop: Run `nest_asyncio.apply()` immediately after imports in the Kaggle notebook.
- Implement DO Hibernation API: Accept sockets with `this.ctx.acceptWebSocket()` and replace `addEventListener` with `webSocketMessage`, `webSocketClose`, and `webSocketError` class methods.
- Persist state to durable storage: Move connection timestamps and cache entries to `this.ctx.storage` (the in-flight pending-request map stays in memory, since an awaited fetch keeps the DO alive).
- Add deterministic request caching: Hash prompts with SHA-256, store results with TTL, and return cached responses to mask inference latency.
- Monitor GPU allocation: Track Kaggle's 30-hour weekly limit and implement session watchdogs to handle 12-hour expiration.
- Secure the relay: Add API key validation or JWT verification to the Cloudflare Worker before forwarding requests to the DO.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Personal AI assistant / internal tooling | Kaggle + CF DO Relay | Zero cost, acceptable latency, full data control | $0 |
| Early-stage MVP validation | Kaggle + CF DO Relay | Rapid prototyping without API budget commitment | $0 |
| Production SaaS with SLA requirements | Commercial API or dedicated VPS | Predictable latency, auto-scaling, guaranteed uptime | $50–$1,200+/mo |
| Privacy-critical document analysis | Kaggle + CF DO Relay | Data never leaves your compute environment | $0 |
| High-concurrency public endpoint | Commercial API or self-hosted cluster | WebSocket relay cannot handle >50 concurrent requests | $100–$500+/mo |
Configuration Template
Cloudflare Worker Entry Point (index.ts)
import { InferenceBridgeDO } from "./durable-object";

// Re-export so wrangler can bind the Durable Object class
export { InferenceBridgeDO };

interface Env {
  INFERENCE_BRIDGE: DurableObjectNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const id = env.INFERENCE_BRIDGE.idFromName("primary-relay");
    const stub = env.INFERENCE_BRIDGE.get(id);
    return stub.fetch(request);
  }
};
Kaggle Notebook Initialization Cell
import nest_asyncio
nest_asyncio.apply()
# Verify GPU allocation
import torch
print(f"Available GPUs: {torch.cuda.device_count()}")
print(f"VRAM per GPU: {torch.cuda.get_device_properties(0).total_mem / 1e9:.2f} GB")
Quick Start Guide
- Deploy the Durable Object: Create a new Cloudflare Worker, paste the `InferenceBridgeDO` class, and bind it in `wrangler.toml` under `durable_objects.bindings`.
- Configure the Relay Endpoint: Note the Worker URL (e.g., `https://your-worker.workers.dev/relay`). This becomes the WebSocket target for the notebook.
- Launch Kaggle Notebook: Create a new notebook, enable GPU acceleration, and paste the compute node code. Replace `WS_ENDPOINT` with your Worker URL.
- Execute & Verify: Run the notebook cell to establish the WebSocket connection. Send a test `curl` request to the Worker's HTTP endpoint. The first response will take 10–30 seconds; subsequent identical requests will return instantly from cache.