Weekend Experiment: Free Qwen as a Personal API. Here Is What Actually Happened.
Current Situation Analysis
The rising cost of commercial LLM APIs has created a structural bottleneck for developers building personal tools, internal utilities, or early-stage prototypes. Token-based pricing models force engineers to either absorb unpredictable monthly bills or artificially throttle feature sets. Meanwhile, cloud notebook platforms like Kaggle offer substantial free compute resources, but they operate under strict outbound-only networking policies. This creates a fundamental mismatch: you have inference capacity, but no way to route external requests to it.
The core misunderstanding lies in treating cloud notebooks as traditional backend servers. They are ephemeral, stateless environments designed for batch processing, not persistent API endpoints. Attempting to expose a local HTTP server via tunneling services violates platform terms and introduces instability. The overlooked reality is that compute nodes don't need to accept inbound connections; they only need to maintain a persistent outbound channel to a stateful relay. By inverting the request flow, developers can transform free cloud GPUs into private inference endpoints without violating platform constraints.
Empirical data supports this approach. Kaggle's free tier allocates 30 GPU hours weekly across two NVIDIA T4 GPUs (15GB VRAM each). Mid-tier commercial APIs typically charge $0.50–$2.00 per million input tokens, with rate limits that throttle concurrent requests. The free relay architecture eliminates token costs entirely but introduces latency variance, session expiration constraints, and manual restart requirements. For personal workflows, internal tooling, or privacy-sensitive data processing, the trade-off heavily favors the zero-cost model.
WOW Moment: Key Findings
The architectural inversion reveals a clear divergence in operational characteristics compared to traditional API consumption. The following comparison isolates the practical implications of routing inference through a free cloud notebook versus commercial providers or self-hosted infrastructure.
| Approach | Monthly Cost | First-Token Latency | Data Sovereignty | Scalability | Operational Overhead |
|---|---|---|---|---|---|
| Commercial API (e.g., OpenAI, Anthropic) | $50–$500+ | 200–800ms | Low (data logged) | High (auto-scaling) | Low (managed) |
| Self-Hosted VPS (A100/H100) | $200–$1,200+ | 100–400ms | High | Medium (manual scaling) | High (infra management) |
| Kaggle + CF DO Relay | $0 | 2,000–15,000ms | High (local compute) | Low (30hr/week limit) | Medium (session monitoring) |
This finding matters because it decouples inference capability from subscription economics. Developers can validate AI-driven features, run private document analysis, or build codebase RAG pipelines without exposing sensitive data to third-party providers. The latency penalty is acceptable for asynchronous or batch-oriented workflows, and the cost elimination enables rapid iteration that would otherwise be financially prohibitive.
Core Solution
The architecture relies on a persistent WebSocket relay hosted on Cloudflare Durable Objects (DO). The DO acts as a stateful bridge: it receives HTTP requests from clients, forwards payloads to the Kaggle notebook via WebSocket, caches responses, and returns results. The notebook operates as an outbound compute node, maintaining a single persistent connection to the DO.
Step 1: Kaggle Compute Node Setup
The notebook must handle model initialization, event loop patching, and WebSocket client logic. Quantization is mandatory to fit within T4 VRAM constraints.
import torch
import nest_asyncio
import asyncio
import websockets
import json
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
# Patch Jupyter's event loop to allow nested async operations
nest_asyncio.apply()
MODEL_REPO = "Qwen/Qwen3-8B"
WS_ENDPOINT = "wss://your-worker.workers.dev/relay"
# Configure 4-bit quantization to reduce VRAM footprint from ~16GB to ~5GB
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO,
    quantization_config=quant_config,
    device_map="auto",
    low_cpu_mem_usage=True
)
async def inference_loop():
    async with websockets.connect(WS_ENDPOINT) as socket:
        print("Compute node connected to relay")
        async for raw_message in socket:
            payload = json.loads(raw_message)
            request_id = payload.get("id")
            prompt = payload.get("prompt")
            # Apply chat template for instruct behavior
            messages = [{"role": "user", "content": prompt}]
            prompt_text = tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
            inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
            outputs = model.generate(
                **inputs, max_new_tokens=512, do_sample=True, temperature=0.7
            )
            # Decode only the newly generated tokens, not the echoed prompt
            new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
            response_text = tokenizer.decode(new_tokens, skip_special_tokens=True)
            await socket.send(json.dumps({
                "id": request_id,
                "status": "completed",
                "result": response_text
            }))
asyncio.create_task(inference_loop())
Architecture Rationale:
device_map="auto"distributes model layers across both T4 GPUs, preventing single-card OOM errors.- 4-bit NF4 quantization reduces memory pressure while preserving instruction-following capability.
nest_asyncio.apply()resolves Jupyter's pre-existing event loop conflict, enabling WebSocket client operations without blocking the notebook kernel.- Chat template injection replaces the need for a separate instruct-weight repository.
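Because the single `websockets.connect` call above exits whenever the relay drops the connection (for example during a Worker redeploy or a transient network error), it is worth wrapping the loop in a reconnect-with-backoff shell. Below is a minimal sketch, assuming the `inference_loop` and `WS_ENDPOINT` defined above; `resilient_loop` is an illustrative name and not part of the original notebook.
import asyncio
import websockets

async def resilient_loop():
    delay = 1
    while True:
        try:
            await inference_loop()  # reconnects from scratch on each iteration
            delay = 1               # reset backoff after a clean session
        except (websockets.ConnectionClosed, OSError) as exc:
            print(f"Relay connection lost ({exc}); retrying in {delay}s")
            await asyncio.sleep(delay)
            delay = min(delay * 2, 60)  # exponential backoff, capped at 60s

# Schedule this instead of inference_loop() directly
asyncio.create_task(resilient_loop())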
Step 2: Cloudflare Durable Object Relay
The DO manages WebSocket lifecycle, request routing, and response caching. Cloudflare's Durable Object Hibernation API is critical here, as standard in-memory listeners are destroyed during idle eviction.
import { DurableObject } from "cloudflare:workers";
interface RequestPayload {
  id: string;
  prompt: string;
  timestamp: number;
}

interface CachedResponse {
  result: string;
  expiresAt: number;
}
export class InferenceBridgeDO extends DurableObject {
  private computeSocket: WebSocket | null = null;
  private pendingRequests: Map<string, (response: string) => void> = new Map();

  async fetch(request: Request): Promise<Response> {
    if (request.headers.get("Upgrade")?.toLowerCase() === "websocket") {
      const [client, server] = Object.values(new WebSocketPair());
      this.computeSocket = server;
      // Hibernation API: accept via the DO context so webSocketMessage/Close/Error
      // keep firing even after the object is evicted from memory
      this.ctx.acceptWebSocket(server);
      return new Response(null, { status: 101, webSocket: client });
    }

    const body: RequestPayload = await request.json();
    const cacheKey = await this.hashRequest(body.prompt);

    // Check durable storage for cached results
    const cached = await this.ctx.storage.get<CachedResponse>(cacheKey);
    if (cached && Date.now() < cached.expiresAt) {
      return new Response(JSON.stringify({ result: cached.result, source: "cache" }), {
        headers: { "Content-Type": "application/json" }
      });
    }

    // Forward to compute node and await response
    const responsePromise = new Promise<string>((resolve) => {
      this.pendingRequests.set(body.id, resolve);
    });

    // Recover a hibernated socket if this instance was re-created after eviction
    const socket = this.computeSocket ?? this.ctx.getWebSockets()[0] ?? null;
    if (socket?.readyState === WebSocket.READY_STATE_OPEN) {
      socket.send(JSON.stringify(body));
    } else {
      return new Response("Compute node offline", { status: 503 });
    }

    const result = await responsePromise;

    // Cache result with 60s TTL
    await this.ctx.storage.put(cacheKey, {
      result,
      expiresAt: Date.now() + 60000
    });

    return new Response(JSON.stringify({ result, source: "live" }), {
      headers: { "Content-Type": "application/json" }
    });
  }
  // Hibernation API: survives DO memory eviction
  webSocketMessage(ws: WebSocket, message: string | ArrayBuffer): void {
    const payload = JSON.parse(message as string);
    const resolver = this.pendingRequests.get(payload.id);
    if (resolver) {
      resolver(payload.result);
      this.pendingRequests.delete(payload.id);
    }
  }

  webSocketClose(ws: WebSocket, code: number, reason: string, wasClean: boolean): void {
    this.computeSocket = null;
    console.warn(`Compute node disconnected: ${reason}`);
  }

  webSocketError(ws: WebSocket, error: unknown): void {
    this.computeSocket = null;
    console.error(`WebSocket error:`, error);
  }

  private async hashRequest(prompt: string): Promise<string> {
    const encoder = new TextEncoder();
    const data = encoder.encode(prompt);
    const hashBuffer = await crypto.subtle.digest("SHA-256", data);
    return Array.from(new Uint8Array(hashBuffer)).map(b => b.toString(16).padStart(2, "0")).join("");
  }
}
Architecture Rationale:
- The Hibernation API (`webSocketMessage`, `webSocketClose`, `webSocketError`) replaces traditional `addEventListener` calls. Cloudflare automatically invokes these methods after memory eviction, preserving message routing.
- `this.ctx.storage` persists cache entries and connection state across evictions, preventing health endpoint drift.
- SHA-256 request hashing enables deterministic caching. Identical prompts return cached results instantly, masking the 5–15 second inference latency.
- The pending request map bridges the asynchronous gap between HTTP client timeouts and notebook generation time.
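To exercise the relay from a client, a plain HTTP POST carrying the `id`/`prompt`/`timestamp` fields of `RequestPayload` is enough. The following minimal Python sketch assumes the `requests` package and uses a placeholder Worker URL; it shows the expected request shape and the `source` field that distinguishes cached from live responses.
import time
import uuid
import requests

RELAY_URL = "https://your-worker.workers.dev"  # placeholder Worker URL

def ask(prompt: str) -> dict:
    payload = {
        "id": str(uuid.uuid4()),        # matches RequestPayload.id
        "prompt": prompt,               # matches RequestPayload.prompt
        "timestamp": int(time.time()),  # matches RequestPayload.timestamp
    }
    # Generation can take tens of seconds, so use a generous timeout
    resp = requests.post(RELAY_URL, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()  # {"result": "...", "source": "live" | "cache"}

answer = ask("Summarize the benefits of 4-bit quantization.")
print(answer["source"], answer["result"][:200])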
Pitfall Guide
1. Incorrect Repository Tagging
Explanation: Developers often search for Qwen3-8B-Instruct on Hugging Face. The instruct variant is not a separate repository; it's activated via the chat template during inference.
Fix: Use Qwen/Qwen3-8B and apply apply_chat_template() with a user/system message structure before tokenization.
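A short sketch of this fix, with an illustrative system prompt (the Step 1 notebook code uses a user-only message; the structure below simply adds a system role):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# system/user structure activates instruct behavior without a separate repo
messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Explain NF4 quantization in two sentences."},
]
prompt_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt_text, return_tensors="pt")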
2. VRAM Fragmentation & OOM on Single T4
Explanation: Loading with device_map="cuda:0" pins the entire fp16 model (~16GB) onto one T4 GPU (15GB VRAM), triggering out-of-memory crashes.
Fix: Set device_map="auto" to enable tensor parallelism across both GPUs, or enforce 4-bit quantization via BitsAndBytesConfig to reduce footprint to ~5GB.
3. Jupyter Event Loop Collision
Explanation: Kaggle notebooks run an active asyncio event loop. Calling asyncio.run() or creating a new loop raises RuntimeError: This event loop is already running.
Fix: Import and apply nest_asyncio.apply() before initializing WebSocket clients or async inference loops.
4. Durable Object Memory Eviction
Explanation: Cloudflare DOs evict from memory after ~30 seconds of inactivity. In-memory WebSocket listeners and state variables reset, causing silent message drops and false health checks.
Fix: Migrate to the Hibernation API. Accept incoming sockets with this.ctx.acceptWebSocket(server) instead of server.accept(), and declare webSocketMessage, webSocketClose, and webSocketError as class methods. Persist timestamps and connection flags to this.ctx.storage.
5. Asynchronous Timeout Mismatch
Explanation: Standard HTTP clients time out after 10–15 seconds. LLM generation for medium-length prompts frequently exceeds 30–80 seconds, resulting in client-side ETIMEDOUT errors.
Fix: Implement deterministic request hashing on the relay. Store responses in durable storage with a short TTL. Subsequent identical requests bypass inference and return cached results instantly.
6. Ephemeral Session Limits
Explanation: Kaggle enforces a 12-hour maximum runtime per session. Notebooks terminate automatically, severing the WebSocket connection and requiring manual restart.
Fix: Implement a lightweight watchdog script that monitors notebook uptime and triggers a session restart via Kaggle API, or design the client to handle 503 Service Unavailable with exponential backoff.
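A sketch of the second option on the client side, reusing the placeholder `RELAY_URL` and payload shape from the earlier client example (the helper name and retry limits are illustrative):
import time
import requests

def post_with_backoff(payload: dict, max_retries: int = 5) -> dict:
    delay = 5
    for _ in range(max_retries):
        resp = requests.post(RELAY_URL, json=payload, timeout=120)
        if resp.status_code != 503:            # 503 = compute node offline
            resp.raise_for_status()
            return resp.json()
        print(f"Compute node offline; retrying in {delay}s")
        time.sleep(delay)
        delay = min(delay * 2, 300)            # exponential backoff, capped at 5 minutes
    raise RuntimeError("Compute node stayed offline across all retries")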
7. WebSocket Frame Fragmentation
Explanation: Large JSON payloads or streaming tokens can split across multiple WebSocket frames, causing deserialization errors on the receiver.
Fix: Enforce JSON-RPC framing or chunked message protocols. Validate message.length and buffer incomplete frames before parsing. Alternatively, limit max_new_tokens and stream via multiple small payloads.
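One lightweight chunking approach, sketched below for the notebook side under the assumption that both ends agree on the framing convention (the seq/total/data field names are illustrative and not part of the code above):
import json

CHUNK_SIZE = 32_000  # stay well below typical message size limits

async def send_chunked(socket, request_id: str, text: str):
    # Split the result into sequence-numbered chunks the relay can reassemble
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)] or [""]
    for seq, chunk in enumerate(chunks):
        await socket.send(json.dumps({
            "id": request_id,
            "seq": seq,            # position of this chunk
            "total": len(chunks),  # receiver parses only after all chunks arrive
            "data": chunk,
        }))
The receiver would concatenate the data fields for a given id once seq values 0 through total - 1 have arrived, and only then resolve the pending request.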
Production Bundle
Action Checklist
- Verify model repository: Use `Qwen/Qwen3-8B` and apply chat templates instead of hunting for non-existent instruct repos.
- Configure quantization: Apply `BitsAndBytesConfig(load_in_4bit=True)` and `device_map="auto"` to prevent T4 VRAM exhaustion.
- Patch event loop: Run `nest_asyncio.apply()` immediately after imports in the Kaggle notebook.
- Implement DO Hibernation API: Accept sockets with `this.ctx.acceptWebSocket()` and replace `addEventListener` with `webSocketMessage`, `webSocketClose`, and `webSocketError` class methods.
- Persist state to durable storage: Move connection timestamps and cache entries to `this.ctx.storage` (the in-flight pending-request map stays in memory, since an awaited fetch keeps the DO alive).
- Add deterministic request caching: Hash prompts with SHA-256, store results with TTL, and return cached responses to mask inference latency.
- Monitor GPU allocation: Track Kaggle's 30-hour weekly limit and implement session watchdogs to handle 12-hour expiration.
- Secure the relay: Add API key validation or JWT verification to the Cloudflare Worker before forwarding requests to the DO.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Personal AI assistant / internal tooling | Kaggle + CF DO Relay | Zero cost, acceptable latency, full data control | $0 |
| Early-stage MVP validation | Kaggle + CF DO Relay | Rapid prototyping without API budget commitment | $0 |
| Production SaaS with SLA requirements | Commercial API or dedicated VPS | Predictable latency, auto-scaling, guaranteed uptime | $50–$1,200+/mo |
| Privacy-critical document analysis | Kaggle + CF DO Relay | Data never leaves your compute environment | $0 |
| High-concurrency public endpoint | Commercial API or self-hosted cluster | WebSocket relay cannot handle >50 concurrent requests | $100–$500+/mo |
Configuration Template
Cloudflare Worker Entry Point (index.ts)
import { InferenceBridgeDO } from "./durable-object";

// Re-export so wrangler can bind the Durable Object class
export { InferenceBridgeDO };

interface Env {
  INFERENCE_BRIDGE: DurableObjectNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const id = env.INFERENCE_BRIDGE.idFromName("primary-relay");
    const stub = env.INFERENCE_BRIDGE.get(id);
    return stub.fetch(request);
  }
};
Kaggle Notebook Initialization Cell
import nest_asyncio
nest_asyncio.apply()
# Verify GPU allocation
import torch
print(f"Available GPUs: {torch.cuda.device_count()}")
print(f"VRAM per GPU: {torch.cuda.get_device_properties(0).total_mem / 1e9:.2f} GB")
Quick Start Guide
- Deploy the Durable Object: Create a new Cloudflare Worker, paste the `InferenceBridgeDO` class, and bind it in `wrangler.toml` under `durable_objects.bindings`.
- Configure the Relay Endpoint: Note the Worker URL (e.g., `https://your-worker.workers.dev/relay`). This becomes the WebSocket target for the notebook.
- Launch Kaggle Notebook: Create a new notebook, enable GPU acceleration, and paste the compute node code. Replace `WS_ENDPOINT` with your Worker URL.
- Execute & Verify: Run the notebook cell to establish the WebSocket connection. Send a test `curl` request to the Worker's HTTP endpoint. The first response will take 10–30 seconds; subsequent identical requests will return instantly from cache.