Node.js vs Bun vs Go - A Multi-Layer HTTP Benchmark

Runtime Overhead vs. Network Reality: A Layered Performance Analysis of Node.js, Bun, and Go

Current Situation Analysis

Engineering teams frequently face pressure to migrate runtimes based on viral benchmarks claiming "blazing fast" performance. The industry pain point is not a lack of data, but a prevalence of misleading data. Most public benchmarks test idealized conditions that do not reflect production constraints, leading to architectural decisions based on event loop speed rather than system throughput.

This problem is overlooked because developers often conflate micro-benchmark efficiency with macro-system performance. A runtime that serves static JSON in microseconds may still underperform in production due to garbage collection pauses, inter-process communication (IPC) overhead, or network stack inefficiencies. Furthermore, comparisons often suffer from implementation bias, where one language is optimized while others use stock patterns.

Data from layered testing reveals that runtime differences are highly context-dependent. In CPU-bound scenarios, Bun demonstrated a ~55% throughput advantage over Node.js in multi-process configurations on localhost (~170,000 vs. ~110,000 RPS) and roughly 1.8x the throughput of Go on single-core cloud instances (~25,400 vs. ~13,900 RPS). However, these margins shrank when network I/O became the constraint. Over a WiFi network, every runtime dropped to between 7,900 and 12,800 RPS, and Bun and Go landed within about 3% of each other, which suggests that hardware limitations can mask most of a runtime's efficiency advantage. Additionally, Node.js exhibited significant tail latency spikes (up to 2,000 ms) and request timeouts under load, suggesting that raw throughput numbers can hide stability risks if garbage collection is not tuned.

WOW Moment: Key Findings

The most critical insight from this analysis is the convergence point. While runtimes diverge significantly in isolated environments, the gaps between them narrow once external bottlenecks are introduced. The following table contrasts performance across the tested constraint layers, highlighting where runtime choice actually matters.

Environment	Constraint Layer	Node.js	Bun	Go	Primary Bottleneck
Localhost	Event Loop / Syscall	~110,000 RPS	~170,000 RPS	~115,000 RPS	Runtime Overhead
Cloud 1-Core	Single Core CPU	~11,700 RPS	~25,400 RPS	~13,900 RPS	Runtime Overhead
Cloud 4-Core	Multi-Core CPU	~31,000 RPS	~53,400 RPS	~37,600 RPS	Runtime Overhead
LAN / WiFi	Network I/O	~7,900 RPS	~12,500 RPS	~12,800 RPS	Network Hardware

Why this matters: The data shows that Bun offers the highest raw efficiency, with a CPU cost per request of 0.0072%, compared to Go at 0.0090% and Node.js at 0.0129%. However, the LAN test demonstrates that if your infrastructure relies on constrained network paths, optimizing the runtime yields diminishing returns. The decision to migrate should be driven by CPU-bound workloads or latency-sensitive single-core operations, not generic throughput assumptions.
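To make the narrowing concrete, the following sketch computes Bun's throughput ratio against the other two runtimes for each row of the table. The RPS values are copied from the table; the `Layer` type and the `lead` helper are hypothetical names introduced only for this illustration.

```typescript
// RPS figures copied from the table above; names are illustrative only.
type Layer = { name: string; node: number; bun: number; go: number };

const layers: Layer[] = [
  { name: "Localhost", node: 110_000, bun: 170_000, go: 115_000 },
  { name: "Cloud 1-Core", node: 11_700, bun: 25_400, go: 13_900 },
  { name: "Cloud 4-Core", node: 31_000, bun: 53_400, go: 37_600 },
  { name: "LAN / WiFi", node: 7_900, bun: 12_500, go: 12_800 },
];

// Ratio of Bun's throughput to another runtime's, rounded to two decimals.
function lead(bun: number, other: number): number {
  return Math.round((bun / other) * 100) / 100;
}

for (const l of layers) {
  console.log(
    `${l.name}: Bun/Node = ${lead(l.bun, l.node)}, Bun/Go = ${lead(l.bun, l.go)}`
  );
}
```

On these figures, Bun's lead over Go falls from roughly 1.4x to 1.8x in the CPU-bound rows to about parity (0.98) on the LAN row, while its lead over Node.js stays near 1.6x. The convergence is therefore clearest between the two fastest runtimes.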

Core Solution

To make informed runtime decisions, engineers must adopt a layered evaluation strategy that isolates event loop performance from network and I/O constraints. This approach prevents the "localhost trap" and ensures comparisons are architecturally equivalent.

1. Architectural Equivalence in Implementation

A common error in benchmarking is comparing a runtime that relies on kernel-level socket sharing against a multi-process architecture that routes connections through a primary process over IPC. A true comparison requires matching the concurrency model.

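Bun: Kernel-Level Socket Sharing

Bun's reusePort option sets SO_REUSEPORT on the listening socket, which lets several Bun processes bind the same port while the kernel (on Linux) distributes incoming connections among them. This avoids routing connections through a primary process over IPC, but each Bun process still runs its JavaScript on a single thread, so using every core means starting one process per core.

// bun-server.ts
// Uses kernel-level load balancing via reusePort (SO_REUSEPORT).
// Start one copy of this file per core; each copy binds port 3000.
const server = Bun.serve({
  port: 3000,
  reusePort: true, // Lets multiple processes listen on the same port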
  fetch(req: Request) {
    return new Response(
      JSON.stringify({ message: "Hello from Bun" }),
      { headers: { "Content-Type": "application/json" } }
    );
  },
});

console.log(`Listening on ${server.hostname}:${server.port}`);

Node.js: Multi-Process Clustering

Node.js requires explicit clustering to utilize multiple cores. With the default round-robin scheduling, the primary process accepts connections and passes them to workers over IPC, which can impact latency under bursty traffic.

// node-server.ts
// Explicit multi-process architecture.
// Primary process manages workers; IPC overhead exists.
import cluster from 'cluster';
import http from 'http';
import os from 'os';

const numCPUs = os.cpus().length;

if (cluster.isPrimary) {
  console.log(`Primary ${process.pid} is running`);
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
  
  cluster.on('exit', (worker, code, signal) => {
    console.log(`Worker ${worker.process.pid} died. Restarting.`);
    cluster.fork();
  });
} else {
  // Workers share the server port
  const server = http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ message: "Hello from Node" }));
  });
  
  server.listen(3000, () => {
    console.log(`Worker ${process.pid} started`);
  });
}

Go: M:N Scheduler

Go's runtime automatically multiplexes goroutines onto OS threads, utilizing all available cores without manual clustering or IPC overhead.

// go-server.go
// M:N scheduler handles concurrency automatically.
// No explicit clustering required.
package main

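import (
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/json", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		// Serialize per request so the work matches the JavaScript handlers.
		if err := json.NewEncoder(w).Encode(map[string]string{
			"message": "Hello from Go",
		}); err != nil {
			log.Printf("encode response: %v", err)
		}
	})

	// ListenAndServe only returns on failure (e.g. the port is already in use).
	log.Fatal(http.ListenAndServe(":3000", nil))
}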

2. Normalizing Payload Complexity

Benchmarks often favor Go by using pre-rendered byte slices, while JavaScript runtimes serialize JSON dynamically. This creates an unfair advantage. To ensure accuracy, all runtimes should perform equivalent work.

Unfair Go Pattern: w.Write([]byte(`{"message":"Hello"}`))
Fair Go Pattern: json.NewEncoder(w).Encode(payload)
Impact: Pre-rendering can artificially inflate Go throughput by 15-20%. Always verify that the code under test performs the same serialization and validation logic.
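The same rule can be applied in the other direction: if a pre-rendered variant is benchmarked at all, the JavaScript runtimes should get the same optimization. The sketch below shows both variants side by side in TypeScript, using only the standard `TextEncoder` and `JSON.stringify`. The function names are hypothetical and are not taken from the benchmark's code.

```typescript
// Hypothetical payload builders, for illustration only.
const payload = { message: "Hello" };
const encoder = new TextEncoder();

// Pre-rendered variant: serialize once at startup, reuse the bytes per request.
const prerendered: Uint8Array = encoder.encode(JSON.stringify(payload));
function prerenderedBody(): Uint8Array {
  return prerendered;
}

// Dynamic variant: serialize on every request, as a typical handler would.
function dynamicBody(): Uint8Array {
  return encoder.encode(JSON.stringify(payload));
}

// Both variants should emit identical bytes, so only the per-request work differs.
function sameBytes(a: Uint8Array, b: Uint8Array): boolean {
  return a.length === b.length && a.every((byte, i) => byte === b[i]);
}

console.log(sameBytes(prerenderedBody(), dynamicBody())); // true
```

Checking byte equality before a run is a cheap way to confirm that two implementations differ only in when serialization happens, not in what they send.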

3. Isolation and Resource Pinning

Cloud benchmarks must eliminate CPU migration and cache invalidation. Using CPU quotas (--cpus) allows the container to float across physical cores, introducing noise. Use CPU pinning (--cpuset-cpus) for deterministic results.

# Correct isolation for cloud benchmarks
docker run --rm --cpuset-cpus="0-3" -m="512m" -p 3000:3000 my-runtime-image

Pitfall Guide

1. The Loopback Illusion

Explanation: Testing over localhost measures memory bus speed and loopback interface efficiency, not real-world network performance. Results here can be 10x higher than network-constrained tests.
Fix: Always include a network-constrained phase using a physical NIC or datacenter network to validate results.

2. The Pre-rendered JSON Trap

Explanation: Using static byte arrays in Go while JavaScript runtimes serialize objects creates an implementation bias. This favors Go but does not reflect real application logic.
Fix: Ensure all runtimes perform dynamic serialization or apply the same optimization to all languages.

3. reusePort vs. Process Clustering

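Explanation: Bun's reusePort lets the kernel distribute connections directly to each listening process, while Node's cluster module by default has the primary process accept connections and hand them to workers over IPC. Bun's results may therefore appear superior because of the connection-distribution architecture, not just runtime speed.
Fix: Disclose the concurrency model and process count. For strict equivalence, run the same number of processes per runtime or compare single-process modes.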

4. GC Blind Spots and Tail Latency

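Explanation: High average throughput can mask garbage collection pauses. Node.js showed max latencies of 2,000 ms and request timeouts, which is consistent with GC pressure under load. Note that 2,000 ms equals the 2 s wrk timeout used in the configuration template, so the figure may be a ceiling imposed by the load generator rather than a measured worst case.
Fix: Monitor p99/p99.9 latency and GC metrics. Tune Node flags like --max-old-space-size and --optimize-for-size before drawing conclusions.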
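To illustrate why averages hide stalls, here is a minimal sketch of a nearest-rank percentile over latency samples. The sample data is hypothetical (98 fast requests plus 2 that stall for 2,000 ms) and the `percentile` helper is an illustrative name, not part of any benchmarking tool.

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Hypothetical run: 98 requests at 2 ms and 2 requests that hit a 2,000 ms stall.
const latencies: number[] = [...new Array<number>(98).fill(2), 2000, 2000];
const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length;

console.log({
  mean,
  p50: percentile(latencies, 50),
  p99: percentile(latencies, 99),
});
```

In this made-up sample the median stays at 2 ms while p99 jumps to 2,000 ms, which is the pattern to look for when a runtime reports healthy throughput alongside timeouts.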

5. CPU Percentage Fallacy

Explanation: High CPU usage (e.g., Node at 400%) is often misinterpreted as inefficiency. In multi-process setups, this indicates all workers are saturated, which is desirable under load.
Fix: Calculate CPU cost per request (CPU% / RPS) to measure true efficiency. Bun achieved 0.0072% per request vs. Node's 0.0129%.
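The metric is a single division, sketched below. The Node.js inputs are the 400% CPU figure from this section and the ~31,000 RPS from the Cloud 4-Core row. Pairing those two numbers is an assumption on my part, made because it reproduces the quoted 0.0129%. The function name is hypothetical.

```typescript
// CPU cost per request: total CPU percentage divided by requests per second.
function cpuCostPerRequest(cpuPercent: number, rps: number): number {
  if (rps <= 0) throw new Error("rps must be positive");
  return cpuPercent / rps;
}

// Assumed pairing: Node.js at 400% CPU serving ~31,000 RPS (Cloud 4-Core row).
const nodeCost = cpuCostPerRequest(400, 31_000);
console.log(nodeCost.toFixed(4)); // "0.0129"
```

The CPU percentages for Bun and Go are not stated in the text, so they are left out here. The same function applies once those measurements are available.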

6. Single-Run Statistical Error

Explanation: Running a benchmark once produces noise, not signal. Variance in system load, network jitter, and scheduler behavior can skew results.
Fix: Execute 5+ runs per configuration. Report median, standard deviation, and confidence intervals. Discard outliers.
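A minimal sketch of the summary statistics follows, assuming five RPS readings per configuration. The readings are hypothetical and only illustrate the arithmetic; the standard deviation uses the sample (n - 1) form, which is the usual choice for a handful of runs.

```typescript
// Median of a list of per-run measurements.
function median(xs: number[]): number {
  if (xs.length === 0) throw new Error("no runs");
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Sample standard deviation (divides by n - 1).
function sampleStdDev(xs: number[]): number {
  if (xs.length < 2) throw new Error("need at least two runs");
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance =
    xs.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (xs.length - 1);
  return Math.sqrt(variance);
}

// Hypothetical RPS readings from five runs of one configuration.
const runs = [31_200, 30_800, 31_000, 29_500, 31_500];
console.log(`median=${median(runs)} stddev=${sampleStdDev(runs).toFixed(0)}`);
```

Reporting the median alongside the spread makes a single slow run (29,500 here) visible without letting it dominate the headline number.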

7. Network Hardware Masking

Explanation: Using consumer-grade hardware (e.g., WiFi 3 adapters) can bottleneck all runtimes to the same low throughput, hiding runtime differences.
Fix: Use wired connections or high-speed datacenter networks (10 Gbps+) to ensure the runtime, not the NIC, is the limiting factor.

Production Bundle

Action Checklist

Pin CPU Resources: Use --cpuset-cpus in Docker to prevent CPU migration and cache invalidation during tests.
Normalize Code Complexity: Ensure all runtimes perform equivalent serialization, validation, and business logic. Avoid pre-rendered payloads.
Match Concurrency Models: Compare architectural equivalents (e.g., multi-process vs. multi-process) or explicitly disclose differences.
Tune Garbage Collection: For Node.js, apply V8 tuning flags (--max-old-space-size, --optimize-for-size) and monitor GC pauses.
Analyze Tail Latency: Report p99 and p99.9 latency, not just averages. High max latency indicates stability risks.
Calculate CPU Efficiency: Compute CPU cost per request to compare true resource utilization across runtimes.
Test Network Constraints: Include a phase with physical network I/O to validate performance under realistic conditions.
Run Statistical Samples: Execute 5+ runs per configuration and report median values with standard deviation.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
High Volatility, Low Logic	Bun	Lowest per-request overhead (0.0072% CPU). Fastest single-core throughput. Largely Node-compatible, though compatibility should be verified per dependency.	Low
Max Efficiency, Low Latency	Go	Predictable performance. No IPC overhead. Excellent CPU efficiency (0.0090% CPU). Compiled binary.	Medium
Ecosystem, Stability, Tooling	Node.js	Mature ecosystem. APM integration. Debugging tools. Higher CPU cost (0.0129%) but proven reliability.	Low
Network-Bound Workloads	Any	Network hardware dominates performance. Runtime choice has minimal impact on throughput.	N/A
Single-Core Cloud Instances	Bun	~1.8x the throughput of Go and ~2.2x that of Node.js on a single core (~25,400 vs. ~13,900 and ~11,700 RPS). Maximizes limited resources.	Low
Multi-Core Cloud Instances	Bun	Highest 4-core cloud throughput (~53,400 RPS vs. ~37,600 for Go and ~31,000 for Node.js). Efficient kernel-level socket distribution.	Low

Configuration Template

Use this Docker Compose setup for fair, isolated benchmarking with statistical rigor.

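# docker-compose.benchmark.yml
# Targets the Compose Specification (Docker Compose v2), where the top-level
# `version` key is obsolete and cpuset/mem_limit are accepted at the service level.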
services:
  target:
    image: ${RUNTIME_IMAGE}
    cpuset: "0-3"
    mem_limit: 512m
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=production
      - GOMAXPROCS=4

  loadgen:
    image: alpine:latest
    depends_on:
      - target
    command: >
      sh -c "
      apk add --no-cache wrk &&
      wrk -t2 -c200 -d30s --latency --timeout 2s http://target:3000/json
      "

Quick Start Guide

Isolate the Environment: Deploy target and load generator in the same datacenter or use pinned Docker containers. Ensure network hardware is not the bottleneck.
Normalize the Code: Implement equivalent logic across runtimes. Avoid pre-rendered payloads. Match concurrency models or document differences.
Execute Benchmarks: Run wrk with --latency for percentile output and an explicit --timeout. Perform 5+ runs per configuration.
Analyze Results: Calculate median RPS, p99 latency, and CPU cost per request. Identify bottlenecks (runtime vs. network).
Make Decision: Use the Decision Matrix to select the runtime based on workload characteristics, cost, and ecosystem requirements.
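The analysis step can be partly automated by extracting the two headline numbers from each wrk report. The sketch below assumes the report contains a "Requests/sec:" line and, because --latency was passed, a "99%" line in the latency distribution. It handles only the us, ms, and s suffixes, and the sample report uses made-up numbers. `toMillis` and `parseWrk` are hypothetical helper names.

```typescript
// Convert a wrk latency token such as "250.00us", "5.80ms" or "1.20s" to milliseconds.
function toMillis(token: string): number {
  const m = /^(\d+(?:\.\d+)?)(us|ms|s)$/.exec(token);
  if (!m) throw new Error(`unrecognized latency token: ${token}`);
  const value = Number(m[1]);
  switch (m[2]) {
    case "us":
      return value / 1000;
    case "ms":
      return value;
    default:
      return value * 1000; // "s"
  }
}

// Pull requests per second and p99 latency out of a wrk --latency report.
function parseWrk(output: string): { rps: number; p99Ms: number } {
  const rpsMatch = /Requests\/sec:\s+([\d.]+)/.exec(output);
  const p99Match = /^\s*99%\s+(\S+)/m.exec(output);
  if (!rpsMatch || !p99Match) throw new Error("unexpected wrk output");
  return { rps: Number(rpsMatch[1]), p99Ms: toMillis(p99Match[1]) };
}

// Hypothetical excerpt of a wrk report; the numbers are invented.
const sample = [
  "  Latency Distribution",
  "     50%    1.20ms",
  "     99%    5.80ms",
  "  930000 requests in 30.00s, 150.00MB read",
  "Requests/sec:  31000.50",
].join("\n");

console.log(parseWrk(sample));
```

Collecting one such object per run gives the inputs needed for the median RPS and p99 comparisons described in the steps above.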
