Building a Docker-like Container From Scratch: What Actually Happens When You Run `docker run`
Demystifying Container Runtimes: Orchestrating Linux Kernel Primitives for Isolated Execution
Current Situation Analysis
Modern container runtimes have successfully abstracted operating system isolation into declarative configuration files and single-line CLI commands. This abstraction accelerates development velocity but introduces a critical blind spot: engineers lose visibility into the actual isolation boundaries their workloads operate within. When a containerized service experiences network partitioning, unexpected OOM termination, or privilege escalation attempts, debugging often devolves into log scanning and runtime restarts rather than targeted kernel-level inspection.
The root cause of this knowledge gap is the runtime wrapper layer. Tools like Docker, containerd, and Podman orchestrate isolation by translating high-level directives into Linux kernel system calls. While convenient, this layer obscures the fact that containerization rests on a small set of long-stable kernel primitives: namespaces (complete for container use since user namespaces landed in Linux 3.8), cgroups (unified as cgroups v2 in Linux 4.5), pivot_root, and capability/seccomp filtering. These mechanisms have not fundamentally changed in years, yet most engineering teams interact with them exclusively through runtime-specific abstractions.
In production environments, a large share of container-related incidents stems from misconfigured isolation boundaries rather than application bugs. Network namespace routing failures, cgroup quota miscalculations, and OverlayFS copy-on-write bottlenecks are routinely misdiagnosed because teams lack a mental model of the underlying kernel behavior. Understanding the primitive orchestration layer transforms container troubleshooting from reactive guesswork into deterministic system engineering.
WOW Moment: Key Findings
Direct kernel primitive orchestration reveals isolation mechanics that runtime wrappers deliberately hide. The following comparison demonstrates the operational divergence between relying on a managed runtime versus implementing isolation at the kernel level.
| Approach | Debugging Visibility | Resource Overhead | Security Control | Boot Latency |
|---|---|---|---|---|
| Runtime Wrapper (Docker/Podman) | Low (event logs, shim proxies) | High (storage driver, network proxy, init process) | Preset profiles, limited customization | 100-300ms |
| Direct Kernel Primitives | High (cgroup fs, nsfd, /proc, sysfs) | Near-zero (process tree only) | Granular (capability bitmask, BPF filters) | <10ms |
This finding matters because it shifts the engineering paradigm from consumption to composition. When you understand how namespaces restrict process visibility, how cgroups v2 enforce hard resource ceilings, and how pivot_root establishes filesystem boundaries, you gain the ability to:
- Diagnose isolation failures by inspecting kernel pseudo-filesystems directly
- Build custom runtimes for constrained environments (edge devices, IoT, embedded systems)
- Optimize resource allocation by eliminating runtime shim overhead
- Implement security policies that align precisely with application requirements rather than vendor defaults
The transition from black-box consumption to primitive orchestration is not about replacing Docker in production. It is about acquiring the diagnostic literacy required to operate containers reliably at scale.
Core Solution
Building a minimal isolated execution environment requires orchestrating four kernel mechanisms in a specific sequence. The architecture follows a strict dependency chain: namespaces establish visibility boundaries, pivot_root defines filesystem roots, cgroups v2 enforce resource ceilings, and capability/seccomp filters restrict syscall surfaces.
Step 1: Namespace Isolation
Namespaces restrict a process tree's view of global system resources. The critical namespaces for containerization are PID, Mount, Network, UTS, and IPC. User namespaces are omitted here to maintain simplicity, though they are required for rootless operation.
The unshare command creates new namespace contexts before executing a target binary. The --fork flag is required when creating a PID namespace: unshare itself does not enter the new PID namespace, only its children do, so unshare must fork and run the target binary in the child while the parent remains in the host namespace. The --mount-proc flag remounts /proc inside the new PID namespace, preventing tools like ps from reading the host process table.
# Define isolation scope
NS_FLAGS="--pid --mount --net --uts --ipc --fork --mount-proc"
TARGET_BINARY="/bin/sh"
# Create isolated process tree
unshare $NS_FLAGS $TARGET_BINARY
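You can verify the isolation without any tooling by reading the kernel's namespace symlinks; comparing the output inside and outside the unshare'd shell shows the boundary directly. A minimal inspection sketch (inode numbers vary per system):

```shell
# Each entry under /proc/<pid>/ns is a symlink whose target encodes the
# namespace type and inode number; two processes share a namespace if and
# only if the inode numbers match.
readlink /proc/self/ns/pid /proc/self/ns/net /proc/self/ns/uts
# Example output (values are system-specific):
#   pid:[4026531836]
#   net:[4026531992]
#   uts:[4026531838]
```

Running the same command inside the isolated shell yields different inode numbers for every namespace that was unshared.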
Step 2: Filesystem Root Preparation & pivot_root
chroot changes the apparent root directory but does not prevent escape via relative path traversal (..). pivot_root atomically swaps the root filesystem and moves the old root to a specified mount point, eliminating escape vectors.
The preparation sequence requires a minimal root filesystem containing essential binaries and libraries. After mounting the new root, pivot_root is executed, followed by unmounting the old root to prevent resource leaks.
ROOTFS_PATH="/opt/minimal-rootfs"
# put_old must be a directory underneath the new root
OLD_ROOT="$ROOTFS_PATH/oldroot"
# Prepare mount points (run inside a new mount namespace)
mkdir -p "$ROOTFS_PATH" "$OLD_ROOT"
# Bind mount new root onto itself to ensure it is a mount point
mount --bind "$ROOTFS_PATH" "$ROOTFS_PATH"
# Execute atomic root swap
pivot_root "$ROOTFS_PATH" "$OLD_ROOT"
# Clean up old root reference (now visible at /oldroot inside the new root)
cd /
umount -l /oldroot
rmdir /oldroot
Step 3: Resource Constraints via cgroups v2
cgroups v2 uses a unified hierarchy at /sys/fs/cgroup. Resource limits are enforced by writing to control files within the cgroup directory. Memory limits use memory.max, while CPU limits use cpu.max with a quota/period format.
The quota/period model allocates CPU time in fixed windows. A value of 50000 100000 grants 50ms of CPU time per 100ms window, effectively capping usage at 0.5 CPUs. The kernel enforces hard limits; exceeding memory allocation triggers OOM termination without warning.
CGROUP_NAME="isolated-workload"
CGROUP_PATH="/sys/fs/cgroup/$CGROUP_NAME"
# Create cgroup directory
mkdir -p "$CGROUP_PATH"
# Enable subtree control for nested cgroups
echo "+memory +cpu" > /sys/fs/cgroup/cgroup.subtree_control
# Apply resource ceilings
echo "52428800" > "$CGROUP_PATH/memory.max"
echo "50000 100000" > "$CGROUP_PATH/cpu.max"
# Attach target process (replace with actual PID)
TARGET_PID=$$
echo "$TARGET_PID" > "$CGROUP_PATH/cgroup.procs"
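To avoid quota/period mistakes, the arithmetic can be wrapped in a small helper. This is a sketch: `cpu_max_value` is a hypothetical name, and it assumes the default 100000 microsecond period unless one is supplied.

```shell
# Convert a desired CPU count into a cpu.max value string.
# quota = desired_cpus * period, both expressed in microseconds.
cpu_max_value() {
    desired_cpus="$1"
    period="${2:-100000}"   # default: 100ms window
    awk -v c="$desired_cpus" -v p="$period" 'BEGIN { printf "%d %d\n", c * p, p }'
}

cpu_max_value 0.5          # prints: 50000 100000  (half a CPU)
cpu_max_value 2 100000     # prints: 200000 100000 (two full CPUs)
```

The output can be written directly: `cpu_max_value 0.5 > "$CGROUP_PATH/cpu.max"`.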
Step 4: Security Hardening
Linux capabilities decompose root privileges into discrete permissions. Dropping unnecessary capabilities reduces the attack surface. Seccomp filters compile BPF programs that intercept syscalls, allowing or denying them based on predefined rules.
The default Docker seccomp profile blocks approximately 44 syscalls, including keyctl, ptrace, and kexec_load. Custom profiles can be generated using seccomp-bpf tools or runtime-specific generators.
# Drop dangerous capabilities while retaining essentials
CAP_DROP_LIST="CAP_SYS_ADMIN CAP_NET_ADMIN CAP_SYS_PTRACE CAP_MKNOD"
# Apply capability restrictions via prctl or runtime wrapper
# (Implementation depends on target language/runtime)
# Verify seccomp status
grep Seccomp "/proc/$TARGET_PID/status"
Architecture Decisions & Rationale
- Why `pivot_root` over `chroot`: `chroot` modifies the path resolution root but leaves the original filesystem accessible via relative traversal. `pivot_root` performs an atomic mount namespace swap, guaranteeing filesystem isolation.
- Why cgroups v2: The unified hierarchy eliminates the complexity of v1's separate hierarchies (cpu, memory, blkio). v2 enforces subtree control, preventing resource leaks from nested cgroups and providing consistent accounting.
- Why explicit namespace flags: Relying on runtime defaults obscures isolation boundaries. Explicit flags ensure predictable behavior across kernel versions and enable fine-grained control over which resources are shared or isolated.
Pitfall Guide
1. Forgetting /proc Remount in PID Namespaces
Explanation: Creating a PID namespace without remounting /proc causes process inspection tools to read the host's process table, breaking isolation visibility.
Fix: Always use --mount-proc with unshare or manually mount proc inside the new namespace: mount -t proc proc /proc.
2. Misinterpreting cpu.max Quota/Period Math
Explanation: cpu.max uses a quota/period format, not a direct CPU count. Setting 100000 100000 grants 1 full CPU, not 0.5. Misconfiguration leads to unexpected throttling or resource starvation.
Fix: Calculate quota as desired_cpus * period. For 0.5 CPUs with a 100ms period: echo "50000 100000" > cpu.max.
3. Treating OverlayFS as a Performance Solution
Explanation: OverlayFS implements copy-on-write semantics. Write-heavy workloads (databases, log aggregators) suffer severe performance degradation due to metadata overhead and data duplication.
Fix: Use bind mounts or dedicated block storage for I/O-intensive workloads. Reserve OverlayFS for read-heavy application layers.
4. Assuming chroot Provides Security Isolation
Explanation: chroot only changes path resolution. Processes with CAP_SYS_CHROOT or sufficient privileges can escape by calling chroot() into a subdirectory and then walking back up with relative .. traversal, or by exploiting mount propagation.
Fix: Always use pivot_root or unshare --mount for filesystem isolation. Combine with capability dropping to prevent escape attempts.
5. Over-Pruning Capabilities Without Fallback
Explanation: Dropping capabilities like CAP_NET_BIND_SERVICE or CAP_SETUID breaks applications that require specific privileges. Silent failures occur when syscalls return EPERM.
Fix: Start with a permissive baseline, monitor dmesg for capability denials, and iteratively restrict. Use capsh --print to verify effective sets.
6. Ignoring cgroup v2 Delegation Limits
Explanation: cgroups v2 requires explicit cgroup.subtree_control configuration. Without enabling +memory +cpu, child cgroups cannot enforce limits, causing resource constraints to silently fail.
Fix: Always write +memory +cpu +pids to /sys/fs/cgroup/cgroup.subtree_control before creating nested cgroups.
7. Network Namespace Routing Blindness
Explanation: Isolating a process in a network namespace removes all host routing tables and interfaces. The container loses external connectivity unless explicitly wired.
Fix: Create a veth pair, assign one end to the host bridge and the other to the container namespace. Configure NAT or routing rules to enable external communication.
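The fix above can be sketched as a function. All names here are illustrative (the `setup_container_net` helper, the 10.200.1.0/24 subnet, the veth interface names), it must run as root, and it uses a named namespace via `ip netns` for clarity; to wire an unshare'd process instead, move the peer with `ip link set veth-ctr netns <PID>`.

```shell
# Wire a network namespace to the host via a veth pair plus NAT (requires root).
setup_container_net() {
    netns="$1"                                  # e.g. "isolated-net"
    ip netns add "$netns"
    # Create the pair: veth-host stays on the host, veth-ctr moves inside
    ip link add veth-host type veth peer name veth-ctr
    ip link set veth-ctr netns "$netns"
    # Address both ends and bring them up
    ip addr add 10.200.1.1/24 dev veth-host
    ip link set veth-host up
    ip netns exec "$netns" ip addr add 10.200.1.2/24 dev veth-ctr
    ip netns exec "$netns" ip link set veth-ctr up
    ip netns exec "$netns" ip link set lo up
    ip netns exec "$netns" ip route add default via 10.200.1.1
    # Masquerade outbound traffic so the namespace can reach external hosts
    iptables -t nat -A POSTROUTING -s 10.200.1.0/24 -j MASQUERADE
}
```

Invoke as `setup_container_net isolated-net`; `ip netns del isolated-net` tears down the namespace and destroys its end of the veth pair. IP forwarding must also be enabled on the host (`sysctl net.ipv4.ip_forward=1`) for traffic to transit.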
Production Bundle
Action Checklist
- Verify the kernel supports cgroups v2 (`cat /sys/fs/cgroup/cgroup.controllers`)
- Enable subtree control before creating nested resource groups
- Remount `/proc` and `/sys` inside new mount namespaces
- Use `pivot_root` instead of `chroot` for filesystem isolation
- Calculate `cpu.max` quota/period explicitly; avoid runtime defaults
- Audit capability requirements using `capsh` and `strace` before dropping
- Configure `veth` pairs and NAT rules for network namespace connectivity
- Monitor cgroup statistics via `/sys/fs/cgroup/<group>/memory.current` and `cpu.stat`
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Local Development & Testing | Runtime Wrapper (Docker/Podman) | Fast iteration, ecosystem tooling, standardized images | Low (developer time) |
| Edge/IoT Deployment | Direct Kernel Primitives | Minimal footprint, deterministic boot, no shim overhead | Medium (engineering time) |
| Multi-Tenant SaaS Platform | Runtime Wrapper + Custom Seccomp/Capability Profiles | Security compliance, auditability, operational maturity | High (infrastructure + compliance) |
| High-Frequency Trading / Low Latency | Direct Kernel Primitives + CPU Pinning | Eliminate runtime jitter, guarantee CPU allocation, bypass storage drivers | High (hardware + tuning) |
Configuration Template
#!/usr/bin/env bash
set -euo pipefail
# Configuration
WORKLOAD_NAME="isolated-service"
ROOTFS_DIR="/var/lib/minimal-rootfs"
CGROUP_BASE="/sys/fs/cgroup"
CGROUP_PATH="${CGROUP_BASE}/${WORKLOAD_NAME}"
MEMORY_LIMIT="104857600" # 100MB
CPU_QUOTA="75000" # 75ms per 100ms window
CPU_PERIOD="100000"
# 1. Prepare cgroup hierarchy
mkdir -p "${CGROUP_PATH}"
echo "+memory +cpu +pids" > "${CGROUP_BASE}/cgroup.subtree_control"
echo "${MEMORY_LIMIT}" > "${CGROUP_PATH}/memory.max"
echo "${CPU_QUOTA} ${CPU_PERIOD}" > "${CGROUP_PATH}/cpu.max"
# 2. Prepare root filesystem
mkdir -p "${ROOTFS_DIR}/oldroot"
mount --bind "${ROOTFS_DIR}" "${ROOTFS_DIR}"
# 3. Execute isolated process
# Join the cgroup first: writing a PID to cgroup.procs moves only that
# process, not children it has already forked, so attach before unshare
echo $$ > "${CGROUP_PATH}/cgroup.procs"
unshare --pid --mount --net --uts --ipc --fork --mount-proc \
/bin/bash -c "
pivot_root '${ROOTFS_DIR}' '${ROOTFS_DIR}/oldroot'
cd /
# After pivot_root, the old root is mounted at /oldroot in the new tree
umount -l /oldroot
rmdir /oldroot
exec /bin/sh
" &
ISOLATED_PID=$!
# 4. Verification
echo "Workload ${WORKLOAD_NAME} running as PID ${ISOLATED_PID}"
echo "Memory limit: ${MEMORY_LIMIT} bytes"
echo "CPU quota: ${CPU_QUOTA}/${CPU_PERIOD} (microseconds)"
Quick Start Guide
- Verify Kernel Support: Run `cat /sys/fs/cgroup/cgroup.controllers`. Ensure `memory cpu pids` are listed. If missing, boot with `systemd.unified_cgroup_hierarchy=1`.
- Prepare Minimal Rootfs: Extract a base distribution (Alpine/Debian) to `/var/lib/minimal-rootfs` using `debootstrap` or tar extraction. Ensure `/bin/sh` and essential libraries exist.
- Execute Orchestration Script: Run the configuration template as root. The script creates the cgroup, prepares the rootfs, isolates the process, and attaches resource limits.
- Validate Isolation: Enter the namespace using `nsenter -t <PID> -p -m -n -u -i /bin/sh`. Run `ps aux` to verify PID isolation, `cat /sys/fs/cgroup/<group>/memory.current` to monitor usage, and `ip addr` to confirm network namespace separation.
