I cut my AWS bill by 93% by ditching Fargate for a single Lightsail VM
Current Situation Analysis
Pre-revenue and low-traffic applications frequently fall into the managed service cost trap. The original architecture deployed a production-grade, multi-AZ Fargate stack designed for Series-A SaaS reliability, but applied to a directory site with zero users. This created a severe mismatch between infrastructure overhead and actual workload demand.
The primary failure mode stems from infrastructure plumbing costs: NAT gateways, ALBs, VPC interface endpoints, and ElastiCache nodes incur hard monthly floors regardless of traffic volume. In the original setup, ~$87/mo (25% of the total bill) was spent purely on networking and orchestration layers that performed no application logic. Traditional auto-scaling and serverless paradigms fail to mitigate this because baseline provisioning, cross-AZ redundancy, and managed service minimums (e.g., Aurora Serverless v2 defaulting to 0.5 ACU) create an inescapable cost floor. Optimizing Fargate configurations (reducing tasks, enabling auto-pause, switching to Spot) only shaved costs to ~$140/mo because the underlying architectural primitives remain inherently expensive for single-tenant, low-throughput workloads.
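To make the floor concrete, here is a back-of-the-envelope sketch using public list-price ballparks. These component prices are illustrative assumptions, not the author's actual line items:

```shell
# Illustrative fixed monthly floors at ~zero traffic (list-price ballparks)
nat=33        # NAT gateway: ~$0.045/hr * 730 hrs, before data-processing fees
alb=17        # ALB: ~$0.0225/hr * 730 hrs, before LCU charges
endpoints=22  # three VPC interface endpoints: ~$0.01/hr each * 730 hrs
cache=12      # single ElastiCache cache.t4g.micro node
floor=$((nat + alb + endpoints + cache))
echo "Plumbing floor: \$${floor}/mo before serving a single request"
```

None of these components can scale to zero, which is why a plumbing figure in this ballpark persists no matter how little traffic the application serves.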
WOW Moment: Key Findings
Consolidating the entire stack onto a single Lightsail VM eliminated networking overhead, removed managed service minimums, and preserved the exact same Next.js + Postgres + Redis + BullMQ runtime. The migration demonstrated that vertical consolidation on a fixed-cost VM drastically outperforms horizontally distributed serverless components when traffic is unpredictable or pre-revenue.
| Approach | Monthly Cost | Infrastructure Plumbing Overhead | Setup Complexity | Time to Deploy | Scalability Headroom |
|---|---|---|---|---|---|
| Original Fargate Setup | $345 | ~25% ($87/mo) | High (CDK, multi-AZ, IAM, VPC, WAF) | Hours | Enterprise (auto-scale, multi-AZ) |
| Optimized Fargate (Phase 1) | $140 | ~20% ($28/mo) | Medium (CDK tweaks, Spot, auto-pause tuning) | Hours | Medium (constrained by hard floors) |
| Lightsail Monolith (Phase 2) | $12 | ~0% (consolidated) | Low (Docker Compose, single VM, Caddy) | Afternoon | Low-Medium (vertical scale only) |
Key Findings:
- Hard Cost Floors: NAT, ALB, ElastiCache, and VPC endpoints cannot be scaled to zero. They dictate the minimum viable spend for Fargate.
- Memory-Optimized Builds: Next.js 14 builds spike to 1.5–2GB RAM. A 2GB VM requires explicit swap configuration to prevent OOM kills during compilation.
- Zero-Traffic Efficiency: Lightsail’s fixed pricing model aligns perfectly with pre-revenue projects, delivering identical stack functionality at 93% lower cost.
Core Solution
The migration strategy replaced distributed managed services with a containerized monolith on a $12/mo Lightsail VM. Architecture decisions prioritized cost elimination over high availability, accepting single-point-of-failure trade-offs appropriate for a pre-revenue directory site.
1. Docker Compose Consolidation
All services (Postgres, Redis, Next.js web, BullMQ worker) run in isolated containers with explicit memory limits to prevent resource contention on the 2GB VM.
```yaml
services:
  postgres:
    image: postgres:16-alpine
    volumes: [./data/postgres:/var/lib/postgresql/data]
    # healthcheck added so the service_healthy conditions below can resolve
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U tmadmin -d toolmango"]
      interval: 10s
      timeout: 5s
      retries: 5
    deploy: { resources: { limits: { memory: 512M } } }
  redis:
    image: redis:7-alpine
    command: redis-server --appendonly yes --maxmemory 128mb --maxmemory-policy noeviction
    volumes: [./data/redis:/data]
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    deploy: { resources: { limits: { memory: 192M } } }
  web:
    image: tm-web:latest
    ports: ["127.0.0.1:3000:3000"]
    env_file: .env
    depends_on:
      postgres: { condition: service_healthy }
      redis: { condition: service_healthy }
    deploy: { resources: { limits: { memory: 768M } } }
  worker:
    image: tm-worker:latest
    env_file: .env
    depends_on:
      postgres: { condition: service_healthy }
      redis: { condition: service_healthy }
    deploy: { resources: { limits: { memory: 384M } } }
```
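A quick sanity check that these limits actually fit the instance. The 2048M figure assumes the advertised 2GB Lightsail plan; usable RAM is slightly lower after kernel reservations:

```shell
# Sum of container memory limits vs. the VM's nominal RAM
total=$((512 + 192 + 768 + 384))   # postgres + redis + web + worker, in MB
echo "Container ceiling: ${total}M of 2048M nominal"
echo "Headroom for OS + Docker daemon: $((2048 - total))M"
```

The limits are deliberately tight: they leave only a thin margin for the host, which is another reason the swap configuration described later is not optional.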
2. TLS Termination & Reverse Proxy
Caddy replaces the CloudFront/ALB/ACM chain. It automatically provisions and renews Let's Encrypt certificates (via the HTTP-01 or TLS-ALPN-01 challenge) when it starts serving the configured domains.
```
toolmango.com, www.toolmango.com {
    reverse_proxy 127.0.0.1:3000
    encode gzip zstd
    header {
        Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
        X-Content-Type-Options "nosniff"
    }
}
```
3. Data Migration from Private Aurora Subnet
Since Aurora Serverless v2 resided in a `PRIVATE_ISOLATED` subnet, a direct `pg_dump` from outside the VPC failed. A one-off Fargate task in the same VPC performed the dump and staged it to S3.
```shell
aws ecs run-task \
  --cluster tm-prod-compute \
  --task-definition tm-prod-pgdump \
  --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-...],securityGroups=[sg-...],assignPublicIp=DISABLED}'
```
The task definition executes the dump pipeline:
```shell
pg_dump --no-owner --no-acl --clean --if-exists -h "$DB_HOST" -U "$DB_USER" -d toolmango \
  | gzip > /tmp/dump.sql.gz \
  && aws s3 cp /tmp/dump.sql.gz s3://tm-prod-assets/migration/dump.sql.gz
```
On the Lightsail VM, the dump is retrieved via a presigned URL (bypassing IAM role limitations) and piped directly into the local Postgres container:

```shell
# Fetch via a presigned URL (generated with `aws s3 presign` from a machine with S3 access), then restore
curl -fSL -o /tmp/dump.sql.gz "<presigned-url>"
gunzip -c /tmp/dump.sql.gz | docker compose exec -T postgres psql -U tmadmin -d toolmango
```
4. Cross-Architecture Build & Memory Management
Fargate ran ARM64; Lightsail runs x86_64. Building directly on the target VM eliminated registry push/pull latency and any cross-architecture emulation. Next.js compilation required 2GB of swap to avoid OOM termination:
```shell
# Provision swap first: next build peaks at 1.5-2GB and gets OOM-killed on a 2GB VM without it
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile
echo "/swapfile none swap sw 0 0" | sudo tee -a /etc/fstab   # persist across reboots

# Build natively on the x86_64 target (no registry round-trip, no emulation overhead)
docker build --network=host -f Dockerfile.web -t tm-web:latest \
  --build-arg NEXT_PUBLIC_SITE_URL=https://toolmango.com \
  --build-arg NEXT_PUBLIC_PLAUSIBLE_DOMAIN=toolmango.com \
  .
```
Pitfall Guide
- Infrastructure Plumbing Blind Spot: NAT gateways, ALBs, VPC endpoints, and ElastiCache nodes create hard cost floors that do not scale with traffic. Assuming serverless auto-scaling will reduce bills to zero is mathematically incorrect for distributed architectures.
- Next.js Build Memory Spikes: Next.js 14 App Router builds frequently peak at 1.5–2GB RAM. Deploying to 2GB VMs without configuring swap triggers OOM kills during `next build`. Always provision swap equal to or greater than peak build memory.
- Private Subnet Data Extraction Failure: Aurora and other managed databases in `PRIVATE_ISOLATED` subnets reject external `pg_dump` connections. You must spin up a same-VPC jump host or ephemeral task to bridge the network boundary before exporting data.
- Managed Service Auto-Pause Misconfiguration: Aurora Serverless v2 defaults to `minCapacity: 0.5`, incurring continuous charges. Auto-pause only triggers when `minCapacity: 0` and `secondsUntilAutoPause` are explicitly configured. Default settings negate cost-saving intentions.
- Cross-Architecture Registry Friction: Building ARM64 images locally and pushing to an x86_64 VM introduces pull latency and potential emulation overhead. Building directly on the target architecture skips the registry round-trip and guarantees native binary compatibility.
- Redis Eviction Policy Misconfiguration: Cache-oriented eviction policies (e.g., `allkeys-lru`) can silently drop pending BullMQ jobs under memory pressure. Explicitly set `--maxmemory-policy noeviction` and pair it with a `--maxmemory` limit to fail fast rather than corrupt queue state.
- TLS Termination Over-Engineering: Traditional stacks chain CloudFront + ACM + ALB for HTTPS. For single-VM deployments, Caddy's automatic HTTP-01 challenge reduces certificate provisioning to a single configuration stanza, eliminating certificate rotation complexity and CDN egress costs.
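For the auto-pause pitfall, the fix looks roughly like the following scaling configuration. This is a sketch of the RDS `ServerlessV2ScalingConfiguration` shape: the `MaxCapacity` value and the 3600-second window are illustrative, and zero-capacity auto-pause requires an Aurora engine version that supports it.

```json
{
  "ServerlessV2ScalingConfiguration": {
    "MinCapacity": 0,
    "MaxCapacity": 4,
    "SecondsUntilAutoPause": 3600
  }
}
```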
Deliverables
- Blueprint: `Lightsail-Monolith-Architecture.pdf` — Network topology, container resource allocation matrix, backup strategy (local cron + S3 sync), and rollback procedures.
- Checklist: `Pre-Migration-Audit.md` — Validates cost baseline, identifies hard-floor services, confirms data volume size, verifies swap requirements, and documents DNS propagation steps.
- Configuration Templates:
  - `docker-compose.prod.yml` — Pre-tuned resource limits, health checks, and volume mounts for Postgres/Redis/Next.js/BullMQ.
  - `Caddyfile.template` — Production-ready reverse proxy with HSTS, compression, and security headers.
  - `swap-provision.sh` — Idempotent script for safe swapfile creation and fstab persistence.
  - `pgdump-ecs-task.json` — Ready-to-deploy ECS task definition for private-subnet data extraction.
