ush cache hit rates above 85%, drastically reducing costly API calls and latency.
- Hard memory limits (3GB/process) combined with staggered PM2 recycling prevent GC cascades and maintain stable latency.
- SQLite with Write-Ahead Logging (WAL) and batched writes reliably handles 50K events/day without external brokers.
- The operational sweet spot caps at 6 concurrent agents, 1000 messages/minute aggregate throughput, and 30-second maximum processing windows to respect webhook timeouts.
Core Solution
The architecture replaces elastic infrastructure primitives with deterministic Unix process management, lightweight persistent queues, and cost-aware model routing. Every component is explicitly bounded to prevent resource starvation.
Process isolation via systemd:
Each agent runs as a dedicated systemd service with hard CPU and memory quotas. When a process exceeds its memory limit, the kernel kills it and systemd restarts it (Restart=on-failure, after RestartSec), so a runaway agent fails in a controlled way instead of triggering a system-wide OOM.
# /etc/systemd/system/agent-telegram-support.service
[Unit]
Description=Telegram Support Agent
After=network.target
[Service]
Type=simple
User=agent
WorkingDirectory=/opt/agents/telegram-support
ExecStart=/usr/bin/node --max-old-space-size=2048 index.js
Restart=on-failure
RestartSec=10
# On cgroup v2 hosts, systemd deprecates MemoryLimit= in favor of MemoryMax=
MemoryLimit=3G
CPUQuota=50%
[Install]
WantedBy=multi-user.target
Message queueing without infrastructure:
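Inter-agent communication bypasses Redis/RabbitMQ in favor of SQLite in WAL mode. Consumers poll the messages table once per second. SQLite locks the whole database for each write rather than individual rows, so the design relies on WAL plus a busy timeout to absorb contention, which is enough for SMB-scale workloads without external dependencies.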
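// Shared message bus using SQLite.
// Assumption: the calls below match the better-sqlite3 API, so that package is imported here.
const Database = require('better-sqlite3');

class MessageBus {
  constructor(dbPath) {
    this.db = new Database(dbPath);
    this.db.pragma('journal_mode = WAL');
    this.db.pragma('busy_timeout = 5000');
  }

  // better-sqlite3 calls are synchronous; async only keeps the interface awaitable
  async publish(topic, message) {
    const stmt = this.db.prepare(
      'INSERT INTO messages (topic, payload, created_at) VALUES (?, ?, ?)'
    );
    stmt.run(topic, JSON.stringify(message), Date.now());
  }

  // Polling-based consumption. SQLite has no row locks, so this assumes
  // a single consumer per topic.
  async consume(topic, handler) {
    const select = this.db.prepare(
      'SELECT * FROM messages WHERE topic = ? AND processed = 0 ORDER BY id LIMIT 10'
    );
    const markDone = this.db.prepare('UPDATE messages SET processed = 1 WHERE id = ?');
    let running = false;

    setInterval(async () => {
      // Skip this tick while the previous batch is still being handled
      if (running) return;
      running = true;
      try {
        for (const msg of select.all(topic)) {
          await handler(JSON.parse(msg.payload));
          markDone.run(msg.id);
        }
      } catch (err) {
        // The failed message stays unprocessed, so the next poll retries it
        console.error(`consume(${topic}) failed:`, err);
      } finally {
        running = false;
      }
    }, 1000);
  }
}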
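The batched-write idea can be sketched independently of the database driver. The class below is a minimal illustration, not the author's implementation: `BatchedWriter`, `writeBatch`, `maxBatch`, and `flushMs` are hypothetical names. It buffers rows in memory and hands them to a caller-supplied function in one call, so the database write lock is taken once per batch instead of once per message.

```javascript
// Minimal batching sketch. `writeBatch` is a caller-supplied function that
// persists an array of rows, ideally inside a single transaction.
class BatchedWriter {
  constructor(writeBatch, { maxBatch = 100, flushMs = 250 } = {}) {
    this.writeBatch = writeBatch;
    this.maxBatch = maxBatch;
    this.queue = [];
    // Periodic flush so a quiet topic never leaves rows buffered for long
    this.timer = setInterval(() => {
      try {
        this.flush();
      } catch (err) {
        console.error('batched flush failed, will retry:', err);
      }
    }, flushMs);
    this.timer.unref(); // do not keep the process alive just for flushing
  }

  enqueue(row) {
    this.queue.push(row);
    if (this.queue.length >= this.maxBatch) this.flush();
  }

  // Writes everything buffered so far; returns the number of rows written
  flush() {
    if (this.queue.length === 0) return 0;
    const batch = this.queue;
    this.queue = [];
    try {
      this.writeBatch(batch);
    } catch (err) {
      // Put the rows back in order so a later flush can retry them
      this.queue = batch.concat(this.queue);
      throw err;
    }
    return batch.length;
  }

  close() {
    clearInterval(this.timer);
    return this.flush();
  }
}
```

With better-sqlite3, `writeBatch` could be a function created by `db.transaction(...)` that loops over the rows and calls a prepared INSERT, which wraps the whole batch in one transaction. Rows held in memory are lost if the process dies before a flush, so the flush interval is a durability trade-off.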
Model routing and fallback strategies:
The orchestrator implements cost-based routing with aggressive caching. Groq handles simple queries, Claude handles complex reasoning, and a local Llama 3.1 8B model serves as an always-available last-resort fallback.
class ModelRouter {
  async route(prompt, context) {
    // Check cache first
    const cached = await this.cache.get(this.hashPrompt(prompt));
    if (cached && !context.requiresFresh) return cached;

    // Groq for simple queries (free tier: 30 req/min)
    if (this.isSimpleQuery(prompt) && this.groqQuota.available()) {
      try {
        return await this.groqComplete(prompt);
      } catch (e) {
        // Groq fails often under load; log and fall through to the next tier
        console.warn('Groq request failed, falling back:', e.message);
      }
    }

    // Claude for complex queries (via API key)
    if (this.requiresReasoning(prompt) && this.claudeCredits > 0) {
      try {
        return await this.claudeComplete(prompt);
      } catch (e) {
        // Without this catch, a Claude error would bypass the local fallback
        console.warn('Claude request failed, falling back:', e.message);
      }
    }

    // Local Llama model as last resort
    return await this.localComplete(prompt);
  }
}
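The router calls `this.groqQuota.available()` without showing how the quota is tracked. One possible shape is a sliding-window counter, sketched below under stated assumptions: `SlidingWindowQuota`, `consume()`, and the injectable `now` clock are hypothetical, and only `available()` appears in the original code. The 30 requests per minute figure comes from the comment in the router.

```javascript
// Sliding-window request counter: at most `limit` requests in any
// `windowMs` interval. `now` is injectable to make the class testable.
class SlidingWindowQuota {
  constructor(limit, windowMs, now = Date.now) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
    this.timestamps = []; // send times of recent requests, oldest first
  }

  // Drop timestamps that have aged out of the window
  prune() {
    const cutoff = this.now() - this.windowMs;
    while (this.timestamps.length > 0 && this.timestamps[0] <= cutoff) {
      this.timestamps.shift();
    }
  }

  // True when one more request fits in the current window
  available() {
    this.prune();
    return this.timestamps.length < this.limit;
  }

  // Records a request if it fits; returns whether it was recorded
  consume() {
    if (!this.available()) return false;
    this.timestamps.push(this.now());
    return true;
  }
}

// Hypothetical wiring for the Groq free tier mentioned above
const groqQuota = new SlidingWindowQuota(30, 60 * 1000);
```

In this sketch the caller would invoke `consume()` immediately before each Groq request, so `available()` reflects requests actually sent. The counter is per process, so several agents sharing one API key would need a shared store instead.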
Monitoring on zero budget:
Observability relies on custom SQLite metrics tables and lightweight bash health checks executed via cron, eliminating paid APM dependencies.
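// Assumes the same better-sqlite3 Database import used by the message bus
class MetricsCollector {
  constructor(dbPath) {
    this.db = new Database(dbPath);
    this.buffer = new Map();
    // Flush metrics every 10 seconds
    setInterval(() => this.flush(), 10000);
  }

  increment(metric, value = 1) {
    const current = this.buffer.get(metric) || 0;
    this.buffer.set(metric, current + value);
  }

  async flush() {
    // Nothing buffered: skip the write entirely
    if (this.buffer.size === 0) return;
    const timestamp = Date.now();
    const stmt = this.db.prepare(
      'INSERT INTO metrics (metric, value, timestamp) VALUES (?, ?, ?)'
    );
    // One transaction per flush: a single write lock and commit for the whole batch
    const writeAll = this.db.transaction((entries) => {
      for (const [metric, value] of entries) {
        stmt.run(metric, value, timestamp);
      }
    });
    writeAll([...this.buffer.entries()]);
    this.buffer.clear();
  }
}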
#!/bin/bash
# /opt/agents/health-check.sh
# Check each agent endpoint and restart any service that does not answer 200
agents=("telegram-support:3001" "whatsapp-sales:3002" "orchestrator:3003")
for agent in "${agents[@]}"; do
  name="${agent%:*}"
  port="${agent#*:}"
  # --max-time keeps a hung agent from stalling the whole cron run
  response=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" "http://localhost:${port}/health")
  if [ "$response" != "200" ]; then
    systemctl restart "agent-${name}.service"
    echo "$(date): Restarted ${name}" >> /var/log/agent-restarts.log
  fi
done
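The health-check script expects every agent to answer `GET /health` with a 200. The agent side is not shown, so the following is only a sketch of what such an endpoint might look like using Node's built-in `http` module. `healthHandler`, `startHealthServer`, and the `isHealthy` callback are hypothetical names, and what counts as "healthy" is left to the agent.

```javascript
const http = require('node:http');

// Builds a request handler. `isHealthy` is a hypothetical callback supplied
// by the agent, for example "queue depth is below a threshold".
function healthHandler(isHealthy) {
  return (req, res) => {
    if (req.method !== 'GET' || req.url !== '/health') {
      res.writeHead(404);
      res.end();
      return;
    }
    const healthy = Boolean(isHealthy());
    // Any non-200 status makes the cron script restart the service
    res.writeHead(healthy ? 200 : 503, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ status: healthy ? 'ok' : 'unhealthy' }));
  };
}

// Binds to the loopback interface only, since the cron script probes localhost
function startHealthServer(port, isHealthy) {
  return http.createServer(healthHandler(isHealthy)).listen(port, '127.0.0.1');
}
```

An agent would call something like `startHealthServer(3001, () => true)` during startup, using the port listed for it in the script. Returning 503 for a degraded state lets the agent request its own restart without crashing.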
Pitfall Guide
- Memory Pressure Cascades & GC Pauses: Node.js garbage collection pauses spike when memory usage exceeds 80%, delaying message processing and letting queues build up. Best Practice: Enforce hard memory limits via systemd (MemoryLimit=3G) and stagger PM2 recycling every 6 hours so each process periodically starts from a fresh heap and restarts never coincide.
- SQLite Lock Contention: Multiple agents writing concurrently to the same database trigger lock timeouts, even with WAL mode. Best Practice: Implement batched writers with async queues and transaction wrapping to reduce write frequency and hold locks for minimal durations.
- Free-Tier API Degradation: Providers like Groq deprioritize free-tier traffic during peak loads, causing response times to jump from 200ms to 10+ seconds. Best Practice: Implement strict timeout handlers with AbortController, track quota availability, and route to fallback models immediately upon timeout or rate-limit errors.
- Over-Provisioning Agents Beyond Context Switching Limits: Running more than 6 concurrent agents on 4 ARM cores triggers excessive context switching, degrading throughput and increasing latency. Best Practice: Cap concurrency at 6 agents, allocate 50% CPU quota per service, and monitor context switch metrics to enforce hard boundaries.
- Ignoring Webhook Timeout Thresholds: Telegram and WhatsApp webhooks enforce strict timeout windows (typically 30 seconds). Complex routing or local inference can exceed this, causing message drops. Best Practice: Implement async acknowledgment patterns, cap processing time at 30 seconds, and queue long-running tasks for background completion with status webhooks.
- Lack of Distributed Tracing in Debugging: Without proper tracing, debugging cross-agent flows becomes painful and time-consuming. Best Practice: Embed correlation IDs in all SQLite message payloads, log structured JSON events to a centralized metrics table, and use PM2 logs with agent prefixes for rapid isolation.
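The timeout-and-fallback advice for free-tier providers can be illustrated with a small helper. This is a sketch under assumptions, not the article's code: `completeWithTimeout` is a hypothetical name, `primary` is any function that accepts an `AbortSignal` (for instance by passing it to `fetch`), and `fallback` stands in for the next model tier.

```javascript
// Runs `primary` with a deadline. On timeout or any error, the in-flight
// request is aborted and `fallback` is called with the triggering error.
async function completeWithTimeout(primary, fallback, timeoutMs) {
  const controller = new AbortController();
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`primary timed out after ${timeoutMs} ms`)),
      timeoutMs
    );
  });
  try {
    // Whichever settles first wins: the provider call or the deadline
    return await Promise.race([primary(controller.signal), deadline]);
  } catch (err) {
    controller.abort(); // cancel the provider call if it is still running
    return fallback(err);
  } finally {
    clearTimeout(timer);
  }
}
```

A router could wrap its Groq call this way, for example `completeWithTimeout((signal) => callGroq(prompt, signal), () => localComplete(prompt), 2000)`, where `callGroq` and the 2000 ms budget are placeholders. Keeping the budget well under the webhook window leaves time for the fallback model to answer.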
Deliverables
- Architecture Blueprint: A single-VM multi-agent topology mapping systemd service boundaries, the SQLite WAL message bus, and cost-aware routing decision trees. Includes a resource partitioning matrix (CPU/memory allocation per agent type).
- Production Readiness Checklist: Pre-deployment validation (systemd quota verification, SQLite WAL configuration, PM2 ecosystem setup), runtime monitoring thresholds (GC pause alerts, SQLite lock timeout tracking, API quota exhaustion warnings), and maintenance procedures (staggered recycling schedules, cache invalidation strategies).
- Configuration Templates: Ready-to-deploy systemd service files with hard resource limits, PM2 ecosystem.config.js templates for staggered restarts, SQLite schema definitions for message bus and metrics collection, and cron-scheduled health check scripts with automatic service recovery logic.
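As an illustration of the staggered-restart template, the sketch below generates a PM2 ecosystem file in which each agent is recycled every 6 hours at a different minute. The agent names and the /opt/agents path follow the examples earlier in the text. `staggeredCron` is a hypothetical helper, and the exact option values are assumptions rather than the author's published template.

```javascript
// ecosystem.config.js (sketch)
const AGENTS = ['telegram-support', 'whatsapp-sales', 'orchestrator'];
const RECYCLE_EVERY_HOURS = 6;

// Spreads restarts evenly across the first hour of each cycle so that
// no two agents are recycled at the same moment.
function staggeredCron(index, total, everyHours) {
  const minute = Math.floor((index * 60) / total);
  return `${minute} */${everyHours} * * *`;
}

const apps = AGENTS.map((name, i) => ({
  name: `agent-${name}`,
  cwd: `/opt/agents/${name}`,
  script: 'index.js',
  node_args: '--max-old-space-size=2048',
  max_memory_restart: '3G', // PM2 restarts the app above this memory usage
  cron_restart: staggeredCron(i, AGENTS.length, RECYCLE_EVERY_HOURS),
  autorestart: true,
}));

module.exports = { apps };
```

With three agents this yields restarts at minutes 0, 20, and 40 of every sixth hour. A process should be supervised by either PM2 or a systemd unit, not both, or the two supervisors will compete to restart it.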