lligent routing, connection management, and reputation monitoring.
Architecture Overview
- Producer: The application publishes email jobs to a message queue (e.g., Kafka, RabbitMQ, Redis Streams). Jobs include metadata: type (transactional/marketing), priority, recipient, and tags.
- Router: Consumes jobs and routes them based on type and reputation score. Transactional emails go to high-priority queues with dedicated IPs. Marketing emails go to bulk queues with shared or rotating IPs.
- Worker Pool: Stateless workers consume from queues. Each worker manages a connection pool to SMTP providers. Workers enforce rate limits and circuit breakers.
- Reputation Manager: Listens to webhooks for bounces, complaints, and delivery events. Updates internal reputation scores and adjusts routing rules dynamically.
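- Suppression List: Real-time blocklist of addresses that have hard-bounced or complained. The list is checked before sending (at routing time in the code below) so that known-bad addresses never reach the SMTP provider.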
Implementation Details
1. Intelligent Router
The router determines the transport strategy based on email classification and current reputation health.
import { Redis } from 'ioredis';

export enum EmailType {
  TRANSACTIONAL = 'transactional',
  MARKETING = 'marketing',
  NOTIFICATION = 'notification',
}

export interface EmailJob {
  id: string;
  type: EmailType;
  recipient: string;
  payload: any;
  metadata: Record<string, string>;
}

export class EmailRouter {
  private redis: Redis;
  // Reputation score per IP pool (0 to 1), expected to be refreshed by the Reputation Manager.
  private reputationScores: Map<string, number>;

  constructor(redisUrl: string) {
    this.redis = new Redis(redisUrl);
    this.reputationScores = new Map();
  }

  async route(job: EmailJob): Promise<string> {
    // Check suppression list
    if (await this.isSuppressed(job.recipient)) {
      return 'suppressed';
    }
    // Select queue based on type and reputation
    const queueName = this.selectQueue(job);
    await this.redis.lpush(queueName, JSON.stringify(job));
    // Emit metrics
    await this.emitMetric('email.routed', { type: job.type, queue: queueName });
    return queueName;
  }

  private selectQueue(job: EmailJob): string {
    switch (job.type) {
      case EmailType.TRANSACTIONAL:
        return 'queue:transactional:high-priority';
      case EmailType.MARKETING:
        return this.getMarketingQueue();
      default:
        return 'queue:default';
    }
  }

  private getMarketingQueue(): string {
    // A pool with no recorded score defaults to 0 and is therefore throttled.
    const score = this.reputationScores.get('marketing-ip-pool') ?? 0;
    // If reputation is low, throttle or pause marketing
    if (score < 0.7) {
      return 'queue:marketing:throttled';
    }
    return 'queue:marketing:standard';
  }

  private async isSuppressed(email: string): Promise<boolean> {
    // SISMEMBER resolves to 1 or 0, so compare explicitly to get a boolean.
    return (await this.redis.sismember('suppression:list', email)) === 1;
  }

  // Placeholder metrics hook (assumption): swap in a real metrics client such as StatsD or Prometheus.
  private async emitMetric(name: string, tags: Record<string, string>): Promise<void> {
    console.log(name, tags);
  }
}
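The router reads `reputationScores` but the listing does not show how the map gets populated. One possible way to derive a score from delivery events is sketched below. Everything in it is an assumption made for illustration: the names `PoolStats`, `scoreFromStats`, and `ReputationTracker`, and above all the scoring formula, which simply scales the bounce and complaint rates against the 2% and 0.1% thresholds used elsewhere in this article.

```typescript
// Sketch only: the scoring formula and all names here are illustrative assumptions.
type PoolEvent = 'delivered' | 'bounced' | 'complained';

interface PoolStats {
  delivered: number;
  bounced: number;
  complained: number;
}

const BOUNCE_LIMIT = 0.02;     // 2% bounce rate
const COMPLAINT_LIMIT = 0.001; // 0.1% complaint rate

// Map event counts to a score in [0, 1]: 1 is clean, 0.5 means a limit has been reached.
function scoreFromStats(stats: PoolStats): number {
  const sent = stats.delivered + stats.bounced;
  if (sent === 0) return 1; // no data yet: treated as healthy in this sketch
  const bounceRatio = stats.bounced / sent / BOUNCE_LIMIT;
  const complaintRatio = stats.complained / sent / COMPLAINT_LIMIT;
  const penalty = Math.min(1, Math.max(bounceRatio, complaintRatio) * 0.5);
  return 1 - penalty;
}

class ReputationTracker {
  private stats = new Map<string, PoolStats>();

  // Count one delivery event against the given IP pool.
  record(pool: string, event: PoolEvent): void {
    const current = this.stats.get(pool) ?? { delivered: 0, bounced: 0, complained: 0 };
    current[event] += 1;
    this.stats.set(pool, current);
  }

  // Current score for a pool; pools without data score 1 here.
  score(pool: string): number {
    const current = this.stats.get(pool);
    return current ? scoreFromStats(current) : 1;
  }
}
```

With the router's 0.7 cut-off, a pool in this sketch is throttled once either rate exceeds 60% of its limit. Note that the router listing treats a missing score as 0, so marketing mail stays on the throttled queue until some score has been written.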
2. Worker with Rate Limiting and Circuit Breaking
Workers must respect ISP limits and handle transient failures gracefully.
import { createTransport, Transporter } from 'nodemailer';
import { RateLimiter } from 'limiter';
// EmailJob is the interface defined in the router listing.

// Assumed shape of the worker configuration, inferred from the fields used below.
export interface WorkerConfig {
  smtpHost: string;
  smtpPort: number;
  maxConnections: number;
  maxMessages: number;
  rateLimit: number; // messages per minute
}

export class EmailWorker {
  private transporter: Transporter;
  private rateLimiter: RateLimiter;
  private circuitBreaker: CircuitBreaker;

  constructor(config: WorkerConfig) {
    this.transporter = createTransport({
      host: config.smtpHost,
      port: config.smtpPort,
      secure: true, // implicit TLS, normally port 465
      pool: true,
      maxConnections: config.maxConnections,
      maxMessages: config.maxMessages,
    });
    // Per-worker rate limit. Mailbox providers rarely publish exact limits; the
    // ~50-100 msg/min figure sometimes quoted for new IPs sending to Gmail is a
    // rule of thumb, not a guarantee.
    this.rateLimiter = new RateLimiter({
      tokensPerInterval: config.rateLimit,
      interval: 'minute',
    });
    this.circuitBreaker = new CircuitBreaker({
      failureThreshold: 5,
      resetTimeout: 60000,
    });
  }

  async process(job: EmailJob): Promise<void> {
    // Fail fast when no token is available so the caller can requeue the job.
    if (!await this.rateLimiter.tryRemoveTokens(1)) {
      throw new Error('Rate limit exceeded');
    }
    const result = await this.circuitBreaker.fire(async () => {
      return await this.transporter.sendMail({
        from: job.payload.from,
        to: job.recipient,
        subject: job.payload.subject,
        html: job.payload.body,
        headers: {
          // RFC 2369 requires the URL to be wrapped in angle brackets.
          'List-Unsubscribe': `<${job.payload.unsubscribeLink}>`,
          'X-Entity-Ref-ID': job.id,
        },
      });
    });
    if (!result.accepted.includes(job.recipient)) {
      throw new Error('Recipient rejected');
    }
  }
}
// Circuit Breaker implementation for resilience
class CircuitBreaker {
  private failures: number = 0;
  private isOpen: boolean = false;
  private failureThreshold: number;
  private resetTimeout: number;

  constructor(options: { failureThreshold: number; resetTimeout: number }) {
    this.failureThreshold = options.failureThreshold;
    this.resetTimeout = options.resetTimeout;
  }

  async fire<T>(fn: () => Promise<T>): Promise<T> {
    // While open, reject immediately without calling the provider.
    if (this.isOpen) {
      throw new Error('Circuit breaker is open');
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.isOpen = false;
  }

  private onFailure() {
    this.failures++;
    if (this.failures >= this.failureThreshold) {
      this.isOpen = true;
      // Close the breaker again after the reset timeout.
      setTimeout(() => {
        this.isOpen = false;
        this.failures = 0;
      }, this.resetTimeout);
    }
  }
}
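The breaker above closes completely once `resetTimeout` elapses. A common variation, not part of the listing above, adds a half-open state: after the timeout a trial call is let through, and its outcome decides whether the breaker closes or opens again. The sketch below is one hedged way to write that. The class name `HalfOpenCircuitBreaker` and the injectable `now` clock are assumptions introduced so the behaviour can be exercised without real timers, and the sketch does not limit the number of concurrent trial calls.

```typescript
// Sketch of a half-open circuit breaker; names and structure are illustrative assumptions.
type BreakerState = 'closed' | 'open' | 'half-open';

class HalfOpenCircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';
  private failureThreshold: number;
  private resetTimeoutMs: number;
  private now: () => number;

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock, useful in tests
  }

  // Reports the current state, moving from open to half-open once the timeout has passed.
  getState(): BreakerState {
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'half-open';
    }
    return this.state;
  }

  async fire<T>(fn: () => Promise<T>): Promise<T> {
    if (this.getState() === 'open') {
      throw new Error('Circuit breaker is open');
    }
    try {
      const result = await fn();
      // Any success, including a half-open trial, closes the breaker.
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch (error) {
      this.failures++;
      // A failed trial call re-opens immediately; otherwise wait for the threshold.
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

Compared with a timer-based reset, this avoids sending a full burst of traffic to a provider that may still be failing when the timeout expires.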
3. Reputation Management and Warm-up
Scaling requires a systematic warm-up process. New IPs must gradually increase volume.
export class ReputationManager {
  // Returns the send volume to allow tomorrow for the given IP.
  async warmUpIp(ip: string, currentVolume: number): Promise<number> {
    // Implement warm-up curve: increase volume by 20-30% daily if metrics are healthy
    const metrics = await this.fetchMetrics(ip);
    if (metrics.bounceRate > 0.02 || metrics.complaintRate > 0.001) {
      // Pause or reduce volume if reputation degrades
      return Math.max(0, Math.floor(currentVolume * 0.5));
    }
    // Safe increase (rounded up so small volumes still grow)
    return Math.ceil(currentVolume * 1.2);
  }

  private async fetchMetrics(ip: string): Promise<{ bounceRate: number; complaintRate: number }> {
    // Placeholder values. In production, query the provider API or aggregate
    // feedback loop webhooks for the given IP.
    return {
      bounceRate: 0.005,
      complaintRate: 0.0002,
    };
  }
}
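To get a feel for how long such a curve takes, the growth rule can be projected forward. The helper below is a minimal sketch: `projectWarmUp` is a hypothetical name, it assumes every day is healthy, and it ignores any cap on the absolute daily increase.

```typescript
// Project a warm-up schedule assuming every day's metrics stay healthy.
// Returns the planned volume for day 0, day 1, ... up to the day the target is reached.
function projectWarmUp(initialVolume: number, targetVolume: number, dailyIncrease: number): number[] {
  if (initialVolume <= 0 || dailyIncrease <= 0) {
    throw new RangeError('initialVolume and dailyIncrease must be positive');
  }
  const schedule: number[] = [];
  let volume = initialVolume;
  while (volume < targetVolume) {
    schedule.push(Math.round(volume));
    volume *= 1 + dailyIncrease;
  }
  schedule.push(targetVolume); // final day is clamped to the target
  return schedule;
}

const schedule = projectWarmUp(100, 100_000, 0.2);
console.log(`Days to reach target: ${schedule.length - 1}`); // 38
console.log(schedule.slice(0, 5)); // first few days of the curve
```

Starting at 100 messages per day and growing 20% daily, the projection reaches 100,000 per day on day 38, which is about five and a half weeks. Any day on which volume is halved because of poor metrics pushes that date out further.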
Architecture Decisions and Rationale
- Queue-Based Decoupling: Decoupling production from delivery allows the application to scale independently of email infrastructure. It also enables backpressure handling; if delivery slows down, the queue buffers without impacting user experience.
- Connection Pooling: Reusing SMTP connections reduces latency and avoids hitting connection limits. Nodemailer's pool mode is essential for high-throughput workers.
- Traffic Segmentation: Mixing transactional and marketing emails on the same IP is a critical error. Marketing emails generate higher complaint rates, which can poison transactional delivery. Dedicated IP pools or provider routing rules isolate these risks.
- Real-Time Suppression: Processing bounces and complaints via webhooks and updating suppression lists in real-time prevents sending to invalid addresses, preserving reputation.
Pitfall Guide
1. Synchronous Sending in Request Path
Mistake: Calling sendMail directly within the HTTP request handler.
Impact: Increases request latency, risks timeouts, and blocks application threads. If the email provider is slow, the entire API degrades.
Fix: Always use an asynchronous queue. Return a 202 Accepted immediately after publishing the job.
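A framework-agnostic sketch of that fix follows. The handler validates the request, publishes a job, and answers with 202 without waiting for SMTP. The `SendRequest`, `JobQueue`, and `HandlerResponse` shapes, the `handleSendRequest` name, the simple ID generator, and the 400/503 responses are all assumptions for illustration; a real service would adapt them to its HTTP framework and queue client.

```typescript
// Sketch: accept an email request by enqueueing it, never by sending inline.
interface SendRequest {
  type: 'transactional' | 'marketing' | 'notification';
  recipient: string;
  subject: string;
  body: string;
}

interface QueuedJob extends SendRequest {
  id: string;
}

// Minimal queue abstraction (assumption); back it with Redis, Kafka, etc.
interface JobQueue {
  publish(job: QueuedJob): Promise<void>;
}

interface HandlerResponse {
  status: number;
  body: Record<string, string>;
}

let jobCounter = 0;
// Placeholder ID generator; a UUID would be a more typical choice in production.
function defaultJobId(): string {
  jobCounter += 1;
  return `job-${Date.now()}-${jobCounter}`;
}

async function handleSendRequest(
  req: SendRequest,
  queue: JobQueue,
  newId: () => string = defaultJobId,
): Promise<HandlerResponse> {
  if (!req.recipient.includes('@')) {
    return { status: 400, body: { error: 'invalid recipient' } };
  }
  const job: QueuedJob = { ...req, id: newId() };
  try {
    await queue.publish(job); // only the enqueue is awaited, not delivery
  } catch {
    return { status: 503, body: { error: 'queue unavailable' } };
  }
  return { status: 202, body: { jobId: job.id } };
}
```

Returning the job ID lets the client correlate later delivery events with the original request.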
2. Ignoring ISP Warm-Up Curves
Mistake: Sending high volume immediately from a new IP or domain.
Impact: ISPs flag the spike as spam behavior, leading to immediate throttling or blacklisting.
Fix: Implement a warm-up script that gradually increases volume over 4-6 weeks. Monitor reputation scores daily during this period.
3. Mixing Traffic Types on Shared IPs
Mistake: Sending password resets and promotional newsletters from the same IP address.
Impact: Complaints from marketing emails lower the IP reputation, causing transactional emails to land in spam.
Fix: Use separate IP pools or provider routing rules. Transactional traffic should have dedicated resources with strict reputation monitoring.
4. No Feedback Loop Processing
Mistake: Assuming 250 OK means successful delivery and not processing bounces/complaints.
Impact: Continued sending to invalid addresses increases bounce rates, triggering ISP filters.
Fix: Configure webhooks for delivery events. Process bounces and complaints to update suppression lists and adjust routing.
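As a rough illustration of that processing step, the sketch below applies delivery events to a suppression set. The `DeliveryEvent` shape is hypothetical, since every provider defines its own webhook payload, and a plain `Set` stands in for the Redis set used by the router. It also assumes that lowercasing addresses is acceptable, which matches how most mailbox providers treat them in practice.

```typescript
// Sketch: turn provider webhook events into suppression-list updates.
// The event shape below is an assumption; map your provider's payload onto it.
type DeliveryEvent =
  | { kind: 'delivered'; recipient: string }
  | { kind: 'bounce'; recipient: string; permanent: boolean }
  | { kind: 'complaint'; recipient: string };

function normalizeAddress(email: string): string {
  return email.trim().toLowerCase();
}

// Returns true when the event caused the recipient to be suppressed.
function applyDeliveryEvent(event: DeliveryEvent, suppression: Set<string>): boolean {
  // Complaints and permanent (hard) bounces suppress; temporary bounces do not.
  const shouldSuppress =
    event.kind === 'complaint' || (event.kind === 'bounce' && event.permanent);
  if (shouldSuppress) {
    suppression.add(normalizeAddress(event.recipient));
  }
  return shouldSuppress;
}

function isSuppressed(email: string, suppression: Set<string>): boolean {
  return suppression.has(normalizeAddress(email));
}
```

Temporary bounces such as a full mailbox are left alone here so they can be retried; a stricter policy might suppress after several consecutive soft bounces.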
5. Hardcoded Credentials Without Rotation
Mistake: Embedding SMTP credentials in code or environment variables without rotation policies.
Impact: Security risk and potential service interruption if credentials are compromised or expired.
Fix: Use a secrets manager (e.g., AWS Secrets Manager, HashiCorp Vault) with automatic rotation. Implement credential rotation logic in workers.
6. DNS Misconfiguration (SPF/DKIM/DMARC)
Mistake: Sending emails without proper DNS records.
Impact: ISPs cannot verify sender authenticity, leading to high spam folder placement or rejection.
Fix: Configure SPF, DKIM, and DMARC records correctly. Use DMARC policies (p=quarantine or p=reject) to enforce authentication.
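A small check like the one below can catch a DMARC record that is malformed or still set to `p=none` before it is deployed. It is a sketch that only parses the record text, which is a semicolon-separated list of `tag=value` pairs beginning with `v=DMARC1`; it performs no DNS lookup, and the helper names `parseDmarc` and `isEnforcing` are made up for this example.

```typescript
// Parse a DMARC TXT record into tag/value pairs.
// Returns null when the record is malformed or does not start with "v=DMARC1".
function parseDmarc(record: string): Map<string, string> | null {
  const parts = record
    .split(';')
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
  const first = parts[0];
  if (!first || !/^v\s*=\s*DMARC1$/.test(first)) return null;

  const tags = new Map<string, string>();
  for (const part of parts) {
    const eq = part.indexOf('='); // split on the first "=" only; values may contain more
    if (eq === -1) return null;
    tags.set(part.slice(0, eq).trim().toLowerCase(), part.slice(eq + 1).trim());
  }
  return tags;
}

// True when the policy asks receivers to quarantine or reject failing mail.
function isEnforcing(record: string): boolean {
  const policy = parseDmarc(record)?.get('p')?.toLowerCase();
  return policy === 'quarantine' || policy === 'reject';
}

console.log(isEnforcing('v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com')); // true
console.log(isEnforcing('v=DMARC1; p=none')); // false
```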
7. Assuming "Sent" Equals "Delivered"
Mistake: Tracking only SMTP acceptance, not inbox placement.
Impact: False confidence in system health. Users report missing emails despite "successful" sends.
Fix: Implement delivery tracking via webhooks. Monitor inbox placement rates using seed accounts or third-party services.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup / Low Volume (<10k/day) | Shared IP Pool + Provider | Low complexity, sufficient reputation, minimal setup | Low |
| Mid-Size / Mixed Traffic (10k-100k/day) | Tiered Routing + Dedicated Transactional IP | Isolates transactional reputation, allows marketing scaling | Medium |
| Enterprise / High Volume (>100k/day) | Dedicated IP Pools + Custom Router | Full control over reputation, optimized routing, compliance | High |
| Global Distribution | Geo-Distributed Workers + Local IPs | Reduces latency, complies with regional data laws | High |
Configuration Template
Example configuration for a tiered routing system. Replace the domain names and DNS values with your own before use.
# email-config.yaml
router:
  queues:
    transactional:
      priority: high
      ip_pool: dedicated-transactional
      rate_limit: 100 # msgs/min per worker
      retry_policy:
        max_retries: 3
        backoff: exponential
    marketing:
      priority: low
      ip_pool: shared-marketing
      rate_limit: 50 # msgs/min per worker
      retry_policy:
        max_retries: 1
        backoff: fixed
reputation:
  warm_up:
    enabled: true
    initial_volume: 100
    daily_increase: 0.2
    max_daily_increase: 5000
  thresholds:
    bounce_rate: 0.02
    complaint_rate: 0.001
    action: throttle
dns:
  domains:
    - name: mail.example.com
      spf: "v=spf1 include:_spf.google.com ~all"
      dkim: "selector1._domainkey.example.com"
      dmarc: "v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com"
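The `retry_policy` entries name a backoff strategy but do not say how delays are computed. One plausible reading is sketched below. The `RetryPolicy` type, the `retryDelayMs` and `retrySchedule` helpers, and the one-second base delay are assumptions, since the template defines no base interval.

```typescript
// Sketch: translate a retry_policy entry into concrete delays.
interface RetryPolicy {
  maxRetries: number;
  backoff: 'exponential' | 'fixed';
}

// Delay before retry number `attempt` (1-based), or null once retries are exhausted.
// baseMs is an assumed default; the configuration template does not define one.
function retryDelayMs(policy: RetryPolicy, attempt: number, baseMs = 1000): number | null {
  if (attempt < 1 || attempt > policy.maxRetries) return null;
  return policy.backoff === 'exponential' ? baseMs * 2 ** (attempt - 1) : baseMs;
}

// Full list of delays a job would go through under the policy.
function retrySchedule(policy: RetryPolicy, baseMs = 1000): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= policy.maxRetries; attempt++) {
    delays.push(retryDelayMs(policy, attempt, baseMs) as number);
  }
  return delays;
}

console.log(retrySchedule({ maxRetries: 3, backoff: 'exponential' })); // [1000, 2000, 4000]
console.log(retrySchedule({ maxRetries: 1, backoff: 'fixed' }));       // [1000]
```

In practice a random jitter is often added to each delay so that workers retrying after the same outage do not all hit the provider at once.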
Quick Start Guide
- Deploy Queue Infrastructure: Spin up Redis or Kafka. Configure queue names for transactional and marketing traffic.
- Run Worker Service: Deploy the worker service with the configuration template. Ensure connection pooling and rate limiting are active.
- Integrate Producer: Update application code to publish email jobs to the queue instead of sending directly. Add metadata for routing.
- Verify DNS and Routing: Check DNS records using online tools. Send test emails and verify routing via logs. Monitor initial delivery metrics.
- Enable Feedback Loops: Configure webhooks in your email provider. Test bounce and complaint processing by sending to known invalid addresses.
Scaling email delivery requires a shift from throughput-centric thinking to reputation-centric architecture. By implementing decoupled queues, intelligent routing, strict rate limiting, and real-time reputation management, teams can scale email systems to millions of messages per day while maintaining high deliverability and system resilience.