ind text extraction with a five-layer DNS verification stack and a localized path enumeration engine. Validation executes inline during the crawl phase, and results are cached per domain to prevent redundant network calls.
1. Domain-Level DNS Verification & Grading
Instead of validating per-email, we validate per-domain. Email addresses sharing a domain inherit the same routing characteristics. This reduces DNS query volume by 80–90% in typical datasets.
```typescript
import { resolveMx, resolveTxt } from 'node:dns/promises';
import { LRUCache } from 'lru-cache';

interface DeliverabilityReport {
  mxCount: number;
  hasSpf: boolean;
  hasDmarc: boolean;
  isCatchAll: boolean | null;
  grade: 'high' | 'medium' | 'low' | 'unknown';
}

export class DomainValidator {
  private cache: LRUCache<string, DeliverabilityReport>;

  constructor(ttlMs: number = 3600000, maxEntries: number = 10000) {
    this.cache = new LRUCache({ ttl: ttlMs, max: maxEntries });
  }

  async assess(domain: string): Promise<DeliverabilityReport> {
    const cached = this.cache.get(domain);
    if (cached) return cached;

    const report: DeliverabilityReport = {
      mxCount: 0,
      hasSpf: false,
      hasDmarc: false,
      isCatchAll: null,
      grade: 'unknown',
    };

    try {
      const mxRecords = await resolveMx(domain);
      report.mxCount = mxRecords.length;
    } catch {
      report.mxCount = 0;
    }

    // No MX records: grade as low and skip the TXT lookups.
    if (report.mxCount === 0) {
      report.grade = 'low';
      this.cache.set(domain, report);
      return report;
    }

    try {
      const txtRecords = await resolveTxt(domain);
      const joined = txtRecords.flat().join(' ').toLowerCase();
      report.hasSpf = joined.includes('v=spf1');
    } catch { /* TXT lookup failure is non-fatal */ }

    try {
      const dmarcRecords = await resolveTxt(`_dmarc.${domain}`);
      const joined = dmarcRecords.flat().join(' ').toLowerCase();
      report.hasDmarc = joined.includes('v=dmarc1');
    } catch { /* DMARC absence is common */ }

    report.grade = this.calculateGrade(report);
    this.cache.set(domain, report);
    return report;
  }

  private calculateGrade(report: DeliverabilityReport): DeliverabilityReport['grade'] {
    if (report.mxCount > 0 && report.hasSpf && report.hasDmarc) return 'high';
    if (report.mxCount > 0) return 'medium';
    return 'low';
  }
}
```
Architectural Rationale:
- Per-Domain Caching: DNS queries are expensive and rate-limited by upstream resolvers. Caching at the domain level ensures that 50 emails from `@acme.com` trigger one set of MX, SPF, and DMARC lookups rather than 50.
- LRU Eviction: Memory-bound environments require bounded caches. An LRU strategy with a 1-hour TTL balances freshness against resource consumption.
- Graceful Degradation: TXT record failures do not abort the process. Missing SPF/DMARC is common and should downgrade the grade rather than block extraction.
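To make the per-domain idea concrete, here is a minimal sketch of grouping extracted addresses by domain before validation. The `groupByDomain` helper is hypothetical (it is not part of the validator above) and assumes that everything after the last `@` is the domain.

```typescript
// Hypothetical helper: buckets extracted addresses by their domain part so
// that each domain needs only one validator call.
function groupByDomain(emails: Iterable<string>): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const raw of emails) {
    const email = raw.trim().toLowerCase();
    const at = email.lastIndexOf('@');
    // Skip strings with no local part or no domain part.
    if (at <= 0 || at === email.length - 1) continue;
    const domain = email.slice(at + 1);
    const bucket = groups.get(domain);
    if (bucket) bucket.push(email);
    else groups.set(domain, [email]);
  }
  return groups;
}

const grouped = groupByDomain([
  'Ann@acme.com',
  'bob@acme.com',
  'eve@example.org',
  'broken',
]);
// Three valid addresses collapse into two domains, so two lookups suffice.
```

Each key of `grouped` could then be passed once to `DomainValidator.assess()`, with the resulting report applied to every address in that bucket.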
2. Localized Endpoint Enumeration
Hardcoded English paths leave international inventory untouched. The crawler must dynamically probe region-specific directories before falling back to generic patterns.
```typescript
import { fetch } from 'undici';

const LOCALIZED_PATHS: Record<string, string[]> = {
  en: ['/contact', '/contact-us', '/about', '/about-us', '/team'],
  de: ['/impressum', '/kontakt', '/ansprechpartner', '/ueber-uns'],
  hu: ['/kapcsolat', '/elerhetoseg', '/rolunk'],
  es: ['/contacto', '/contactar', '/contactenos', '/quienes-somos'],
  it: ['/contatti', '/contattaci', '/chi-siamo'],
  fr: ['/contactez-nous', '/nous-contacter', '/a-propos'],
  pl: ['/kontakt-z-nami', '/kontakty', '/o-nas'],
  pt: ['/contato', '/contatos', '/contacto', '/sobre-nos'],
  nl: ['/contact', '/over-ons', '/neem-contact-op'],
};

export class LocalizedCrawler {
  private emailRegex = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/gi;

  async extractFromDomain(baseUrl: string, locale?: string): Promise<Set<string>> {
    const discovered = new Set<string>();
    // Localized paths first, English as fallback. The Set drops duplicates
    // (for example '/contact' appears in both nl and en) so no path is fetched twice.
    const localized = locale ? LOCALIZED_PATHS[locale] ?? [] : [];
    const paths = [...new Set([...localized, ...LOCALIZED_PATHS.en])];

    for (const path of paths) {
      try {
        const target = new URL(path, baseUrl).toString();
        const response = await fetch(target, {
          signal: AbortSignal.timeout(5000),
          headers: { 'User-Agent': 'LeadExtractor/1.0' },
        });
        if (!response.ok) continue;
        const html = await response.text();
        const matches = html.match(this.emailRegex) || [];
        matches.forEach(email => discovered.add(email.toLowerCase()));
      } catch {
        // Network timeout, DNS failure, or malformed URL
      }
    }
    return discovered;
  }
}
```
Architectural Rationale:
- Locale-Aware Fallback: When a locale is specified, localized paths are prioritized, but English paths remain as a safety net. This prevents zero-yield scenarios on bilingual sites.
- AbortSignal Timeouts: `AbortSignal.timeout(5000)` puts a hard 5-second limit on each request, so a stalled server cannot hold up the crawl loop or tie up connections indefinitely.
- Set-Based Deduplication: Lowercasing each match before adding it to a `Set` yields case-insensitive uniqueness without a secondary filtering pass.
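The deduplication point depends on normalizing matches before they enter the `Set`. The sketch below shows one way that step might look in isolation. It also drops asset filenames such as `logo@2x.png`, which the email regex above can match. The `normalizeMatches` helper and the `ASSET_EXTENSIONS` list are assumptions for illustration, not part of the crawler.

```typescript
// Assumed list of file extensions that the email regex can mistake for a TLD.
const ASSET_EXTENSIONS = new Set(['png', 'jpg', 'jpeg', 'gif', 'svg', 'webp', 'css', 'js']);

// Hypothetical helper: lowercases regex matches, discards asset filenames,
// and returns the unique remainder.
function normalizeMatches(matches: string[]): Set<string> {
  const unique = new Set<string>();
  for (const match of matches) {
    const email = match.trim().toLowerCase();
    const tld = email.slice(email.lastIndexOf('.') + 1);
    if (ASSET_EXTENSIONS.has(tld)) continue; // e.g. "logo@2x.png"
    unique.add(email);
  }
  return unique;
}

const cleaned = normalizeMatches(['Info@Acme.com', 'info@acme.com', 'logo@2x.png']);
```

The extension list is deliberately short; a production filter would likely need tuning against real crawl output.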
3. Production Pipeline Architecture
Raw extraction and validation must be wrapped in operational safeguards to prevent billing spikes and reputation damage.
- Preflight Runtime Estimation: Before executing a crawl, calculate expected duration based on target count, path enumeration depth, and DNS resolution overhead. If `estimated_runtime > platform_timeout`, abort immediately. This prevents zero-value billing events on compute-limited infrastructure.
- Event-Driven Billing Alignment: Shift from compute-unit (CU) pricing to `PAY_PER_EVENT` models. Charging per delivered lead ($0.005) plus a minimal run-start fee ($0.00005) aligns infrastructure costs with actual business value. CU pricing penalizes unpredictable crawl depths and creates budget volatility.
- Delta Deduplication: Track `placeId`, `cid`, or normalized domain hashes across execution cycles. Skip known entities before triggering enrichment or validation steps. This reduces redundant DNS queries and prevents duplicate lead injection into CRM systems.
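The preflight check can be sketched as a small pure function. The cost model here is an assumption: each target costs its path count times an average HTTP latency plus one DNS assessment, and targets run in batches of `concurrency`. The `PreflightInput` fields and both function names are hypothetical, so calibrate the averages against your own runs.

```typescript
// Hypothetical input shape for a preflight estimate.
interface PreflightInput {
  targetCount: number;       // domains to crawl
  pathsPerTarget: number;    // paths probed per domain (fetched sequentially)
  avgHttpMs: number;         // assumed average latency per HTTP request
  avgDnsMs: number;          // assumed average latency per domain assessment
  concurrency: number;       // domains processed in parallel
  platformTimeoutMs: number; // hard limit imposed by the platform
}

// Assumed cost model: sequential path fetches plus one DNS assessment per
// target, with targets processed in batches of `concurrency`.
function estimateRuntimeMs(input: PreflightInput): number {
  const perTargetMs = input.pathsPerTarget * input.avgHttpMs + input.avgDnsMs;
  const batches = Math.ceil(input.targetCount / Math.max(1, input.concurrency));
  return batches * perTargetMs;
}

// True when the run should be rejected before any work is billed.
function shouldAbort(input: PreflightInput): boolean {
  return estimateRuntimeMs(input) > input.platformTimeoutMs;
}

const smallRun: PreflightInput = {
  targetCount: 100,
  pathsPerTarget: 12,
  avgHttpMs: 500,
  avgDnsMs: 50,
  concurrency: 8,
  platformTimeoutMs: 180000,
};
const largeRun: PreflightInput = { ...smallRun, targetCount: 1000 };
```

With these numbers the small run is estimated at 13 batches of 6,050 ms (78,650 ms) and passes, while the large run needs 125 batches and is rejected.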
Pitfall Guide
- Treating Catch-All Domains as Valid
  Explanation: Domains accepting all RCPT TO commands appear valid during SMTP handshakes but silently drop messages. Routing these to primary relays inflates bounce rates and triggers ISP throttling.
  Fix: Implement lightweight VRFY/RCPT TO probes with strict timeouts. Flag catch-all domains as medium grade and route them through warming infrastructure or manual verification queues.
- Skipping Authentication Record Verification
  Explanation: SPF and DMARC absence doesn't guarantee delivery failure, but it guarantees inbox provider scrutiny. Modern filters (Gmail, Outlook, Yahoo) heavily weight authentication presence.
  Fix: Always resolve _dmarc.domain and root TXT records. Use the presence/absence of these records to calculate deliverability grades rather than relying on MX existence alone.
- English-Only Path Enumeration
  Explanation: Hardcoding /contact or /about misses legally mandated disclosure pages in the EU and culturally standard endpoints in LATAM/Asia. This creates systematic geographic bias in lead datasets.
  Fix: Maintain an extensible locale dictionary. Prioritize region-specific paths based on target market, and always include a fallback to generic English paths.
- Per-Email DNS Resolution
  Explanation: Querying DNS for every extracted email address multiplies latency and costs. A single domain with 200 emails triggers 200 redundant MX/SPF lookups.
  Fix: Implement domain-level caching with TTL eviction. Resolve once per domain, cache the result, and apply it to all associated addresses. This reduces query volume by 80–90%.
- Compute-Unit Billing for Variable Workloads
  Explanation: CU pricing charges for idle time, network latency, and unpredictable crawl depths. Teams frequently exceed budgets when targets require deep path enumeration or slow DNS resolution.
  Fix: Migrate to per-result or event-driven billing. Align costs with delivered leads rather than execution time. Implement preflight estimators to reject runs that exceed timeout thresholds.
- Missing Delta Deduplication
  Explanation: Re-scraping known businesses wastes bandwidth, triggers duplicate billing events, and pollutes CRM databases with redundant records.
  Fix: Store normalized identifiers (placeId, domain_hash, cid) in a persistent key-value store. Compare against previous run snapshots before initiating validation or enrichment steps.
- Unbounded Memory Caches
  Explanation: Storing DNS results without eviction limits causes memory leaks in long-running scrapers. OOM kills interrupt pipelines and corrupt partial datasets.
  Fix: Use LRU caches with explicit size limits and TTL expiration. Monitor heap usage and implement periodic cache compaction in containerized environments.
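The delta deduplication fix can be sketched as follows. This is a minimal illustration under stated assumptions: the in-memory `Set` stands in for the persistent key-value store, `domainHash` and `filterNew` are hypothetical names, and normalization is limited to lowercasing and stripping a leading `www.`.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical normalizer: accepts a bare domain or a full URL and returns a
// SHA-256 hex digest of the lowercased hostname without a leading "www.".
// Throws if the input cannot be parsed as a URL.
function domainHash(input: string): string {
  const trimmed = input.trim().toLowerCase();
  const withScheme = trimmed.includes('://') ? trimmed : `https://${trimmed}`;
  const host = new URL(withScheme).hostname.replace(/^www\./, '');
  return createHash('sha256').update(host).digest('hex');
}

// Returns only targets whose domain hash has not been seen, and records the
// new hashes in `seen`. In production `seen` would be backed by persistent
// storage rather than process memory.
function filterNew(targets: string[], seen: Set<string>): string[] {
  const fresh: string[] = [];
  for (const target of targets) {
    const key = domainHash(target);
    if (seen.has(key)) continue;
    seen.add(key);
    fresh.push(target);
  }
  return fresh;
}

const seen = new Set<string>();
const firstRun = filterNew(['https://www.acme.com/contact', 'acme.com', 'Example.org'], seen);
const secondRun = filterNew(['ACME.com', 'new.io'], seen);
```

Here `firstRun` keeps one entry per distinct domain, and `secondRun` skips `ACME.com` because its hash was recorded during the first call.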
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume EU lead generation | Localized path enumeration + inline DNS grading | Legal disclosure pages yield verified contacts; DNS validation prevents reputation decay | ~$0.005/1k leads vs $4-10/1k with 3rd party APIs |
| Rapid MVP testing | Regex-only extraction + post-processing validation | Faster iteration; validation can be deferred until product-market fit is confirmed | Higher bounce risk; acceptable for internal testing only |
| Reputation-sensitive campaigns | Strict high grade routing + SMTP warming | Prevents inbox filtering; maintains sender domain health during scale-up | Slightly lower volume; higher deliverability ROI |
| Budget-constrained operations | Self-hosted DNS resolver + LRU caching | Eliminates commercial validator markup; reduces latency to ~50ms/domain | Requires infrastructure maintenance; negligible compute cost |
Configuration Template
```json
{
  "pipeline": {
    "dns_resolver": {
      "cache_ttl_ms": 3600000,
      "max_entries": 10000,
      "fallback_resolver": "8.8.8.8",
      "query_timeout_ms": 2000
    },
    "crawler": {
      "locales": ["de", "hu", "es", "it", "fr", "pl", "pt", "nl"],
      "max_paths_per_domain": 12,
      "http_timeout_ms": 5000,
      "concurrency_limit": 8
    },
    "validation": {
      "grade_thresholds": {
        "high": ["mx_count > 0", "spf_present", "dmarc_present"],
        "medium": ["mx_count > 0"],
        "low": ["mx_count == 0"]
      },
      "catch_all_detection": {
        "enabled": true,
        "probe_timeout_ms": 3000,
        "action_on_detect": "quarantine"
      }
    },
    "deduplication": {
      "strategy": "domain_hash",
      "storage": "redis",
      "ttl_days": 30
    },
    "billing": {
      "model": "PAY_PER_EVENT",
      "cost_per_lead": 0.005,
      "preflight_abort_threshold_ms": 180000
    }
  }
}
```
Quick Start Guide
- Initialize the Validator: Instantiate `DomainValidator` with a 1-hour TTL and 10k entry limit. This creates the in-memory cache for DNS results.
- Configure Localized Paths: Load the `LOCALIZED_PATHS` dictionary and specify target locales based on your campaign geography.
- Run Preflight Estimation: Calculate expected runtime using target count × path depth × average DNS latency. Abort if the estimate exceeds your platform's timeout threshold.
- Execute Crawl & Validation: Call `LocalizedCrawler.extractFromDomain()` for each target URL. Pass the domain of each discovered email to `DomainValidator.assess()` to generate deliverability grades.
- Route by Grade: Inject `high` grade leads directly into your primary SMTP relay. Queue medium/low grades for warming infrastructure or manual verification. Apply delta deduplication before CRM ingestion.
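The routing step can be sketched as a simple partition. The `GradedLead` shape and `routeByGrade` function are hypothetical. Sending `unknown` grades to the review queue alongside medium and low is an assumption, since the guide does not say how to handle them.

```typescript
type Grade = 'high' | 'medium' | 'low' | 'unknown';

// Hypothetical shape: an address paired with the grade of its domain.
interface GradedLead {
  email: string;
  grade: Grade;
}

// Splits leads into those safe for the primary relay and those needing
// warming or manual review. Assumption: 'unknown' is treated like medium/low.
function routeByGrade(leads: GradedLead[]): { primary: string[]; review: string[] } {
  const primary: string[] = [];
  const review: string[] = [];
  for (const lead of leads) {
    if (lead.grade === 'high') primary.push(lead.email);
    else review.push(lead.email);
  }
  return { primary, review };
}

const routed = routeByGrade([
  { email: 'ann@acme.com', grade: 'high' },
  { email: 'bob@example.org', grade: 'medium' },
  { email: 'eve@nomx.test', grade: 'low' },
  { email: 'joe@pending.test', grade: 'unknown' },
]);
```

Keeping the partition as a pure function makes it easy to change the policy later, for example by splitting the review queue into separate warming and manual verification queues.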