rs. |
| Multilingual Crawl | Berlin Mitte | 85% | 65% | $0.005 | Leverages Impressum law; hit rate jumps from 5%. |
| Combined Solution | Austin TX | 53% | 47% | $0.005 | Sweet Spot: Max quality at minimal cost. |
| Combined Solution | Berlin Mitte | 85% | 65% | $0.005 | Sweet Spot: Dominates EU market coverage. |
Key Findings:
- Bounce-Bait Detection: Half of the emails found by the scraper would have bounced or harmed sender reputation if sent unchecked.
- EU Market Leverage: The German Impressum page is legally required to disclose owner contact info, making it a high-yield target. Localized paths (/impressum, /kapcsolat, etc.) are critical for non-English markets.
- Cost Efficiency: Inline validation eliminates the need for third-party services (e.g., ZeroBounce, NeverBounce), reducing costs by up to $0.01 per lead while maintaining comparable quality.
Core Solution
The optimized scraper implements a 5-layer inline validation pipeline and a multilingual contact-page crawler. This architecture ensures high deliverability and broad market coverage without external dependencies.
1. Inline Email Validation (5-Layer Probe)
Each email undergoes a multi-step verification process:
- MX Records: Checks if the domain accepts mail.
- SPF Record: Checks that the domain publishes an SPF policy (a v=spf1 TXT record).
- DMARC Record: Checks that a DMARC policy is published at _dmarc.<domain>.
- SMTP Probe: Optional RCPT TO check (often blocked by Gmail/Outlook).
- Catch-All Detection: Identifies servers that accept all addresses.
const dns = require('dns/promises');

async function validateEmail(email) {
  const [, domain] = email.split('@');
  const result = { mxRecords: 0, hasSpf: false, hasDmarc: false,
                   smtpValid: null, isCatchAll: null, deliverability: 'unknown' };

  // 1. MX records: does the domain accept mail?
  try {
    const mx = await dns.resolveMx(domain);
    result.mxRecords = mx.length;
  } catch { result.mxRecords = 0; }
  if (result.mxRecords === 0) {
    result.deliverability = 'low';
    return result;
  }

  // 2-3. SPF + DMARC TXT lookups
  try {
    const txt = await dns.resolveTxt(domain);
    result.hasSpf = txt.some(arr => arr.join('').toLowerCase().startsWith('v=spf1'));
  } catch {}
  try {
    const dmarcTxt = await dns.resolveTxt(`_dmarc.${domain}`);
    result.hasDmarc = dmarcTxt.some(arr => arr.join('').toLowerCase().startsWith('v=dmarc1'));
  } catch {}

  // 4. SMTP RCPT TO probe (optional, often blocked by Gmail/Outlook)
  // ... skipped for brevity, see full code

  // 5. Roll up to grade
  if (result.mxRecords > 0 && result.hasSpf && result.hasDmarc) {
    result.deliverability = 'high';
  } else if (result.mxRecords > 0) {
    result.deliverability = 'medium';
  }
  return result;
}
Performance: Per-domain DNS probing takes ~50ms, so validating 100 emails one after another takes ~5 seconds (100 × 50ms). Caching results per domain avoids repeat lookups across batches.
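The domain caching mentioned above could look roughly like the sketch below. It relies on the fact that the MX, SPF, and DMARC results depend only on the domain, not on the mailbox. The names createDomainCache and validateDomain are hypothetical, and the sketch assumes a validator that takes a domain and returns a promise.

```javascript
// Hypothetical helper: wraps a per-domain validator so each domain is probed
// at most once, even when several lookups for it are in flight together.
function createDomainCache(validateDomain) {
  const cache = new Map(); // domain -> Promise of the validation result

  return function validateCached(email) {
    const domain = (email.split('@')[1] || '').toLowerCase();
    if (!cache.has(domain)) {
      // Cache the promise itself so concurrent callers share one lookup.
      const pending = Promise.resolve().then(() => validateDomain(domain));
      // Evict failed lookups so a transient DNS error is not cached forever.
      pending.catch(() => cache.delete(domain));
      cache.set(domain, pending);
    }
    return cache.get(domain);
  };
}
```

In a batch where many leads share a mail domain, this turns repeated DNS round trips into a single lookup per domain. The cache lives only for the lifetime of the process; persisting it between runs would need separate storage.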
2. Multilingual Contact-Page Crawl
The scraper expands the URL frontier to include localized paths. This is critical for EU markets where legal requirements dictate contact page locations.
const CONTACT_PATHS = [
  // English
  '/contact', '/contact-us', '/about', '/about-us',
  // Hungarian
  '/kapcsolat', '/elerhetoseg',
  // German (Impressum is legally required)
  '/kontakt', '/impressum', '/ansprechpartner',
  // Spanish
  '/contacto', '/contactar', '/contactenos',
  // Italian
  '/contatti', '/contattaci',
  // French
  '/contactez-nous', '/nous-contacter',
  // Polish
  '/kontakt-z-nami', '/kontakty',
  // Czech / Slovak ('/kontakt' is already listed under German)
  '/kontaktujte-nas',
  // Portuguese (BR + PT; '/contacto' is already listed under Spanish)
  '/contato', '/contatos',
  // Dutch
  '/over-ons',
];
async function crawlForEmails(baseUrl) {
  const emails = new Set(); // de-duplicates addresses across pages
  const re = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/gi;
  for (const path of CONTACT_PATHS) {
    try {
      const html = await fetch(new URL(path, baseUrl)).then(r => r.text());
      (html.match(re) || []).forEach(e => emails.add(e.toLowerCase()));
    } catch {} // unreachable or missing pages are skipped
  }
  return [...emails];
}
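Because several languages share a path (for example /kontakt), it can help to build the URL frontier as a de-duplicated list before fetching anything. The following is a minimal sketch of that idea; buildFrontier is a hypothetical helper, and example.de is a placeholder domain.

```javascript
// Hypothetical helper: turns a base URL and a list of localized paths into
// a de-duplicated list of absolute URLs to fetch.
function buildFrontier(baseUrl, paths) {
  const seen = new Set();
  for (const path of paths) {
    // new URL() resolves the path against the base and normalizes it.
    seen.add(new URL(path, baseUrl).href);
  }
  return [...seen];
}

const frontier = buildFrontier('https://example.de/shop/', [
  '/kontakt', '/impressum', '/kontakt',
]);
console.log(frontier);
// [ 'https://example.de/kontakt', 'https://example.de/impressum' ]
```

Since every path starts with a slash, each one resolves against the site root regardless of which page the base URL points to.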
3. Delta Mode & Cost Optimization
To prevent duplicate processing and billing, the scraper implements a delta mode that skips known placeId and cid values before any billable events fire.
// Skip-before-bill check
if (knownPlaceIds.has(item.placeId) || knownCids.has(item.cid)) {
continue; // no enrich, no email validation, no billing event
}
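For context, the check above might sit inside a loop like the one sketched here. This is an illustration under assumptions, not the scraper's actual code: filterNewItems is a hypothetical name, and the sketch assumes knownPlaceIds and knownCids are Sets loaded from earlier runs.

```javascript
// Hypothetical sketch: filters scraped items against IDs from earlier runs.
function filterNewItems(items, knownPlaceIds, knownCids) {
  const fresh = [];
  for (const item of items) {
    if (knownPlaceIds.has(item.placeId) || knownCids.has(item.cid)) {
      continue; // already delivered: skip before any billable work
    }
    // Record the IDs so duplicates within the same run are skipped too.
    if (item.placeId) knownPlaceIds.add(item.placeId);
    if (item.cid) knownCids.add(item.cid);
    fresh.push(item);
  }
  return fresh;
}
```

Only the items returned by this filter would go on to enrichment, email validation, and billing.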
Architecture Decisions:
- Pay-Per-Result Billing: Switched to $0.005 per delivered lead to ensure predictable costs. Failed runs cost $0.
- Preflight Budget Check: Estimates runtime before execution. If estimated time exceeds timeout, the run is refused to prevent partial results and wasted resources.
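A preflight budget check of this kind could be sketched as follows. The function name, the parameters, and the 1.2 safety factor are all illustrative assumptions; the real estimate would depend on measured per-item timings.

```javascript
// Hypothetical preflight check: parameter names and the default safety
// factor are assumptions for illustration only.
function preflightCheck({ itemCount, msPerItem, concurrency, timeoutMs, safetyFactor = 1.2 }) {
  // Items are processed in waves of `concurrency` at a time.
  const batches = Math.ceil(itemCount / concurrency);
  const estimatedMs = Math.ceil(batches * msPerItem * safetyFactor);
  // Refuse the run up front when the estimate does not fit the timeout.
  return { ok: estimatedMs <= timeoutMs, estimatedMs };
}

const verdict = preflightCheck({
  itemCount: 1000, msPerItem: 2000, concurrency: 10, timeoutMs: 300000,
});
if (!verdict.ok) {
  throw new Error(`Run refused: estimated ${verdict.estimatedMs}ms exceeds timeout`);
}
```

Refusing before any work starts means no partial dataset is produced and no billable event fires for a run that could not have finished.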
Pitfall Guide
- Catch-All Servers: Domains may accept every RCPT TO command, then bounce or silently discard the message later. Always implement catch-all detection to avoid false positives.
- Localization Blindspots: Assuming /contact exists in non-English markets leads to missed data. Use a comprehensive list of localized paths (e.g., /impressum for Germany, /kapcsolat for Hungary).
- SPF/DMARC Neglect: Sending to domains without SPF or DMARC records increases spam risk. Include these checks in your validation pipeline to protect sender reputation.
- SMTP Probe Blocking: Major providers like Gmail and Outlook often block SMTP probes. Rely on DNS-based validation (MX/SPF/DMARC) as the primary method, with SMTP as optional.
- Duplicate Processing Costs: Without delta mode, recurring scrapes may reprocess and re-bill known leads. Implement placeId/cid deduplication to skip billable events for existing data.
- Runtime Timeouts: Large datasets can exceed actor timeouts. Use a preflight budget check to estimate runtime and adjust configuration before execution.
- Cost Blowout with External Validators: Using third-party services adds $0.004-$0.010 per lead. Inline validation reduces costs significantly while maintaining quality.
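One common way to detect the catch-all servers described in the first pitfall is to probe an address that almost certainly does not exist and see whether the server accepts it anyway. The sketch below illustrates that technique only. probeRcpt is an assumed function, not part of the code shown earlier: it is expected to perform the SMTP RCPT TO check and resolve to true (accepted), false (rejected), or null (blocked or inconclusive).

```javascript
// Hypothetical sketch: `probeRcpt` is an assumed, caller-supplied function
// that runs an SMTP RCPT TO check for one address.
async function detectCatchAll(domain, probeRcpt) {
  // A random local part that should not exist on any real server.
  const token = Math.random().toString(36).slice(2, 12);
  const bogus = `no-such-user-${token}@${domain}`;
  const accepted = await probeRcpt(bogus);
  if (accepted === null) return null; // probe blocked: result unknown
  return accepted === true;           // fake address accepted => catch-all
}
```

A null result should be treated as unknown rather than as a pass, since providers that block probes give no signal either way.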
Deliverables
Blueprint: Inline Validation & Multilingual Crawl Architecture
- 5-Layer Validation Pipeline: MX → SPF → DMARC → SMTP → Catch-All.
- Multilingual Path List: 20+ localized contact paths covering major EU and global markets.
- Delta Mode: Skip-before-bill logic for recurring scrapes.
- Preflight Check: Runtime estimation to prevent timeouts.
Checklist: Implementation Guide
Configuration Templates
- Validation Thresholds:
  - high: MX > 0 AND SPF AND DMARC.
  - medium: MX > 0.
  - low: MX = 0.
- Contact Paths Array: Use the CONTACT_PATHS list provided in the Core Solution.
- Billing Config: $0.005 per delivered lead + $0.00005 per run start.
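As a quick worked example of the billing config, the sketch below estimates the cost of one run. The function and constant names are hypothetical, and it assumes the run-start fee is charged once per run that delivers results. Prices are kept in micro-dollars so the arithmetic stays in integers.

```javascript
// Prices from the billing config, in micro-dollars to avoid float drift.
const PRICE_PER_LEAD_MICROS = 5000; // $0.005 per delivered lead
const PRICE_PER_RUN_MICROS = 50;    // $0.00005 per run start

// Hypothetical helper: estimated cost in USD for one run.
function estimateRunCostUsd(deliveredLeads) {
  const micros = PRICE_PER_RUN_MICROS + deliveredLeads * PRICE_PER_LEAD_MICROS;
  return micros / 1e6;
}

console.log(estimateRunCostUsd(1000)); // 5.00005
```

At 1,000 delivered leads the run-start fee is negligible, so the per-lead price effectively determines the bill.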
Actor Link: Google Maps Email Extractor with Built-in Email Validation
- Free tier includes ~100 leads for testing.
- Source code patterns are MIT-licensed for reuse.