My Google Maps scraper has been live for 2 weeks. Half the emails were bounce-bait: here's what I added.
Current Situation Analysis
The standard Google Maps scraping pipeline on platforms like Apify follows a rigid, unvalidated workflow: search Google Maps → harvest place URLs → visit each website → regex-grep emails → return JSON. This produces JSON that looks complete, but it ignores email deliverability and regional web conventions.
In practice, ~50% of scraped emails fail delivery due to:
- Typos & Dead Domains: Malformed addresses (`info@compant.com`) or domains returning `NXDOMAIN` on MX lookup.
- Catch-All Servers: Domains that accept any `RCPT TO` command but silently bounce or discard messages later.
- Missing Authentication: Absence of SPF/DMARC records at the receiver side, which severely damages sender domain reputation when bulk outreach is attempted.
- Localization Blind Spots: Non-English markets rarely use `/contact`. EU countries legally mandate or culturally prefer localized paths (e.g., German `Impressum`, Hungarian `kapcsolat`, Spanish `contacto`). Traditional scrapers return near-zero hits in these regions because they only probe English-centric URL frontiers.
Without inline validation and localized crawling, scraper outputs become bounce-bait, wasting compute cycles and risking IP/domain blacklisting.
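To make the catch-all case concrete: a common way to detect it (an assumption here, not necessarily what this scraper does) is to send `RCPT TO` for a random mailbox that almost certainly does not exist on the same domain. If the server accepts that address too, its acceptance of the scraped address proves nothing. The sketch below covers only the decision logic; `randomMailbox` and `classifyRcpt` are illustrative names, and the SMTP conversation that produces the reply codes is left out.

```javascript
const crypto = require('crypto');

// Build a mailbox that almost certainly does not exist on the domain.
// Hypothetical helper; the "probe-" prefix is an arbitrary choice.
function randomMailbox(domain) {
  return `probe-${crypto.randomBytes(8).toString('hex')}@${domain}`;
}

// Interpret the SMTP reply codes for two RCPT TO commands:
//   realCode   - reply for the address that was scraped
//   randomCode - reply for a random mailbox on the same domain
// 2xx means accepted, 5xx means permanently rejected. Anything else
// (4xx greylisting, timeouts, blocked port 25) is treated as inconclusive.
function classifyRcpt(realCode, randomCode) {
  const accepted = code => code >= 200 && code < 300;
  const rejected = code => code >= 500 && code < 600;

  if (rejected(realCode)) return { smtpValid: false, isCatchAll: false };
  if (!accepted(realCode)) return { smtpValid: null, isCatchAll: null };
  if (accepted(randomCode)) return { smtpValid: null, isCatchAll: true };
  if (rejected(randomCode)) return { smtpValid: true, isCatchAll: false };
  return { smtpValid: null, isCatchAll: null };
}
```

Only the combination "real address accepted, random address rejected" yields a positive result; every other outcome stays unknown or negative, which keeps a catch-all domain from being graded as verified.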
WOW Moment: Key Findings
| Approach | Email Hit Rate | High Deliverability Rate | Validation Cost per 1k Leads |
|---|---|---|---|
| Traditional Regex-Only Scraper | ~50-60% | ~25% | $4.00 - $10.00 |
| Enhanced Scraper (Inline Validation + Multilingual Crawl) | ~53% (US) / ~85% (EU) | ~25% (US) / ~65% (EU) | $0.00 (Built-in) |
Key Findings:
- Inline DNS/SMTP validation filters out the ~50% of emails that are undeliverable before they ever enter a CRM or outreach sequence.
- Adding multilingual contact-page paths (`/impressum`, `/kapcsolat`, `/contactez-nous`, etc.) raises the EU email hit rate from ~5% to ~85%, driven by legal disclosure requirements like the German `Impressum`.
- Built-in validation eliminates the need for third-party validators (ZeroBounce, Kickbox, etc.), saving $4–$10 per 1,000 leads while keeping per-domain probe latency at ~50ms.
Core Solution
The enhanced architecture introduces two critical layers to the standard pipeline: a five-layer DNS/SMTP validation probe and a localized URL frontier crawler. Both run inline, keeping latency low and costs predictable.
1. Inline Email Validation (MX/SPF/DMARC + Catch-All Detection)
Per-domain DNS probing takes ~50ms. For 100 emails, total validation time stays under 5 seconds. Domain-level caching drastically reduces redundant lookups across batches.
```javascript
const dns = require('dns/promises');

async function validateEmail(email) {
  const [, domain] = email.split('@');
  const result = { mxRecords: 0, hasSpf: false, hasDmarc: false,
                   smtpValid: null, isCatchAll: null, deliverability: 'unknown' };

  // 1. MX records: does the domain accept mail?
  try {
    const mx = await dns.resolveMx(domain);
    result.mxRecords = mx.length;
  } catch { result.mxRecords = 0; }

  // No MX records means nothing can be delivered; stop early.
  if (result.mxRecords === 0) {
    result.deliverability = 'low';
    return result;
  }

  // 2-3. SPF + DMARC TXT lookups (each TXT record arrives as an array of chunks)
  try {
    const txt = await dns.resolveTxt(domain);
    result.hasSpf = txt.some(arr => arr.join('').toLowerCase().startsWith('v=spf1'));
  } catch {}
  try {
    const dmarcTxt = await dns.resolveTxt(`_dmarc.${domain}`);
    result.hasDmarc = dmarcTxt.some(arr => arr.join('').toLowerCase().startsWith('v=dmarc1'));
  } catch {}

  // 4. SMTP RCPT TO probe (optional, often blocked by Gmail/Outlook)
  // ... skipped for brevity, see full code

  // 5. Roll up to grade
  if (result.mxRecords > 0 && result.hasSpf && result.hasDmarc) {
    result.deliverability = 'high';
  } else if (result.mxRecords > 0) {
    result.deliverability = 'medium';
  }
  return result;
}
```
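MX, SPF, and DMARC are properties of the domain rather than the mailbox, so the domain-level caching mentioned earlier can be sketched as a small memoizing wrapper. This is a minimal illustration, not the author's implementation: `makeDomainCache` and `probeDomain` are hypothetical names, and the probe function is injected so that any per-domain lookup can be plugged in.

```javascript
// Wrap a per-domain probe so each domain is looked up at most once.
// probeDomain is any async function (domain) => result; it is an
// assumption of this sketch, not part of the original code.
function makeDomainCache(probeDomain) {
  const cache = new Map(); // domain -> Promise of the probe result

  return function cachedProbe(email) {
    const domain = email.slice(email.lastIndexOf('@') + 1).toLowerCase();
    if (!cache.has(domain)) {
      // Store the promise itself so concurrent callers share one lookup.
      cache.set(domain, probeDomain(domain));
    }
    return cache.get(domain);
  };
}
```

With this shape, `info@example.com` and `sales@example.com` trigger a single DNS round trip. A failed lookup is cached as well; whether to evict and retry such entries is a design choice the sketch leaves open.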
2. Multilingual Contact-Page Crawl
Expands the URL frontier beyond `/contact` to match regional conventions and legal requirements.
```javascript
// Candidate contact paths by language. Wrapped in a Set so that paths shared
// between languages ('/kontakt', '/contacto') are only fetched once per site.
const CONTACT_PATHS = [...new Set([
  // English
  '/contact', '/contact-us', '/about', '/about-us',
  // Hungarian
  '/kapcsolat', '/elerhetoseg',
  // German (Impressum is legally required)
  '/kontakt', '/impressum', '/ansprechpartner',
  // Spanish
  '/contacto', '/contactar', '/contactenos',
  // Italian
  '/contatti', '/contattaci',
  // French
  '/contactez-nous', '/nous-contacter',
  // Polish
  '/kontakt-z-nami', '/kontakty',
  // Czech / Slovak
  '/kontakt', '/kontaktujte-nas',
  // Portuguese (BR + PT)
  '/contato', '/contatos', '/contacto',
  // Dutch
  '/over-ons',
])];

async function crawlForEmails(baseUrl) {
  const emails = new Set();
  const re = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/gi;
  for (const path of CONTACT_PATHS) {
    try {
      // new URL(path, baseUrl) resolves the path against the site origin
      const html = await fetch(new URL(path, baseUrl)).then(r => r.text());
      (html.match(re) || []).forEach(e => emails.add(e.toLowerCase()));
    } catch {
      // unreachable path or network error: move on to the next candidate
    }
  }
  return [...emails];
}
```
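The extraction step can be exercised offline on a sample page. The sketch below pulls the regex match into a standalone function and, as an added assumption not present in the original, drops matches that end in an image extension, because retina asset names such as `logo@2x.png` satisfy the same pattern. The HTML fragment and the `extractEmails` name are made up for illustration.

```javascript
const EMAIL_RE = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/gi;
// Assumed filter: extensions that appear in asset names like logo@2x.png.
const ASSET_EXT_RE = /\.(png|jpe?g|gif|svg|webp)$/i;

function extractEmails(html) {
  const found = new Set();
  for (const match of html.match(EMAIL_RE) || []) {
    const email = match.toLowerCase();
    if (!ASSET_EXT_RE.test(email)) found.add(email);
  }
  return [...found];
}

// Made-up Impressum fragment for demonstration.
const sampleHtml = `
  <img src="/img/logo@2x.png">
  <p>Kontakt: <a href="mailto:Info@Example.de">Info@Example.de</a></p>
`;

console.log(extractEmails(sampleHtml));
```

On this fragment the image name is discarded and the two spellings of the address collapse into a single lowercase entry.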
3. Architecture & Billing Decisions
- Preflight Budget Check: Estimates runtime before execution. If the estimate exceeds the timeout, the run is refused (zero events charged) with explicit configuration guidance. This reduced the timeout rate from 20% to ~0%.
- Pay-Per-Result Billing: Switched from CU-based billing to `PAY_PER_EVENT` ($0.005 per delivered lead + $0.00005 per run start). Failed/timed-out runs cost $0, providing predictable budgeting.
- Delta Mode: Skips known `placeId` and `cid` values before any billable event fires, ensuring recurring weekly scrapes only charge for genuinely new businesses.
```javascript
// Skip-before-bill check
if (knownPlaceIds.has(item.placeId) || knownCids.has(item.cid)) {
  continue; // no enrich, no email validation, no billing event
}
```
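The preflight budget check can be sketched as a pure function that compares an estimated runtime against the run timeout before any work starts. The source does not give the estimation formula, so everything here is assumed: the per-place timing, the concurrency, the safety margin, and the names `estimateRuntimeSecs` and `preflight` are placeholders to be replaced with measured values.

```javascript
// Rough runtime model. All inputs are assumptions, not measured values.
function estimateRuntimeSecs({ maxResults, secsPerPlace, concurrency }) {
  return Math.ceil(maxResults / concurrency) * secsPerPlace;
}

// Refuse the run up front when the estimate does not fit in the timeout.
function preflight(input, timeoutSecs, safetyMargin = 0.8) {
  const estimate = estimateRuntimeSecs(input);
  const budget = timeoutSecs * safetyMargin;
  if (estimate > budget) {
    return {
      ok: false,
      estimate,
      message: `Estimated ${estimate}s exceeds the ${budget}s budget; ` +
        'lower maxResults or raise the timeout.',
    };
  }
  return { ok: true, estimate };
}

// A small run fits; a large run on low concurrency is refused before it starts.
console.log(preflight({ maxResults: 500, secsPerPlace: 4, concurrency: 10 }, 3600));
console.log(preflight({ maxResults: 5000, secsPerPlace: 4, concurrency: 5 }, 3600));
```

Returning a message instead of throwing lets the caller exit cleanly without firing a billable event, which matches the "refused, zero events charged" behavior described above.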
Pitfall Guide
- Relying Solely on Regex Extraction: Extracting emails without DNS/SMTP validation leaves roughly half of the addresses likely to bounce, which damages sender domain reputation and triggers spam filters.
- Hardcoding English Contact Paths: Assuming `/contact` or `/about` exists fails in EU markets where legal mandates (e.g., German `Impressum`) or localization dictate different URLs, resulting in near-zero hit rates.
- Ignoring Catch-All & SPF/DMARC Status: Accepting emails from catch-all servers without verification leads to silent bounces. Missing SPF/DMARC at the receiver side increases the likelihood of your outreach landing in spam or being rejected outright.
- Billing Per Compute Unit (CU) Instead of Per Result: CU-based pricing makes costs unpredictable, especially for failed or timed-out runs. Switching to `PAY_PER_EVENT` aligns costs directly with delivered value and eliminates waste.
- Skipping Preflight Runtime Estimation: Launching scrapes without estimating runtime causes frequent timeouts and wasted events. Implementing a preflight check prevents execution when configurations exceed platform limits.
- Not Implementing Delta/Dedup Mode: Re-scraping identical `placeId` or `cid` values wastes bandwidth, compute, and budget. Delta checks must run before any enrichment or billing events to ensure you only pay for new leads.
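For the billing pitfall, the quoted `PAY_PER_EVENT` rates make run cost a simple function of delivered leads. The following is a small worked sketch using the prices stated in the billing section, with integer arithmetic to avoid floating-point drift. The function names are illustrative, and the assumption is that the run start and each delivered lead are the only billable events.

```javascript
// Prices from the article, expressed in units of $0.00001 so the math stays integer.
const UNITS_PER_DOLLAR = 100000;
const PRICE_PER_LEAD_UNITS = 500;    // $0.005 per delivered lead
const PRICE_PER_RUN_START_UNITS = 5; // $0.00005 per run start

// Assumes one run-start event per run and one event per delivered lead.
function estimateCostUnits(deliveredLeads, runs = 1) {
  return deliveredLeads * PRICE_PER_LEAD_UNITS + runs * PRICE_PER_RUN_START_UNITS;
}

function formatUsd(units) {
  return `$${(units / UNITS_PER_DOLLAR).toFixed(5)}`;
}

// First weekly run delivers 1,000 leads; a later delta run finds 40 new ones.
console.log(formatUsd(estimateCostUnits(1000)));
console.log(formatUsd(estimateCostUnits(40)));
```

Under these assumptions the first run costs about $5 and the delta run about $0.20, which shows why skipping known places before any billable event matters for recurring scrapes.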
Deliverables
- Deployment Blueprint: Step-by-step architecture guide for implementing inline DNS validation + multilingual URL frontier on Apify, including preflight estimation, delta deduplication, and `PAY_PER_EVENT` billing configuration.
- Validation & Localization Checklist: Pre-run verification steps covering MX/SPF/DMARC probe thresholds, catch-all detection fallbacks, regional contact path coverage, and domain caching strategies to keep per-domain latency at ~50ms.
- Configuration Templates: Ready-to-use JSON inputs for `geoGridTiles` (viewport density), `maxResults` (pagination limits), delta mode dataset ID injection, and customizable `CONTACT_PATHS` arrays for rapid market localization.
