DevOps · 2026-05-05 · 41 min read

My Google Maps scraper is live for 2 weeks. Half the emails were bounce-bait: here's what I added.

By ByteHarvester

Current Situation Analysis

The standard Google Maps scraping pipeline on platforms like Apify follows a rigid, unvalidated workflow: Search Google Maps → harvest place URLs → visit each website → regex-grep emails → return JSON. While this approach produces seemingly complete JSON outputs, it fundamentally ignores email deliverability and regional web conventions.

In practice, ~50% of scraped emails fail delivery due to:

  • Typos & Dead Domains: Malformed addresses (info@compant.com) or domains returning NXDOMAIN on MX lookup.
  • Catch-All Servers: Domains that accept any RCPT TO command but silently bounce or discard messages later.
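  • Missing Authentication: No SPF/DMARC records on the recipient's domain. These records describe mail the domain sends, so their absence does not cause a bounce directly; it signals a poorly maintained mail setup, and bulk outreach to such addresses puts sender domain reputation at risk.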
  • Localization Blind Spots: Non-English markets rarely use /contact. EU countries legally mandate or culturally prefer localized paths (e.g., German Impressum, Hungarian kapcsolat, Spanish contacto). Traditional scrapers return near-zero hits in these regions because they only probe English-centric URL frontiers.

Without inline validation and localized crawling, scraper outputs become bounce-bait, wasting compute cycles and risking IP/domain blacklisting.

WOW Moment: Key Findings

| Approach | Email Hit Rate | High Deliverability Rate | Cost per 1k Leads |
|---|---|---|---|
| Traditional Regex-Only Scraper | ~50-60% | ~25% | $4.00 - $10.00 |
| Enhanced Scraper (Inline Validation + Multilingual Crawl) | ~53% (US) / ~85% (EU) | ~25% (US) / ~65% (EU) | $0.00 (Built-in) |

Key Findings:

  • Inline DNS/SMTP validation filters out ~50% of unviable emails before they ever enter a CRM or outreach sequence.
  • Adding multilingual contact-page paths (/impressum, /kapcsolat, /contactez-nous, etc.) jumps the EU email hit rate from ~5% to ~85%, driven by legal disclosure requirements like the German Impressum.
  • Built-in validation eliminates the need for third-party validators (ZeroBounce, Kickbox, etc.), saving $4–$10 per 1,000 leads while maintaining sub-50ms per-domain probe latency.
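
The filtering step in the first finding can be sketched as a small pure function. This is a minimal illustration, assuming each lead carries a deliverability field with one of the grades produced by the validation code later in this post ('high', 'medium', 'low', 'unknown'). The names GRADE_RANK and keepDeliverable are hypothetical, and ranking 'unknown' just above 'low' is a judgment call rather than something the scraper is confirmed to do.

```javascript
// Assumed ordering of grades; the position of 'unknown' is an assumption
const GRADE_RANK = new Map([['low', 0], ['unknown', 1], ['medium', 2], ['high', 3]]);

// Hypothetical helper: keep only leads graded at or above minGrade
function keepDeliverable(leads, minGrade = 'medium') {
  const threshold = GRADE_RANK.get(minGrade);
  if (threshold === undefined) {
    throw new Error(`Unknown grade: ${minGrade}`);
  }
  return leads.filter(lead => {
    // Leads with a missing or unrecognized grade are treated as 'unknown'
    const rank = GRADE_RANK.get(lead.deliverability) ?? GRADE_RANK.get('unknown');
    return rank >= threshold;
  });
}
```

Calling keepDeliverable(leads) before pushing to a CRM drops 'low' and 'unknown' leads, while keepDeliverable(leads, 'high') keeps only domains with MX, SPF, and DMARC all present.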

Core Solution

The enhanced architecture introduces two critical layers to the standard pipeline: a five-layer DNS/SMTP validation probe and a localized URL frontier crawler. Both run inline, keeping latency low and costs predictable.

1. Inline Email Validation (MX/SPF/DMARC + Catch-All Detection)

Per-domain DNS probing takes ~50ms. For 100 emails, total validation time stays under 5 seconds. Domain-level caching drastically reduces redundant lookups across batches.

const dns = require('dns/promises');

async function validateEmail(email) {
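  // Take everything after the last '@' and lowercase it (DNS names are case-insensitive)
  const domain = email.slice(email.lastIndexOf('@') + 1).toLowerCase();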
  const result = { mxRecords: 0, hasSpf: false, hasDmarc: false,
                    smtpValid: null, isCatchAll: null, deliverability: 'unknown' };

  // 1. MX records: does the domain accept mail?
  try {
    const mx = await dns.resolveMx(domain);
    result.mxRecords = mx.length;
  } catch { result.mxRecords = 0; }

  if (result.mxRecords === 0) {
    result.deliverability = 'low';
    return result;
  }

  // 2-3. SPF + DMARC TXT lookups
  try {
    const txt = await dns.resolveTxt(domain);
    result.hasSpf = txt.some(arr => arr.join('').toLowerCase().startsWith('v=spf1'));
  } catch {}
  try {
    const dmarcTxt = await dns.resolveTxt(`_dmarc.${domain}`);
    result.hasDmarc = dmarcTxt.some(arr => arr.join('').toLowerCase().startsWith('v=dmarc1'));
  } catch {}

  // 4. SMTP RCPT TO probe (optional, often blocked by Gmail/Outlook)
  // ... skipped for brevity, see full code

  // 5. Roll up to grade
  if (result.mxRecords > 0 && result.hasSpf && result.hasDmarc) {
    result.deliverability = 'high';
  } else if (result.mxRecords > 0) {
    result.deliverability = 'medium';
  }

  return result;
}
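
The domain-level caching mentioned above is not shown in the snippet. One way it could look, as a rough sketch: wrap any per-email validator so that each domain is probed once per process. The helper name withDomainCache is hypothetical, and the in-memory Map is an assumption; the scraper's actual cache may be persistent or shaped differently.

```javascript
// Hypothetical helper: wraps a per-email validator with a per-domain cache.
// Assumption: the validator's result depends only on the domain (DNS layers).
function withDomainCache(validateFn) {
  const cache = new Map(); // domain -> Promise of the validation result

  return function cachedValidate(email) {
    const domain = email.slice(email.lastIndexOf('@') + 1).toLowerCase();
    if (!cache.has(domain)) {
      // Cache the promise itself so concurrent calls for one domain share a single lookup
      cache.set(domain, validateFn(email));
    }
    return cache.get(domain);
  };
}
```

With const validateCached = withDomainCache(validateEmail), a batch containing many addresses at the same domain triggers one set of DNS queries for that domain. Per-mailbox checks such as an SMTP RCPT TO probe should stay outside this cache, since their result differs per address.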

2. Multilingual Contact-Page Crawl

Expands the URL frontier beyond /contact to match regional conventions and legal requirements.

const CONTACT_PATHS = [
  // English
  '/contact', '/contact-us', '/about', '/about-us',

  // Hungarian
  '/kapcsolat', '/elerhetoseg',

  // German (Impressum is legally required)
  '/kontakt', '/impressum', '/ansprechpartner',

  // Spanish
  '/contacto', '/contactar', '/contactenos',

  // Italian
  '/contatti', '/contattaci',

  // French
  '/contactez-nous', '/nous-contacter',

  // Polish
  '/kontakt-z-nami', '/kontakty',

  // Czech / Slovak
  '/kontakt', '/kontaktujte-nas',

  // Portuguese (BR + PT)
  '/contato', '/contatos', '/contacto',

  // Dutch
  '/over-ons',
];

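async function crawlForEmails(baseUrl) {
  const emails = new Set();
  const re = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/gi;
  // Dedupe first: '/kontakt' and '/contacto' are listed under more than one language
  for (const path of new Set(CONTACT_PATHS)) {
    try {
      const res = await fetch(new URL(path, baseUrl));
      if (!res.ok) continue; // skip 404s and other error pages
      const html = await res.text();
      (html.match(re) || []).forEach(e => emails.add(e.toLowerCase()));
    } catch {} // network or URL errors: move on to the next path
  }
  return [...emails];
}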
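
One caveat with the regex above: it also matches asset filenames that happen to contain an @, such as retina image names. A possible post-filter is sketched below. The function name dropAssetLookalikes and the extension list are assumptions made for illustration; nothing in the original code confirms that the scraper filters this way.

```javascript
// Assumption: this extension list is illustrative, not exhaustive
const ASSET_EXTENSIONS = new Set(['png', 'jpg', 'jpeg', 'gif', 'svg', 'webp', 'css', 'js']);

// Hypothetical helper: drop regex hits whose "TLD" is really a file extension
function dropAssetLookalikes(candidates) {
  return candidates.filter(candidate => {
    const lastLabel = candidate.slice(candidate.lastIndexOf('.') + 1).toLowerCase();
    return !ASSET_EXTENSIONS.has(lastLabel);
  });
}

// Same pattern as in crawlForEmails
const emailPattern = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/gi;
const sampleHtml = 'Write to info@example.com <img src="/img/logo@2x.png">';
const cleaned = dropAssetLookalikes(sampleHtml.match(emailPattern) || []);
// cleaned → ['info@example.com']
```

Running the filter before validation also saves a DNS lookup on names like 2x.png that can never resolve to a mail domain.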

3. Architecture & Billing Decisions

  • Preflight Budget Check: Estimates runtime before execution. If estimated time exceeds timeout, the run is refused (zero events charged) with explicit configuration guidance. Reduced timeout rate from 20% to ~0%.
  • Pay-Per-Result Billing: Switched from CU-based to PAY_PER_EVENT ($0.005 per delivered lead + $0.00005 per run start). Failed/timed-out runs cost $0, providing predictable budgeting.
  • Delta Mode: Skips known placeId and cid before any billable event fires, ensuring recurring weekly scrapes only charge for genuinely new businesses.

// Skip-before-bill check
if (knownPlaceIds.has(item.placeId) || knownCids.has(item.cid)) {
  continue; // no enrich, no email validation, no billing event
}
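
The preflight budget check described in the first bullet could be approximated as follows. This is a sketch under stated assumptions: the function name preflightCheck, every parameter name (secondsPerPlace, concurrency, timeoutSecs, safetyFactor), and the linear runtime model are all hypothetical, and the scraper's real estimator is not shown in this post.

```javascript
// Hypothetical preflight check. All names and the linear cost model are assumptions.
function preflightCheck({ maxResults, secondsPerPlace, concurrency, timeoutSecs, safetyFactor = 1.25 }) {
  // Estimated wall-clock time: total work divided across workers, padded by a safety margin
  const estimatedSecs = Math.ceil((maxResults * secondsPerPlace / concurrency) * safetyFactor);

  if (estimatedSecs > timeoutSecs) {
    // Largest maxResults that still fits inside the timeout under the same model
    const affordable = Math.floor((timeoutSecs / safetyFactor) * concurrency / secondsPerPlace);
    return {
      ok: false,
      estimatedSecs,
      message: `Estimated ${estimatedSecs}s exceeds the ${timeoutSecs}s timeout. ` +
               `Lower maxResults to about ${affordable} or raise the timeout.`,
    };
  }
  return { ok: true, estimatedSecs };
}
```

The caller would run this before any scraping or billing event and exit early with the message when ok is false, which is what keeps a refused run at zero charged events.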

Pitfall Guide

  1. Relying Solely on Regex Extraction: Extracting emails without DNS/SMTP validation guarantees ~50% bounce rates, destroying sender domain reputation and triggering spam filters.
  2. Hardcoding English Contact Paths: Assuming /contact or /about exists fails in EU markets where legal mandates (e.g., German Impressum) or localization dictate different URLs, resulting in near-zero hit rates.
  3. Ignoring Catch-All & SPF/DMARC Status: Accepting emails from catch-all servers without verification leads to silent bounces. Missing SPF/DMARC on the recipient's domain does not block inbound mail by itself, but it signals a poorly maintained mail setup, so treat those addresses as lower-confidence targets.
  4. Billing Per Compute Unit (CU) Instead of Per Result: CU-based pricing makes costs unpredictable, especially for failed or timed-out runs. Switching to PAY_PER_EVENT aligns costs directly with delivered value and eliminates waste.
  5. Skipping Preflight Runtime Estimation: Launching scrapes without estimating runtime causes frequent timeouts and wasted events. Implementing a preflight check prevents execution when configurations exceed platform limits.
  6. Not Implementing Delta/Dedup Mode: Re-scraping identical placeId or cid values wastes bandwidth, compute, and budget. Delta checks must run before any enrichment or billing events to ensure you only pay for new leads.

Deliverables

  • 📘 Deployment Blueprint: Step-by-step architecture guide for implementing inline DNS validation + multilingual URL frontier on Apify, including preflight estimation, delta deduplication, and PAY_PER_EVENT billing configuration.
  • ✅ Validation & Localization Checklist: Pre-run verification steps covering MX/SPF/DMARC probe thresholds, catch-all detection fallbacks, regional contact path coverage, and domain caching strategies to maintain sub-50ms latency.
  • ⚙️ Configuration Templates: Ready-to-use JSON inputs for geoGridTiles (viewport density), maxResults (pagination limits), delta mode dataset ID injection, and customizable CONTACT_PATHS arrays for rapid market localization.