Browser Automation & DIY Techniques

Playwright, Puppeteer, Selenium, community tricks from Reddit

Headless Browser Automation (Playwright, Puppeteer, Selenium)

Browser Automation Frameworks

The foundation technique used by most scrapers. A browser renders Maps, executes JS, and extracts from the DOM.

Playwright (Most Popular)

Microsoft's library. Used by gosom/google-maps-scraper, HasData, many commercial tools. Supports Chromium, Firefox, WebKit.

Puppeteer (Node.js)

Google's Node.js browser automation. puppeteer-extra-plugin-stealth provides 17 evasion modules.

Selenium (Legacy)

Original framework. undetected-chromedriver patches for detection evasion.

Extraction Strategies

StrategyToken CostWhen to Use
CSS selectors (querySelectorAll)~52/itemKnown structure — default choice
aria-label attributes~52/itemMore resilient — accessibility attrs are stabler than CSS classes
body.innerText~5K/pageDiscovery — learn structure once, then switch
Network/XHR interceptionMinimalCapture protobuf responses directly — best approach
Accessibility tree (filtered)~28K/pageFind buttons, forms, interactive elements
Screenshot~132KCAPTCHA solving, visual debugging only

DIY Scraping Techniques (Reddit & Community)

Community-Sourced Techniques

Practical techniques from Reddit, HN, and forums — the stuff not in vendor blog posts.

Grid-Scanning / Map Splitting (Essential)

Overcome the 120-result limit per search by dividing areas into smaller grid cells. Auto-adjust zoom to ~16 per cell. Can use QGIS + Python to generate coordinate grids. Tune tightness to urban density.

https://www.google.com/maps/search/{query}/@{lat},{long},16z

Used by every serious scraper (gosom, Apify actors, Octoparse). Hexagonal grid sampling is the academic version.

"Search in This Area" Automation

Moving the map and clicking "search in this area" reveals different/more results than initial search. Automate by systematically panning across coordinates.

HTTP-Only / No Browser Approach (50x Cheaper)

Replicate Google Maps' internal protobuf HTTP calls directly — no headless browser. One HN user reduced costs from $50K/mo to $1K/mo. "Built entirely with cURL, avoiding headless browsers since they're really slow." Google rarely updates their map API structure.

Dual-Architecture Engine

Use high-performance HTTP extraction for bulk data, only spin up headless browsers for dynamic elements. Extracts Booking URL, FID, CID from metadata. Faster than pure Playwright/Puppeteer.

Place ID Pre-Extraction

Get Place IDs first (cheap/fast), then pipe them for direct detail extraction without browser search overhead — data comes back "almost instantly."

CID-Based Direct Access

CID (Customer ID) is a 64-bit decimal that never changes, even through rebrand/relocation. More permanent than Place ID (which expires after 12 months).

https://www.google.com/maps/place/?cid=<CID_NUMBER>

Find CID: URL ?cid= parameter, or inspect Knowledge Panel for "ludocid".

Reference: Scrap.io CID Guide, CID Converter

LLM-Powered Query Expansion

Use LLMs to generate search term synonyms to overcome per-query result limits. "dentist" becomes "dental clinic", "oral surgeon", "dental practice", "orthodontist" etc. Multiplies coverage without geographic splitting.

Other Community Tricks

Sources