Website Crawl — states, depth & request fingerprint

What our own crawler fetches, to what depth, how it identifies itself, and which surface each crawl feeds. Code: worker/src/handlers/crawl-website.js (crawlWebsite()), driven by worker/src/handlers/prospect-crawl-enrich.js, compose-deal-brief.js, and the advisor crawl_website tool.

Last updated 2026-05-24.

Request fingerprint (identical across all states)

User-Agent: Mozilla/5.0 (compatible; GoldenPagesBot/1.0; +https://www.goldenpages.ie/bot)
robots.txt token: goldenpagesbot (longest-prefix group match, else *; Allow beats Disallow on a tie; * and $ wildcards honoured)
Accept: text/html,application/xml,text/plain
Redirects: followed
Timeouts: homepage 12s · each asset/sitemap file 5s · whole sitemap-index walk 6s
Caps: ≤25 sitemap files · ≤5,000 URLs counted · ≤12 inner pages sampled (deep)
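
The robots.txt precedence described above (longest-prefix group, fallback to *, longest matching rule wins, Allow wins a tie, * and $ wildcards) can be sketched as follows. This is a minimal illustration, not the code in crawl-website.js: the function names, the parsed `groups` shape, and the reading of "longest-prefix" as "the group token is a prefix of our token" are all assumptions.

```javascript
// Hypothetical helpers; the parsed-robots shape is an assumption:
// groups = [{ agents: ['*'], rules: [{ allow: false, path: '/admin/' }] }, ...]

// Turn a robots.txt path pattern into a RegExp: '*' matches any run of
// characters and a trailing '$' anchors the end of the URL path.
function patternToRegExp(pattern) {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split('*')
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp('^' + escaped + (anchored ? '$' : ''));
}

// Pick the group whose user-agent token is the longest prefix of our token,
// falling back to the '*' group when nothing more specific matches.
function pickGroup(groups, token) {
  const ours = token.toLowerCase();
  let best = null;
  for (const group of groups) {
    for (const agent of group.agents) {
      const a = agent.toLowerCase();
      if (a !== '*' && ours.startsWith(a) && (!best || a.length > best.len)) {
        best = { len: a.length, group };
      }
    }
  }
  if (best) return best.group;
  return groups.find((g) => g.agents.includes('*')) || null;
}

// Longest matching pattern wins; on equal length, Allow beats Disallow.
function isAllowed(groups, token, path) {
  const group = pickGroup(groups, token);
  if (!group) return true; // no applicable group: nothing is disallowed
  let winner = null;
  for (const rule of group.rules) {
    if (rule.path === '') continue; // an empty pattern matches nothing
    if (!patternToRegExp(rule.path).test(path)) continue;
    const longer = !winner || rule.path.length > winner.path.length;
    const tieAllow = winner && rule.path.length === winner.path.length && rule.allow;
    if (longer || tieAllow) winner = rule;
  }
  return winner ? winner.allow : true;
}
```

Called as `isAllowed(groups, 'goldenpagesbot', '/')`, the result corresponds to the homepage_allowed value reported under robots.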

We identify honestly as a known Irish directory crawler — no impersonation. We report which AI crawlers a site blocks (gptbot, google-extended, ccbot, claudebot, anthropic-ai, perplexitybot, bytespider) as an "AI-search aware" signal; we never pretend to be them.

Five well-known assets are always fetched first, in parallel with the homepage (≈0 added wall-time): robots.txt, llms.txt, sitemap.xml, sitemap_index.xml, wp-sitemap.xml. Together with the homepage, that gives a baseline of 6 requests.
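
A rough sketch of that baseline fan-out, using the documented headers and timeouts. The function names and the injected `fetchFn` parameter are hypothetical (injecting the fetch function just keeps the sketch testable offline); the real handler may be structured differently.

```javascript
// Values below are taken from the request fingerprint; everything else is an assumption.
const BASELINE_ASSETS = ['robots.txt', 'llms.txt', 'sitemap.xml', 'sitemap_index.xml', 'wp-sitemap.xml'];
const USER_AGENT = 'Mozilla/5.0 (compatible; GoldenPagesBot/1.0; +https://www.goldenpages.ie/bot)';

// Fetch one URL with a hard timeout; failures are returned, not thrown,
// so one missing asset never rejects the whole batch.
async function fetchWithTimeout(fetchFn, url, ms) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    const res = await fetchFn(url, {
      headers: { 'User-Agent': USER_AGENT, Accept: 'text/html,application/xml,text/plain' },
      redirect: 'follow',
      signal: controller.signal,
    });
    return { url, ok: res.ok, status: res.status };
  } catch (err) {
    return { url, ok: false, status: null, error: String(err) };
  } finally {
    clearTimeout(timer);
  }
}

// Homepage (12s) and the five assets (5s each) start together: 6 requests.
async function fetchBaseline(fetchFn, origin) {
  const homepage = fetchWithTimeout(fetchFn, origin + '/', 12000);
  const assets = BASELINE_ASSETS.map((name) => fetchWithTimeout(fetchFn, `${origin}/${name}`, 5000));
  const [home, ...rest] = await Promise.all([homepage, ...assets]);
  return { home, assets: rest };
}
```

In a Worker or recent Node runtime this could be called as `fetchBaseline(fetch, 'https://example.ie')`, where the origin is a placeholder.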

The three crawl states (depth tiers)

| State | What it fetches | Page count | Inner pages | ~Requests |
| --- | --- | --- | --- | --- |
| 1. Shallow | homepage + 5 well-known assets. Sitemap is probed only: if it's a <sitemapindex> we record how many child sitemaps exist but don't fetch them. | null on index sites (flat <urlset> still counts) | none | ~6 |
| 2. Full-sitemap | Shallow + BFS-walk the sitemap index (child sitemaps in parallel batches, within the 25-file / 6s / 5,000-URL caps). | exact page_count + content_page_count (taxonomy archives stripped) + by_type | none | 6 + up to 25 |
| 3. Deep | Full-sitemap + sample up to 12 inner pages (prioritised: service → location → about → contact → pricing → product → blog), batches of 4, paced by crawl-delay when respecting robots. | same as Full-sitemap | up to 12: per-type counts, cross-page NAP/social/schema/tracking | 6 + ≤25 + ≤12 |
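
The Deep sampling order can be illustrated with a small planner: rank candidates by the documented type priority, cap at 12, and split into batches of 4. The function name, the candidate shape, and the choice to place unrecognised page types last are assumptions made for this sketch.

```javascript
// Priority order, cap, and batch size come from the Deep row above.
const TYPE_PRIORITY = ['service', 'location', 'about', 'contact', 'pricing', 'product', 'blog'];
const MAX_INNER_PAGES = 12;
const BATCH_SIZE = 4;

// candidates: [{ url, type }] (hypothetical shape). Returns an array of batches.
function planDeepSample(candidates) {
  const rank = (type) => {
    const i = TYPE_PRIORITY.indexOf(type);
    return i === -1 ? TYPE_PRIORITY.length : i; // assumption: unknown types go last
  };
  const picked = candidates
    .map((c, index) => ({ ...c, index }))
    // Sort by type priority; keep discovery order within the same type.
    .sort((a, b) => rank(a.type) - rank(b.type) || a.index - b.index)
    .slice(0, MAX_INNER_PAGES)
    .map(({ index, ...page }) => page);
  const batches = [];
  for (let i = 0; i < picked.length; i += BATCH_SIZE) {
    batches.push(picked.slice(i, i + BATCH_SIZE));
  }
  return batches;
}
```

Each batch would then be fetched in parallel, with a pause between batches when crawl-delay applies.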

Depth is set by two params: deep (→ inner-page sampling, forces full sitemap) and sitemap_depth (shallow vs full; ignored when deep:true).
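
As an illustration of how the two params resolve to a state, here is a minimal sketch. The function name and return shape are hypothetical, and treating 'full' as the default sitemap_depth is an assumption inferred from the advisor default in the caller table.

```javascript
// Request ceilings follow the state table: 6, 6 + 25, 6 + 25 + 12.
function resolveCrawlState({ deep = false, sitemap_depth = 'full' } = {}) {
  if (deep) {
    // deep forces the full sitemap walk; sitemap_depth is ignored.
    return { state: 'deep', walkSitemap: true, sampleInner: true, maxRequests: 6 + 25 + 12 };
  }
  if (sitemap_depth === 'shallow') {
    return { state: 'shallow', walkSitemap: false, sampleInner: false, maxRequests: 6 };
  }
  return { state: 'full-sitemap', walkSitemap: true, sampleInner: false, maxRequests: 6 + 25 };
}
```

For example, `resolveCrawlState({ deep: true, sitemap_depth: 'shallow' })` still resolves to the Deep state, because deep takes precedence.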

What each state extracts for email/phone

  • Homepage NAP (tel: phones + emails via mailto: AND plain-text, through the shared isJunkEmail filter) is captured on every state — it's a function of the homepage HTML we already have, no extra fetch.
  • Inner-page emails/phones are added only in Deep (sampled contact/about/etc. pages).
  • The promoted crawl_email (BQ, prospect states only) is chosen by pickCrawlEmail(crawl_root): same-domain-role > same-domain > freemail, dropping off-domain non-free addresses and pages with >5 distinct emails (aggregator/directory guard). The full list is kept in crawl_raw.
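
The ranking rule for the promoted email can be sketched like this. It is not the real pickCrawlEmail: the function name, the input shape, and the role and freemail lists are placeholders chosen for illustration.

```javascript
// Placeholder lists; the real role and freemail sets are not documented here.
const ROLE_LOCALPARTS = new Set(['info', 'sales', 'hello', 'contact', 'enquiries', 'office']);
const FREEMAIL_DOMAINS = new Set(['gmail.com', 'outlook.com', 'hotmail.com', 'yahoo.com']);
const MAX_EMAILS_PER_PAGE = 5; // aggregator/directory guard from the doc

// 0 = same-domain role, 1 = same-domain, 2 = freemail, null = dropped.
function emailTier(email, siteDomain) {
  const [local, domain] = email.toLowerCase().split('@');
  if (!domain) return null; // malformed address
  const sameDomain = domain === siteDomain || domain.endsWith('.' + siteDomain);
  if (sameDomain) return ROLE_LOCALPARTS.has(local) ? 0 : 1;
  if (FREEMAIL_DOMAINS.has(domain)) return 2;
  return null; // off-domain, non-free: dropped
}

// pages: [{ emails: string[] }] (hypothetical shape). Returns the best email or null.
function rankCrawlEmail(siteDomain, pages) {
  const site = siteDomain.toLowerCase();
  let best = null;
  for (const page of pages) {
    const distinct = [...new Set(page.emails.map((e) => e.toLowerCase()))];
    if (distinct.length > MAX_EMAILS_PER_PAGE) continue; // skip directory-like pages
    for (const email of distinct) {
      const tier = emailTier(email, site);
      if (tier !== null && (best === null || tier < best.tier)) best = { email, tier };
    }
  }
  return best ? best.email : null;
}
```

Only the winning address is promoted; as noted above, the full list remains available in crawl_raw.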

robots posture (orthogonal to depth — set per caller)

  • respect_robots:true (the scaled prospect crawl): read robots.txt first and bail before touching the site if our bot is disallowed (returns robots_blocked:true — itself a recorded signal); deep sampling filters candidate URLs by robots and paces by crawl-delay.
  • respect_robots:false (advisor + deal-brief): a rep explicitly asked to see one specific site, so we fetch the homepage regardless. robots/ai_bots_blocked are still reported.
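
The two postures reduce to a small gate evaluated after robots.txt is read. The sketch below is hypothetical: the function name and return shape are invented, and reading crawl_delay as seconds is an assumption.

```javascript
// robots: { found, homepage_allowed, crawl_delay } as reported by the crawl.
function robotsGate({ respect_robots, robots }) {
  const disallowed = robots.found && robots.homepage_allowed === false;
  if (respect_robots && disallowed) {
    // Scaled prospect crawl: bail before touching the site.
    return { fetchHomepage: false, robots_blocked: true, paceMs: 0 };
  }
  // Advisor / deal-brief, or an allowed site: fetch the homepage.
  // robots_blocked only marks a bail; the robots object is still reported either way.
  const paceMs = respect_robots && robots.crawl_delay ? robots.crawl_delay * 1000 : 0;
  return { fetchHomepage: true, robots_blocked: false, paceMs };
}
```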

Who runs which state — and where the result goes

| Caller | deep | sitemap_depth | respect_robots | State | Result written to |
| --- | --- | --- | --- | --- | --- |
| prospect-crawl-enrich shallow (every-minute cron) | false | shallow | true | 1 Shallow | PROSPECT_LISTINGS MERGE on crawl_root (fans to all sibling listings): crawl_* + fit_* + crawl_email/crawl_phone + crawl_raw |
| prospect-crawl-enrich deep (PAUSED 2026-05-24) | true | full | true | 3 Deep | same MERGE + crawl_deep_at + service/location page counts |
| compose-deal-brief | false | shallow | false | 1 Shallow | deal-brief KV (deal-brief:<id>), surfaced in the brief's website block |
| advisor crawl_website (default) | false | (full) | false | 2 Full-sitemap | chat response; optionally vectorized into fcr-site-portfolio |
| advisor crawl_website deep:true | true | full | false | 3 Deep | chat response; optionally vectorized |

Only the two prospect-crawl-enrich states write to PROSPECT_LISTINGS. The crawl is free and runs on every row with a real website (it is NOT gated by enrich_enabled — that gate is for paid SerpAPI/Pleper/Ahrefs enrichment only).

Extracted parameters (full field inventory)

What crawlWebsite() returns, which depth state populates it, and whether the prospect crawl persists it to PROSPECT_LISTINGS. Homepage-derived fields are present in all three states (S/F/D); page counts need Full or Deep; a handful are Deep-only.

| Field | What | State | Persisted to PROSPECT_LISTINGS |
| --- | --- | --- | --- |
| title | <title> | S/F/D | not persisted (returned only; not a column, not in crawl_raw) |
| metaDescription | <meta name=description> | S/F/D | not persisted (returned only; not a column, not in crawl_raw) |
| metaKeywords | <meta name=keywords> | S/F/D | not persisted (returned only; not a column, not in crawl_raw) |
| textPreview | first 1,500 chars of body text | S/F/D | |
| wordCount | homepage word count | S/F/D | crawl_word_count |
| technical.is_https | SSL | S/F/D | crawl_has_ssl |
| technical.has_viewport | mobile-ready | S/F/D | crawl_mobile_ready |
| technical.has_favicon | favicon present | S/F/D | |
| technical.has_og / og_title | Open Graph tags | S/F/D | |
| technical.has_canonical | canonical link | S/F/D | |
| technical.h1 | first H1 text | S/F/D | |
| features.detected_cms | WordPress/Wix/Shopify/… | S/F/D | crawl_cms (+ derived crawl_is_spa, crawl_is_diy_builder) |
| features.has_ecommerce | product schema / Woo / Shopify / cart markup | S/F/D | crawl_has_ecommerce |
| features.has_blog | blog present | S/F/D | crawl_has_blog |
| features.has_booking | online booking (Calendly/Setmore/…) | S/F/D | crawl_has_booking |
| features.has_chat | live chat widget | S/F/D | crawl_has_chat |
| features.has_forms | contact/enquiry form | S/F/D | crawl_has_form |
| features.has_gallery | gallery/portfolio | S/F/D | |
| features.has_phone | phone on page | S/F/D | none (see crawl_phone) |
| technologies[] | Wappalyzer-style stack | S/F/D | crawl_raw (CMS/ads/GA/pixel promoted to flags below) |
| technologies → analytics/ads/pixel | GA / Google Ads / Meta Pixel tags | S/F/D | crawl_has_analytics, crawl_has_google_ads_tag, crawl_has_fb_pixel |
| structuredData.has_json_ld / types[] | JSON-LD schema | S/F/D | crawl_has_schema (+ types in crawl_raw) |
| paymentProcessors[] | Stripe/PayPal/Klarna/… | S/F/D | crawl_raw |
| aiReadiness | llms.txt / robots.txt / sitemap / structured-data / ai_bots_blocked[] | S/F/D | crawl_ai_ready (derived) + ai_bots_blocked in crawl_raw |
| robots | found / homepage_allowed / crawl_delay / ai_bots_blocked | S/F/D | via crawl_error='robots_blocked' when disallowed |
| social{} | Facebook/Insta/LinkedIn/… profile links | S/F/D | crawl_raw |
| contact.phones[] / contact.emails[] | homepage NAP (mailto: + plain-text) | S/F/D | crawl_phone / crawl_email (ranked picker) |
| pageCount | total sitemap URLs | F/D (S: null on index sites) | (raw count in crawl_raw) |
| contentPageCount | real pages, taxonomy archives stripped | F/D | crawl_page_count |
| sitemap.by_type | per-type breakdown (post/page/product/tag/…) | F/D | crawl_raw |
| deep.service_page_count | # service pages | D only | crawl_service_page_count |
| deep.location_page_count | # location pages | D only | crawl_location_page_count |
| deep.has_about/has_contact/has_pricing | trust-page presence | D only | crawl_raw |
| deep.emails/phones | NAP across sampled pages | D only | feeds crawl_email/crawl_phone |
| deep.tracking | GA / Google Ads / FB pixel across pages | D only | (promoted to the flags above) |
| deep.sample[] | per-page {url, type, wordCount} | D only | crawl_raw (trimmed) |
| fit_sitepro/seo/estore/ads/ads_crosssell/social | derived solution-fit flags | S/F/D | fit_* columns |

Note: title / metaDescription / metaKeywords are extracted and returned to the advisor + deal-brief, but are not stored on PROSPECT_LISTINGS (no column, not in crawl_raw). The richer presence-fingerprint fields (trust-page composite, lead-capture vendor, freshness, performance) are proposed, not built — see docs/crawl-presence-fingerprint-brief.md.
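
To make the field-to-column mapping concrete, here is a partial sketch covering the direct one-to-one columns from the inventory. The function name and the null-handling are assumptions; derived columns (crawl_is_spa, crawl_ai_ready, fit_*, the email/phone picker) are deliberately left out because their logic is not described here.

```javascript
// Hypothetical mapper: column names come from the field inventory,
// the rest is illustrative only.
function toProspectColumns(result) {
  const t = result.technical ?? {};
  const f = result.features ?? {};
  return {
    crawl_word_count: result.wordCount ?? null,
    crawl_has_ssl: t.is_https ?? null,
    crawl_mobile_ready: t.has_viewport ?? null,
    crawl_cms: f.detected_cms ?? null,
    crawl_has_ecommerce: f.has_ecommerce ?? null,
    crawl_has_blog: f.has_blog ?? null,
    crawl_has_booking: f.has_booking ?? null,
    crawl_has_chat: f.has_chat ?? null,
    crawl_has_form: f.has_forms ?? null,
    crawl_has_schema: result.structuredData?.has_json_ld ?? null,
    // Populated by Full-sitemap or Deep; may be null after a Shallow crawl.
    crawl_page_count: result.contentPageCount ?? null,
    // Deep-only fields; null in the other two states.
    crawl_service_page_count: result.deep?.service_page_count ?? null,
    crawl_location_page_count: result.deep?.location_page_count ?? null,
  };
}
```

Note that title, metaDescription, and metaKeywords do not appear in the output, matching the note above.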

Which crawler runs in which flow (important: these scripts aren't the only crawlers)

The combination of GoldenPagesBot and robots-respect applies to the background book sweep only. The interactive flows differ in crawler, User-Agent, or robots handling:

| Flow | Crawler | User-Agent | robots.txt |
| --- | --- | --- | --- |
| Background book sweep (prospect-crawl-enrich, cron) | crawl-website.js | GoldenPagesBot | respected (bails if disallowed) |
| Advisor crawl_website (Roam / Discovery drill-in) | crawl-website.js | GoldenPagesBot | not respected (rep asked to see one site) |
| Deal-brief compose | crawl-website.js | GoldenPagesBot | not respected |
| /prospect skill quick crawl (Step 1c) | WebFetch (Claude Code) | WebFetch UA (not GoldenPagesBot) | skipped (single homepage fetch for brand assets) |
| InSites Digital Footprint (submit_insites_df) | InSites/Yext 3rd-party | their crawler | their policy |
| LRC / run_serp_grid | SerpAPI (no site crawl) | n/a | n/a |

So: a quick /prospect run is not GoldenPagesBot and does not read robots; the Discovery/advisor crawl is GoldenPagesBot but does not respect robots; only the scaled background crawl is both.

Keyword identification is a separate pipeline (not these scripts)

The "3 relevant keywords" in /prospect and LRC come from the keyword stack, not the website crawl:

  • prospect_intel (proven KEYWORD_INTELLIGENCE terms, commercial-intent first) + suggest_insites_params,
  • the GP listing's keywords field,
  • CATEGORY_BENCHMARKS + Ahrefs volumes,
  • LRC submission = manual_keywords × locations cross-product.

The crawl only contributes on-page seeds (meta title/keywords, H1/nav labels, service-page names) that can inform those suggestions — it is not the keyword engine and does not rank or select keywords.

Current status

  • The prospect Deep crawl is paused: its multi-page burst tripped a site's bot-detection (see CRAWL_ENRICH_HANDOFF_2026-05-24.md item #1). Live prospect crawls are Shallow only, so crawl_email currently comes from the homepage alone until Deep is re-enabled with a throttled plan.
  • Email extraction is on-page only. Off-page (backlinks, organic rank, GBP, reviews, social reach, ad spend) stays in the InSites Digital Footprint audit (submit_insites_df).

FCR Dashboard documentation · generated from docs/ · keep counts verified, not guessed.
