Website Crawl — states, depth & request fingerprint
What our own crawler fetches, to what depth, how it identifies itself, and which surface each crawl feeds. Code: worker/src/handlers/crawl-website.js (crawlWebsite()), driven by worker/src/handlers/prospect-crawl-enrich.js, compose-deal-brief.js, and the advisor crawl_website tool.
Last updated 2026-05-24.
Request fingerprint (identical across all states)
| Property | Value |
|---|---|
| User-Agent | Mozilla/5.0 (compatible; GoldenPagesBot/1.0; +https://www.goldenpages.ie/bot) |
| robots.txt token | goldenpagesbot (longest-prefix group match, else *; Allow beats Disallow on a tie, */$ wildcards honoured) |
| Accept | text/html,application/xml,text/plain |
| Redirects | followed |
| Timeouts | homepage 12s · each asset/sitemap file 5s · whole sitemap-index walk 6s |
| Caps | ≤25 sitemap files · ≤5,000 URLs counted · ≤12 inner pages sampled (deep) |
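The robots.txt rule precedence in the table (longest match wins, Allow beats Disallow on a tie, `*` and `$` wildcards honoured) can be sketched roughly as below. This is an illustration only, not the code in crawl-website.js: `patternToRegExp`, `isAllowed` and the `{ type, pattern }` rule shape are hypothetical names, and user-agent group selection is left out.

```javascript
// Convert a robots.txt path pattern into a RegExp.
// "*" matches any run of characters; a trailing "$" anchors the end of the path.
function patternToRegExp(pattern) {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped + (anchored ? "$" : ""));
}

// rules: [{ type: "allow" | "disallow", pattern: string }] for the group
// that was already matched to our token (hypothetical shape).
function isAllowed(rules, path) {
  let best = null;
  for (const rule of rules) {
    if (!rule.pattern) continue; // an empty pattern restricts nothing
    if (!patternToRegExp(rule.pattern).test(path)) continue;
    const longer = best === null || rule.pattern.length > best.pattern.length;
    const tieWonByAllow =
      best !== null &&
      rule.pattern.length === best.pattern.length &&
      rule.type === "allow";
    if (longer || tieWonByAllow) best = rule;
  }
  // No matching rule means the path is allowed.
  return best === null || best.type === "allow";
}
```

With a group containing `Disallow: /private` and `Allow: /private/public`, a path under `/private/public/` is allowed because the Allow pattern is longer, while any other path under `/private` is blocked.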
We identify honestly as a known Irish directory crawler — no impersonation. We report which AI crawlers a site blocks (gptbot, google-extended, ccbot, claudebot, anthropic-ai, perplexitybot, bytespider) as an "AI-search aware" signal; we never pretend to be them.
Always fetched first, in parallel with the homepage (≈0 added wall-time): robots.txt, llms.txt, sitemap.xml, sitemap_index.xml, wp-sitemap.xml → 6 requests baseline.
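A minimal sketch of that six-request baseline, using the timeouts from the fingerprint table. The names `baselineRequests` and `fetchBaseline` are hypothetical, the homepage is assumed to be the origin root, and `fetchFn` is injected so the sketch can be exercised without network access; the real handler may be structured differently.

```javascript
// The five well-known assets named above, requested alongside the homepage.
const WELL_KNOWN_ASSETS = [
  "/robots.txt",
  "/llms.txt",
  "/sitemap.xml",
  "/sitemap_index.xml",
  "/wp-sitemap.xml",
];
const HOMEPAGE_TIMEOUT_MS = 12000; // homepage 12s
const ASSET_TIMEOUT_MS = 5000; // each asset/sitemap file 5s

// Returns the homepage plus the five assets: 6 requests in total.
function baselineRequests(siteUrl) {
  const origin = new URL(siteUrl).origin;
  return [
    { url: origin + "/", timeoutMs: HOMEPAGE_TIMEOUT_MS },
    ...WELL_KNOWN_ASSETS.map((path) => ({
      url: origin + path,
      timeoutMs: ASSET_TIMEOUT_MS,
    })),
  ];
}

// Start every request at once; allSettled keeps one failed or timed-out
// asset from rejecting the whole batch.
function fetchBaseline(siteUrl, fetchFn) {
  return Promise.allSettled(
    baselineRequests(siteUrl).map(({ url, timeoutMs }) =>
      fetchFn(url, { signal: AbortSignal.timeout(timeoutMs) })
    )
  );
}
```

Because all six requests start together, the wall-time is bounded by the slowest one (at most the 12s homepage timeout) rather than by the sum.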
The three crawl states (depth tiers)
| State | What it fetches | Page count | Inner pages | ~Requests |
|---|---|---|---|---|
| 1. Shallow | homepage + 5 well-known assets. Sitemap is probed only: if it's a `<sitemapindex>` we record how many child sitemaps exist but don't fetch them. | `null` on index sites (flat `<urlset>` still counts) | none | ~6 |
| 2. Full-sitemap | Shallow + BFS-walk the sitemap index (child sitemaps in parallel batches, within the 25-file / 6s / 5,000-URL caps). | exact `page_count` + `content_page_count` (taxonomy archives stripped) + `by_type` | none | 6 + up to 25 |
| 3. Deep | Full-sitemap + sample up to 12 inner pages (prioritised: service → location → about → contact → pricing → product → blog), batches of 4, paced by `crawl-delay` when respecting robots. | same as Full-sitemap | up to 12: per-type counts, cross-page NAP/social/schema/tracking | 6 + ≤25 + ≤12 |
Depth is set by two params: deep (→ inner-page sampling, forces full sitemap) and sitemap_depth (shallow vs full; ignored when deep:true).
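As a rough sketch, the two params resolve to a state and a worst-case request budget as follows. `resolveState` and `maxRequests` are hypothetical helpers, and the `"full"` default for an omitted `sitemap_depth` is an assumption read off the `(full)` entry for the advisor's default call.

```javascript
// Caps taken from the fingerprint and state tables.
const CAPS = { baseline: 6, sitemapFiles: 25, innerPages: 12 };

// deep:true always yields Deep and forces the full sitemap walk;
// otherwise sitemap_depth chooses between Shallow and Full-sitemap.
// Assumption: an omitted sitemap_depth behaves like "full".
function resolveState({ deep = false, sitemap_depth = "full" } = {}) {
  if (deep) return "deep";
  return sitemap_depth === "shallow" ? "shallow" : "full";
}

// Upper bound on requests per crawl for each state.
function maxRequests(state) {
  if (state === "shallow") return CAPS.baseline;
  if (state === "full") return CAPS.baseline + CAPS.sitemapFiles;
  return CAPS.baseline + CAPS.sitemapFiles + CAPS.innerPages;
}
```

Under these caps the ceilings are 6 requests for Shallow, 6 + 25 = 31 for Full-sitemap, and 6 + 25 + 12 = 43 for Deep.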
What each state extracts for email/phone
- Homepage NAP (`tel:` phones + emails via `mailto:` AND plain-text, through the shared `isJunkEmail` filter) is captured on every state: it is a function of the homepage HTML we already have, so it needs no extra fetch.
- Inner-page emails/phones are added only in Deep (sampled contact/about/etc. pages).
- The promoted `crawl_email` (BQ, prospect states only) is chosen by `pickCrawlEmail(crawl_root)`: same-domain-role > same-domain > freemail, dropping off-domain non-free addresses and pages with >5 distinct emails (aggregator/directory guard). The full list is kept in `crawl_raw`.
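The ranking described for `pickCrawlEmail` could look roughly like the sketch below. It is not the real function: `pickEmail` and `rankEmail` are hypothetical names, the freemail and role lists are placeholder assumptions, `siteDomain` is assumed to be a bare domain without `www.`, and the input is treated as the emails found on a single page.

```javascript
// Assumed lists for illustration only; the real lists may differ.
const FREEMAIL_DOMAINS = new Set(["gmail.com", "yahoo.com", "hotmail.com", "outlook.com"]);
const ROLE_LOCAL_PARTS = new Set(["info", "sales", "hello", "contact", "enquiries", "office"]);
const MAX_DISTINCT_EMAILS = 5;

// Lower rank is better. Returns null for addresses that should be dropped.
function rankEmail(email, siteDomain) {
  const [local, domain] = email.toLowerCase().split("@");
  if (domain === siteDomain) return ROLE_LOCAL_PARTS.has(local) ? 0 : 1;
  if (FREEMAIL_DOMAINS.has(domain)) return 2;
  return null; // off-domain and not freemail
}

function pickEmail(emails, siteDomain) {
  const distinct = [...new Set(emails.map((e) => e.toLowerCase()))];
  // Aggregator/directory guard: too many distinct addresses on one page.
  if (distinct.length > MAX_DISTINCT_EMAILS) return null;
  const ranked = distinct
    .map((email) => ({ email, rank: rankEmail(email, siteDomain) }))
    .filter((entry) => entry.rank !== null)
    .sort((a, b) => a.rank - b.rank);
  return ranked.length > 0 ? ranked[0].email : null;
}
```

For a site on `acme.ie` (a made-up domain), a page listing `joe@gmail.com`, `mary@acme.ie` and `info@acme.ie` would yield `info@acme.ie`, since a same-domain role address outranks the other two.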
robots posture (orthogonal to depth — set per caller)
- `respect_robots:true` (the scaled prospect crawl): read `robots.txt` first and bail before touching the site if our bot is disallowed (returns `robots_blocked:true`, itself a recorded signal); deep sampling filters candidate URLs by robots and paces by `crawl-delay`.
- `respect_robots:false` (advisor + deal-brief): a rep explicitly asked to see one specific site, so we fetch the homepage regardless. `robots` / `ai_bots_blocked` are still reported.
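A sketch of how the robots posture could gate the homepage fetch and shape the Deep sample. All function names here are hypothetical; placing unknown page types last and pausing between batches (rather than between individual requests) are assumptions, not confirmed behaviour.

```javascript
const TYPE_PRIORITY = ["service", "location", "about", "contact", "pricing", "product", "blog"];
const MAX_INNER_PAGES = 12;
const BATCH_SIZE = 4;

// With respect_robots:true a disallowed homepage means we never fetch the site.
function shouldFetchHomepage({ respectRobots, homepageAllowed }) {
  return !respectRobots || homepageAllowed;
}

// Unknown types sort after every known type (assumption).
function priority(type) {
  const index = TYPE_PRIORITY.indexOf(type);
  return index === -1 ? TYPE_PRIORITY.length : index;
}

// candidates: [{ url, type }]. isAllowed(url) is consulted only when
// respectRobots is true. Returns batches of at most BATCH_SIZE pages.
function planDeepSample(candidates, { respectRobots, isAllowed }) {
  const selected = candidates
    .filter((c) => !respectRobots || isAllowed(c.url))
    .sort((a, b) => priority(a.type) - priority(b.type))
    .slice(0, MAX_INNER_PAGES);
  const batches = [];
  for (let i = 0; i < selected.length; i += BATCH_SIZE) {
    batches.push(selected.slice(i, i + BATCH_SIZE));
  }
  return batches;
}

// crawl-delay pacing applies only when robots is respected.
function delayBetweenBatchesMs({ respectRobots, crawlDelaySeconds }) {
  return respectRobots && crawlDelaySeconds > 0 ? crawlDelaySeconds * 1000 : 0;
}
```

Keeping the robots check as a filter ahead of the sort means a blocked service page does not consume one of the 12 sample slots.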
Who runs which state — and where the result goes
| Caller | deep | sitemap_depth | respect_robots | State | Result written to |
|---|---|---|---|---|---|
| `prospect-crawl-enrich` shallow (every-minute cron) | false | shallow | true | 1 Shallow | `PROSPECT_LISTINGS` MERGE on `crawl_root` (fans to all sibling listings): `crawl_*` + `fit_*` + `crawl_email`/`crawl_phone` + `crawl_raw` |
| `prospect-crawl-enrich` deep (PAUSED 2026-05-24) | true | full | true | 3 Deep | same MERGE + `crawl_deep_at` + service/location page counts |
| `compose-deal-brief` | false | shallow | false | 1 Shallow | deal-brief KV (`deal-brief:<id>`), surfaced in the brief's website block |
| advisor `crawl_website` (default) | false | (full) | false | 2 Full-sitemap | chat response; optionally vectorized into `fcr-site-portfolio` |
| advisor `crawl_website` `deep:true` | true | full | false | 3 Deep | chat response; optionally vectorized |
Only the two prospect-crawl-enrich states write to PROSPECT_LISTINGS. The crawl is free and runs on every row with a real website (it is NOT gated by enrich_enabled — that gate is for paid SerpAPI/Pleper/Ahrefs enrichment only).
Extracted parameters (full field inventory)
What crawlWebsite() returns, which depth state populates it, and whether the prospect crawl persists it to PROSPECT_LISTINGS. Homepage-derived fields are present in all three states (S/F/D); page counts need Full or Deep; a handful are Deep-only.
| Field | What | State | Persisted to PROSPECT_LISTINGS |
|---|---|---|---|
| `title` | `<title>` | S/F/D | no (returned only; not a column, not in `crawl_raw`) |
| `metaDescription` | `<meta name=description>` | S/F/D | no |
| `metaKeywords` | `<meta name=keywords>` | S/F/D | no |
| `textPreview` | first 1,500 chars of body text | S/F/D | no |
| `wordCount` | homepage word count | S/F/D | `crawl_word_count` |
| `technical.is_https` | SSL | S/F/D | `crawl_has_ssl` |
| `technical.has_viewport` | mobile-ready | S/F/D | `crawl_mobile_ready` |
| `technical.has_favicon` | favicon present | S/F/D | no |
| `technical.has_og` / `og_title` | Open Graph tags | S/F/D | no |
| `technical.has_canonical` | canonical link | S/F/D | no |
| `technical.h1` | first H1 text | S/F/D | no |
| `features.detected_cms` | WordPress/Wix/Shopify/… | S/F/D | `crawl_cms` (+ derived `crawl_is_spa`, `crawl_is_diy_builder`) |
| `features.has_ecommerce` | product schema / Woo / Shopify / cart markup | S/F/D | `crawl_has_ecommerce` |
| `features.has_blog` | blog present | S/F/D | `crawl_has_blog` |
| `features.has_booking` | online booking (Calendly/Setmore/…) | S/F/D | `crawl_has_booking` |
| `features.has_chat` | live chat widget | S/F/D | `crawl_has_chat` |
| `features.has_forms` | contact/enquiry form | S/F/D | `crawl_has_form` |
| `features.has_gallery` | gallery/portfolio | S/F/D | no |
| `features.has_phone` | phone on page | S/F/D | no (see `crawl_phone`) |
| `technologies[]` | Wappalyzer-style stack | S/F/D | `crawl_raw` (CMS/ads/GA/pixel promoted to flags below) |
| `technologies` → analytics/ads/pixel | GA / Google Ads / Meta Pixel tags | S/F/D | `crawl_has_analytics`, `crawl_has_google_ads_tag`, `crawl_has_fb_pixel` |
| `structuredData.has_json_ld` / `types[]` | JSON-LD schema | S/F/D | `crawl_has_schema` (+ types in `crawl_raw`) |
| `paymentProcessors[]` | Stripe/PayPal/Klarna/… | S/F/D | `crawl_raw` |
| `aiReadiness` | `llms.txt` / `robots.txt` / sitemap / structured-data / `ai_bots_blocked[]` | S/F/D | `crawl_ai_ready` (derived) + `ai_bots_blocked` in `crawl_raw` |
| `robots` | found / homepage_allowed / crawl_delay / ai_bots_blocked | S/F/D | via `crawl_error='robots_blocked'` when disallowed |
| `social{}` | Facebook/Insta/LinkedIn/… profile links | S/F/D | `crawl_raw` |
| `contact.phones[]` / `contact.emails[]` | homepage NAP (`mailto:` + plain-text) | S/F/D | `crawl_phone` / `crawl_email` (ranked picker) |
| `pageCount` | total sitemap URLs | F/D (S: `null` on index sites) | (raw count in `crawl_raw`) |
| `contentPageCount` | real pages, taxonomy archives stripped | F/D | `crawl_page_count` |
| `sitemap.by_type` | per-type breakdown (post/page/product/tag/…) | F/D | `crawl_raw` |
| `deep.service_page_count` | # service pages | D only | `crawl_service_page_count` |
| `deep.location_page_count` | # location pages | D only | `crawl_location_page_count` |
| `deep.has_about/has_contact/has_pricing` | trust-page presence | D only | `crawl_raw` |
| `deep.emails/phones` | NAP across sampled pages | D only | feeds `crawl_email`/`crawl_phone` |
| `deep.tracking` | GA / Google Ads / FB pixel across pages | D only | (promoted to the flags above) |
| `deep.sample[]` | per-page `{url, type, wordCount}` | D only | `crawl_raw` (trimmed) |
| `fit_sitepro/seo/estore/ads/ads_crosssell/social` | derived solution-fit flags | S/F/D | `fit_*` columns |
Note: title / metaDescription / metaKeywords are extracted and returned to the advisor + deal-brief, but are not stored on PROSPECT_LISTINGS (no column, not in crawl_raw). The richer presence-fingerprint fields (trust-page composite, lead-capture vendor, freshness, performance) are proposed, not built — see docs/crawl-presence-fingerprint-brief.md.
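To make the field-to-column mapping concrete, here is a minimal sketch covering only the direct one-to-one rows of the inventory. `toProspectColumns` is a hypothetical helper; the nested result shape is inferred from the dotted field names, and storing `null` for fields a state does not populate is an assumption.

```javascript
// Maps a crawlWebsite()-style result onto PROSPECT_LISTINGS columns.
// Derived columns (crawl_is_spa, crawl_ai_ready, fit_*, ...) are omitted.
function toProspectColumns(result) {
  const technical = result.technical ?? {};
  const features = result.features ?? {};
  return {
    crawl_word_count: result.wordCount ?? null,
    crawl_has_ssl: technical.is_https ?? null,
    crawl_mobile_ready: technical.has_viewport ?? null,
    crawl_cms: features.detected_cms ?? null,
    crawl_has_ecommerce: features.has_ecommerce ?? null,
    crawl_has_blog: features.has_blog ?? null,
    crawl_has_booking: features.has_booking ?? null,
    crawl_has_chat: features.has_chat ?? null,
    crawl_has_form: features.has_forms ?? null, // note: has_forms -> crawl_has_form
    crawl_has_schema: result.structuredData?.has_json_ld ?? null,
    // Full/Deep only: absent on a Shallow crawl.
    crawl_page_count: result.contentPageCount ?? null,
    // Deep only.
    crawl_service_page_count: result.deep?.service_page_count ?? null,
    crawl_location_page_count: result.deep?.location_page_count ?? null,
  };
}
```

A Shallow result fills the homepage-derived columns and leaves the page-count and Deep-only columns empty, which matches the S/F/D column of the table.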
Which crawler runs in which flow (important: crawl-website.js is not the only crawler)
GoldenPagesBot + robots-respect applies to the background book sweep only. The interactive flows use different paths:
| Flow | Crawler | User-Agent | robots.txt |
|---|---|---|---|
| Background book sweep (`prospect-crawl-enrich`, cron) | `crawl-website.js` | GoldenPagesBot | respected (bails if disallowed) |
| Advisor `crawl_website` (Roam / Discovery drill-in) | `crawl-website.js` | GoldenPagesBot | not respected (rep asked to see one site) |
| Deal-brief compose | `crawl-website.js` | GoldenPagesBot | not respected |
| `/prospect` skill quick crawl (Step 1c) | WebFetch (Claude Code) | WebFetch UA (not GoldenPagesBot) | skipped (single homepage fetch for brand assets) |
| InSites Digital Footprint (`submit_insites_df`) | InSites/Yext 3rd-party | their crawler | their policy |
| LRC / `run_serp_grid` | SerpAPI (no site crawl) | n/a | n/a |
So: a quick /prospect run is not GoldenPagesBot and does not read robots; the Discovery/advisor crawl is GoldenPagesBot but does not respect robots; only the scaled background crawl is both.
Keyword identification is a separate pipeline (not these scripts)
The "3 relevant keywords" in /prospect and LRC come from the keyword stack, not the website crawl:
- `prospect_intel` (proven `KEYWORD_INTELLIGENCE` terms, commercial-intent first) + `suggest_insites_params`,
- the GP listing's `keywords` field,
- `CATEGORY_BENCHMARKS` + Ahrefs volumes,
- LRC submission = `manual_keywords × locations` cross-product.
The crawl only contributes on-page seeds (meta title/keywords, H1/nav labels, service-page names) that can inform those suggestions — it is not the keyword engine and does not rank or select keywords.
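The LRC submission in the list above is a plain cross-product, which can be sketched as below. `crossProduct` and the `{ keyword, location }` pair shape are hypothetical; only the `manual_keywords × locations` relationship comes from this page.

```javascript
// Pair every manual keyword with every location.
function crossProduct(manualKeywords, locations) {
  return manualKeywords.flatMap((keyword) =>
    locations.map((location) => ({ keyword, location }))
  );
}

// Example values are made up for illustration.
const pairs = crossProduct(
  ["plumber", "boiler repair", "emergency plumber"],
  ["Dublin", "Cork"]
);
```

Three keywords across two locations give 3 × 2 = 6 pairs, so the submission size grows multiplicatively with either list.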
Current status
- Deep is paused (its multi-page burst tripped a site's bot-detection; see `CRAWL_ENRICH_HANDOFF_2026-05-24.md` item #1). Live crawls are Shallow only, so `crawl_email` currently comes from the homepage alone until deep is re-enabled with a throttled plan.
- Email extraction is on-page only. Off-page (backlinks, organic rank, GBP, reviews, social reach, ad spend) stays in the InSites Digital Footprint audit (`submit_insites_df`).
FCR Dashboard documentation · generated from docs/ · keep counts verified, not guessed.