Website Crawl — states, depth & request fingerprint

What our own crawler fetches, to what depth, how it identifies itself, and which surface each crawl feeds. Code: worker/src/handlers/crawl-website.js (crawlWebsite()), driven by worker/src/handlers/prospect-crawl-enrich.js, compose-deal-brief.js, and the advisor crawl_website tool.

Last updated 2026-05-24.

Request fingerprint (identical across all states)


User-Agent	`Mozilla/5.0 (compatible; GoldenPagesBot/1.0; +https://www.goldenpages.ie/bot)`
robots.txt token	`goldenpagesbot` (longest-prefix group match, else `*`; Allow beats Disallow on a tie, `*`/`$` wildcards honoured)
Accept	`text/html,application/xml,text/plain`
Redirects	followed
Timeouts	homepage 12s · each asset/sitemap file 5s · whole sitemap-index walk 6s
Caps	≤25 sitemap files · ≤5,000 URLs counted · ≤12 inner pages sampled (deep)
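
The robots.txt matching rules in the table can be sketched as follows. This is an illustration, not the code in crawl-website.js: `patternToRegExp`, `pickGroup`, `isAllowed` and the `{ agents, rules }` group shape are hypothetical names, and "longest-prefix group match" is read here as "the group whose user-agent value is the longest prefix of our token".

```javascript
// Sketch only: all names and the parsed-group shape are assumptions.

// Turn a robots.txt path pattern into a RegExp anchored at the path start.
// "*" matches any run of characters; a trailing "$" anchors the end.
function patternToRegExp(pattern) {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped + (anchored ? "$" : ""));
}

// Pick the group with the longest user-agent value that prefixes our token,
// otherwise fall back to the "*" group (or null if there is none).
function pickGroup(groups, token) {
  let best = null;
  for (const group of groups) {
    for (const agent of group.agents) {
      const a = agent.toLowerCase();
      if (a !== "*" && token.startsWith(a) && (!best || a.length > best.len)) {
        best = { len: a.length, group };
      }
    }
  }
  if (best) return best.group;
  return groups.find((g) => g.agents.includes("*")) || null;
}

// Longest matching rule wins; on equal length, Allow beats Disallow.
function isAllowed(group, path) {
  if (!group) return true; // no applicable group: nothing is disallowed
  let winner = { len: -1, allow: true };
  for (const rule of group.rules) {
    if (rule.path === "") continue; // an empty rule matches nothing
    if (!patternToRegExp(rule.path).test(path)) continue;
    const len = rule.path.length;
    if (len > winner.len || (len === winner.len && rule.allow)) {
      winner = { len, allow: rule.allow };
    }
  }
  return winner.allow;
}
```

With a `goldenpagesbot` group holding `Disallow: /private` and `Allow: /private`, the tie resolves to allowed; a `Disallow: /*.pdf$` rule blocks `/docs/a.pdf` but not `/docs/a.pdf?x=1`.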

We identify honestly as a known Irish directory crawler — no impersonation. We report which AI crawlers a site blocks (gptbot, google-extended, ccbot, claudebot, anthropic-ai, perplexitybot, bytespider) as an "AI-search aware" signal; we never pretend to be them.

Always fetched first, in parallel with the homepage (≈0 added wall-time): robots.txt, llms.txt, sitemap.xml, sitemap_index.xml, wp-sitemap.xml. Together with the homepage, that is a baseline of 6 requests.
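
A minimal sketch of that parallel baseline, assuming a fetch-compatible function. `fetchBaseline`, its return shape and the injectable `fetchImpl`/`timeouts` parameters are hypothetical (they make the example runnable without network access); the 12s and 5s defaults come from the Timeouts row above.

```javascript
// Sketch only: function name, parameters and return shape are assumptions.
const ASSET_PATHS = [
  "/robots.txt",
  "/llms.txt",
  "/sitemap.xml",
  "/sitemap_index.xml",
  "/wp-sitemap.xml",
];

async function fetchBaseline(
  origin,
  fetchImpl = globalThis.fetch,
  timeouts = { homepageMs: 12000, assetMs: 5000 }
) {
  const get = (path, ms) =>
    fetchImpl(origin + path, { redirect: "follow", signal: AbortSignal.timeout(ms) });

  // All six requests start together, so the assets add almost no wall-time.
  // allSettled keeps one failed asset from rejecting the whole batch.
  const [homepage, ...assets] = await Promise.allSettled([
    get("/", timeouts.homepageMs),
    ...ASSET_PATHS.map((path) => get(path, timeouts.assetMs)),
  ]);

  const result = {
    homepage: homepage.status === "fulfilled" ? homepage.value : null,
    assets: {},
  };
  ASSET_PATHS.forEach((path, i) => {
    result.assets[path] = assets[i].status === "fulfilled" ? assets[i].value : null;
  });
  return result;
}
```

A missing or timed-out asset ends up as `null` in `assets`, so callers can treat "no llms.txt" as a signal rather than an error.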

The three crawl states (depth tiers)

State	What it fetches	Page count	Inner pages	~Requests
1. Shallow	homepage + 5 well-known assets. Sitemap is probed only — if it's a `<sitemapindex>` we record how many child sitemaps exist but don't fetch them.	`null` on index sites (flat `<urlset>` still counts)	none	~6
2. Full-sitemap	Shallow + BFS-walk the sitemap index (child sitemaps in parallel batches, within the 25-file / 6s / 5,000-URL caps).	exact `page_count` + `content_page_count` (taxonomy archives stripped) + `by_type`	none	6 + up to 25
3. Deep	Full-sitemap + sample up to 12 inner pages (prioritised: service → location → about → contact → pricing → product → blog), batches of 4, paced by `crawl-delay` when respecting robots.	same as Full-sitemap	up to 12: per-type counts, cross-page NAP/social/schema/tracking	6 + ≤25 + ≤12
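
The capped sitemap walk in the Full-sitemap row could look roughly like this. It is a sketch under assumptions: `walkSitemaps`, the injected `fetchSitemap(url)` helper (resolving to `{ sitemaps, urls }` for one file), the batch size of 5 and the `truncated` flag are all hypothetical, and the sketch counts the root file towards the 25-file cap, which the real crawler may not. The per-file 5s timeout is left to `fetchSitemap`.

```javascript
// Sketch only: names, batch size and cap accounting are assumptions.
async function walkSitemaps(
  rootUrl,
  fetchSitemap,
  caps = { files: 25, urls: 5000, ms: 6000, batch: 5 }
) {
  const deadline = Date.now() + caps.ms;
  const seen = new Set([rootUrl]);
  const queue = [rootUrl];
  let filesFetched = 0;
  let urlCount = 0;

  // Breadth-first: each round fetches one parallel batch, then checks the caps.
  while (
    queue.length > 0 &&
    filesFetched < caps.files &&
    urlCount < caps.urls &&
    Date.now() < deadline
  ) {
    const size = Math.min(caps.batch, caps.files - filesFetched);
    const batch = queue.splice(0, size);
    filesFetched += batch.length;

    const results = await Promise.allSettled(batch.map((url) => fetchSitemap(url)));
    for (const r of results) {
      if (r.status !== "fulfilled") continue; // a failed child is skipped
      urlCount = Math.min(caps.urls, urlCount + r.value.urls.length);
      for (const child of r.value.sitemaps) {
        if (!seen.has(child)) {
          seen.add(child);
          queue.push(child);
        }
      }
    }
  }

  // truncated: some child sitemaps were never fetched because a cap was hit.
  return { filesFetched, urlCount, truncated: queue.length > 0 };
}
```

Checking the caps once per batch keeps the walk bounded while still letting the files inside a batch download in parallel.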

Depth is set by two params: deep (→ inner-page sampling, forces full sitemap) and sitemap_depth (shallow vs full; ignored when deep:true).
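
How the two params resolve to a state, with the request ceiling from the ~Requests column, can be sketched like this. `resolveState` and its return shape are hypothetical; treating any `sitemap_depth` other than `"full"` as shallow is an assumption.

```javascript
// Sketch only: function name and return shape are assumptions.
const BASELINE = 6; // homepage + 5 well-known assets
const MAX_SITEMAP_FILES = 25;
const MAX_INNER_PAGES = 12;

function resolveState({ deep = false, sitemap_depth } = {}) {
  // deep wins: it forces the full sitemap walk and ignores sitemap_depth.
  if (deep) {
    return { state: "deep", maxRequests: BASELINE + MAX_SITEMAP_FILES + MAX_INNER_PAGES };
  }
  if (sitemap_depth === "full") {
    return { state: "full-sitemap", maxRequests: BASELINE + MAX_SITEMAP_FILES };
  }
  return { state: "shallow", maxRequests: BASELINE };
}
```

So `{ deep: true, sitemap_depth: "shallow" }` still resolves to Deep, with a ceiling of 6 + 25 + 12 = 43 requests.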

What each state extracts for email/phone

Homepage NAP (tel: phones + emails via mailto: AND plain-text, through the shared isJunkEmail filter) is captured on every state — it's a function of the homepage HTML we already have, no extra fetch.
Inner-page emails/phones are added only in Deep (sampled contact/about/etc. pages).
The promoted crawl_email (BQ, prospect states only) is chosen by pickCrawlEmail(crawl_root): same-domain-role > same-domain > freemail, dropping off-domain non-free addresses and pages with >5 distinct emails (aggregator/directory guard). The full list is kept in crawl_raw.
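
The ranking described for crawl_email can be illustrated as below. This is not the real pickCrawlEmail: `rankCrawlEmail`, the `pages` input shape, and the role-prefix and freemail lists are hypothetical stand-ins, and the site domain is compared exactly (no `www.` or subdomain handling).

```javascript
// Sketch only: names, input shape and both lists are assumptions.
const ROLE_LOCALS = new Set(["info", "sales", "contact", "hello", "office", "enquiries"]);
const FREEMAIL = new Set(["gmail.com", "outlook.com", "hotmail.com", "yahoo.com"]);

// pages: [{ emails: string[] }], one entry per crawled page.
function rankCrawlEmail(pages, siteDomain) {
  const scored = [];
  for (const page of pages) {
    const distinct = [...new Set(page.emails.map((e) => e.toLowerCase()))];
    if (distinct.length > 5) continue; // aggregator/directory guard

    for (const email of distinct) {
      const [local, domain] = email.split("@");
      if (!local || !domain) continue;

      let score;
      if (domain === siteDomain) {
        score = ROLE_LOCALS.has(local) ? 3 : 2; // same-domain-role > same-domain
      } else if (FREEMAIL.has(domain)) {
        score = 1; // freemail ranks last
      } else {
        continue; // off-domain, non-free: dropped
      }
      scored.push({ email, score });
    }
  }
  // Array.prototype.sort is stable, so ties keep first-seen order.
  scored.sort((a, b) => b.score - a.score);
  return scored.length > 0 ? scored[0].email : null;
}
```

For a page listing `joe@acme.ie`, `INFO@acme.ie`, `acme@gmail.com` and `x@agency.com` on `acme.ie`, the sketch returns `info@acme.ie`.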

robots posture (orthogonal to depth — set per caller)

respect_robots:true (the scaled prospect crawl): read robots.txt first and bail before touching the site if our bot is disallowed (returns robots_blocked:true — itself a recorded signal); deep sampling filters candidate URLs by robots and paces by crawl-delay.
respect_robots:false (advisor + deal-brief): a rep explicitly asked to see one specific site, so we fetch the homepage regardless. robots/ai_bots_blocked are still reported.
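
How the robots posture interacts with Deep sampling can be sketched as follows. `planDeepSample`, the `{ url, type }` candidate shape and the injected `isAllowed(url)` check are hypothetical; dropping pages of unlisted types and applying crawl-delay between batches (rather than between single requests) are assumptions.

```javascript
// Sketch only: names, candidate shape and pacing granularity are assumptions.
const PRIORITY = ["service", "location", "about", "contact", "pricing", "product", "blog"];

function planDeepSample(
  candidates,
  { respect_robots, isAllowed = () => true, crawlDelaySec = 0 }
) {
  const picked = candidates
    .filter((c) => PRIORITY.includes(c.type))
    // Only the robots-respecting caller filters candidates by robots.txt.
    .filter((c) => !respect_robots || isAllowed(c.url))
    .sort((a, b) => PRIORITY.indexOf(a.type) - PRIORITY.indexOf(b.type))
    .slice(0, 12); // at most 12 inner pages

  const batches = [];
  for (let i = 0; i < picked.length; i += 4) {
    batches.push(picked.slice(i, i + 4)); // batches of 4
  }

  return { batches, pauseMs: respect_robots ? crawlDelaySec * 1000 : 0 };
}
```

The same candidate list therefore yields a different sample per caller: the prospect crawl drops disallowed URLs and pauses, while the advisor path samples purely by priority.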

Who runs which state — and where the result goes

Caller	deep	sitemap_depth	respect_robots	State	Result written to
`prospect-crawl-enrich` shallow (every-minute cron)	false	shallow	true	1 Shallow	`PROSPECT_LISTINGS` MERGE on `crawl_root` (fans to all sibling listings): `crawl_*` + `fit_*` + `crawl_email`/`crawl_phone` + `crawl_raw`
`prospect-crawl-enrich` deep (PAUSED 2026-05-24)	true	full	true	3 Deep	same MERGE + `crawl_deep_at` + service/location page counts
`compose-deal-brief`	false	shallow	false	1 Shallow	deal-brief KV (`deal-brief:<id>`), surfaced in the brief's `website` block
advisor `crawl_website` (default)	false	(full)	false	2 Full-sitemap	chat response; optionally vectorized into `fcr-site-portfolio`
advisor `crawl_website` `deep:true`	true	full	false	3 Deep	chat response; optionally vectorized

Only the two prospect-crawl-enrich states write to PROSPECT_LISTINGS. The crawl is free and runs on every row with a real website (it is NOT gated by enrich_enabled — that gate is for paid SerpAPI/Pleper/Ahrefs enrichment only).

Extracted parameters (full field inventory)

What crawlWebsite() returns, which depth state populates it, and whether the prospect crawl persists it to PROSPECT_LISTINGS. Homepage-derived fields are present in all three states (S/F/D); page counts need Full or Deep; a handful are Deep-only.

Field	What	State	Persisted to PROSPECT_LISTINGS
`title`	`<title>`	S/F/D	— (returned only; not a column, not in crawl_raw)
`metaDescription`	`<meta name=description>`	S/F/D	—
`metaKeywords`	`<meta name=keywords>`	S/F/D	—
`textPreview`	first 1,500 chars of body text	S/F/D	—
`wordCount`	homepage word count	S/F/D	`crawl_word_count`
`technical.is_https`	SSL	S/F/D	`crawl_has_ssl`
`technical.has_viewport`	mobile-ready	S/F/D	`crawl_mobile_ready`
`technical.has_favicon`	favicon present	S/F/D	—
`technical.has_og` / `og_title`	Open Graph tags	S/F/D	—
`technical.has_canonical`	canonical link	S/F/D	—
`technical.h1`	first H1 text	S/F/D	—
`features.detected_cms`	WordPress/Wix/Shopify/…	S/F/D	`crawl_cms` (+ derived `crawl_is_spa`, `crawl_is_diy_builder`)
`features.has_ecommerce`	product schema / Woo / Shopify / cart markup	S/F/D	`crawl_has_ecommerce`
`features.has_blog`	blog present	S/F/D	`crawl_has_blog`
`features.has_booking`	online booking (Calendly/Setmore/…)	S/F/D	`crawl_has_booking`
`features.has_chat`	live chat widget	S/F/D	`crawl_has_chat`
`features.has_forms`	contact/enquiry form	S/F/D	`crawl_has_form`
`features.has_gallery`	gallery/portfolio	S/F/D	—
`features.has_phone`	phone on page	S/F/D	— (see `crawl_phone`)
`technologies[]`	Wappalyzer-style stack	S/F/D	`crawl_raw` (CMS/ads/GA/pixel promoted to flags below)
`technologies` → analytics/ads/pixel	GA / Google Ads / Meta Pixel tags	S/F/D	`crawl_has_analytics`, `crawl_has_google_ads_tag`, `crawl_has_fb_pixel`
`structuredData.has_json_ld` / `types[]`	JSON-LD schema	S/F/D	`crawl_has_schema` (+ types in `crawl_raw`)
`paymentProcessors[]`	Stripe/PayPal/Klarna/…	S/F/D	`crawl_raw`
`aiReadiness`	llms.txt / robots.txt / sitemap / structured-data / `ai_bots_blocked[]`	S/F/D	`crawl_ai_ready` (derived) + `ai_bots_blocked` in `crawl_raw`
`robots`	found / homepage_allowed / crawl_delay / ai_bots_blocked	S/F/D	via `crawl_error='robots_blocked'` when disallowed
`social{}`	Facebook/Insta/LinkedIn/… profile links	S/F/D	`crawl_raw`
`contact.phones[]` / `contact.emails[]`	homepage NAP (`mailto:` + plain-text)	S/F/D	`crawl_phone` / `crawl_email` (ranked picker)
`pageCount`	total sitemap URLs	F/D (S: null on index sites)	(raw count in `crawl_raw`)
`contentPageCount`	real pages, taxonomy archives stripped	F/D	`crawl_page_count`
`sitemap.by_type`	per-type breakdown (post/page/product/tag/…)	F/D	`crawl_raw`
`deep.service_page_count`	# service pages	D only	`crawl_service_page_count`
`deep.location_page_count`	# location pages	D only	`crawl_location_page_count`
`deep.has_about/has_contact/has_pricing`	trust-page presence	D only	`crawl_raw`
`deep.emails/phones`	NAP across sampled pages	D only	feeds `crawl_email`/`crawl_phone`
`deep.tracking`	GA / Google Ads / FB pixel across pages	D only	(promoted to the flags above)
`deep.sample[]`	per-page {url, type, wordCount}	D only	`crawl_raw` (trimmed)
`fit_sitepro/seo/estore/ads/ads_crosssell/social`	derived solution-fit flags	S/F/D	`fit_*` columns

Note: title / metaDescription / metaKeywords are extracted and returned to the advisor + deal-brief, but are not stored on PROSPECT_LISTINGS (no column, not in crawl_raw). The richer presence-fingerprint fields (trust-page composite, lead-capture vendor, freshness, performance) are proposed, not built — see docs/crawl-presence-fingerprint-brief.md.
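
As an illustration of the "Persisted to" column, the direct field-to-column mappings in the table can be written as a projection. `toProspectColumns` is a hypothetical helper, not the real MERGE code; it covers only the one-to-one mappings and leaves out derived flags, the email/phone picker and `crawl_raw`.

```javascript
// Sketch only: helper name is an assumption; mappings follow the table above.
function toProspectColumns(result) {
  return {
    crawl_word_count: result.wordCount ?? null,
    crawl_has_ssl: result.technical?.is_https ?? null,
    crawl_mobile_ready: result.technical?.has_viewport ?? null,
    crawl_cms: result.features?.detected_cms ?? null,
    crawl_has_ecommerce: result.features?.has_ecommerce ?? null,
    crawl_has_blog: result.features?.has_blog ?? null,
    crawl_has_booking: result.features?.has_booking ?? null,
    crawl_has_chat: result.features?.has_chat ?? null,
    crawl_has_form: result.features?.has_forms ?? null,
    crawl_has_schema: result.structuredData?.has_json_ld ?? null,
    // Null in the Shallow state, where contentPageCount is not populated.
    crawl_page_count: result.contentPageCount ?? null,
    // Deep only.
    crawl_service_page_count: result.deep?.service_page_count ?? null,
    crawl_location_page_count: result.deep?.location_page_count ?? null,
  };
}
```

Fields such as `title` or `metaDescription` simply have no key in the projection, which matches the note above: they are returned to the caller but never reach PROSPECT_LISTINGS.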

Which crawler runs in which flow (important: these scripts aren't the only crawlers)

The combination of GoldenPagesBot and robots-respect applies to the background book sweep only. The interactive flows differ in crawler, User-Agent, or robots handling:

Flow	Crawler	User-Agent	robots.txt
Background book sweep (`prospect-crawl-enrich`, cron)	`crawl-website.js`	GoldenPagesBot	respected (bails if disallowed)
Advisor `crawl_website` (Roam / Discovery drill-in)	`crawl-website.js`	GoldenPagesBot	not respected (rep asked to see one site)
Deal-brief compose	`crawl-website.js`	GoldenPagesBot	not respected
`/prospect` skill quick crawl (Step 1c)	WebFetch (Claude Code)	WebFetch UA — not GoldenPagesBot	skipped (single homepage fetch for brand assets)
InSites Digital Footprint (`submit_insites_df`)	InSites/Yext 3rd-party	their crawler	their policy
LRC / `run_serp_grid`	SerpAPI (no site crawl)	n/a	n/a

So: a quick /prospect run is not GoldenPagesBot and does not read robots; the Discovery/advisor crawl is GoldenPagesBot but does not respect robots; only the scaled background crawl is both.

Keyword identification is a separate pipeline (not these scripts)

The "3 relevant keywords" in /prospect and LRC come from the keyword stack, not the website crawl:

prospect_intel (proven KEYWORD_INTELLIGENCE terms, commercial-intent first) + suggest_insites_params,
the GP listing's keywords field,
CATEGORY_BENCHMARKS + Ahrefs volumes,
LRC submission = manual_keywords × locations cross-product.
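
The cross-product in the last item is simple enough to show directly. `crossProduct` and the `{ keyword, location }` pair shape are hypothetical; the real LRC submission format is not described here.

```javascript
// Sketch only: helper name and pair shape are assumptions.
function crossProduct(manualKeywords, locations) {
  // Every keyword is paired with every location.
  return manualKeywords.flatMap((keyword) =>
    locations.map((location) => ({ keyword, location }))
  );
}
```

Two keywords across three locations produce six pairs, so the grid grows multiplicatively with either list.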

The crawl only contributes on-page seeds (meta title/keywords, H1/nav labels, service-page names) that can inform those suggestions — it is not the keyword engine and does not rank or select keywords.

Current status

Deep is paused: its multi-page burst tripped a site's bot-detection (see CRAWL_ENRICH_HANDOFF_2026-05-24.md item #1). Live prospect crawls are Shallow only, so crawl_email currently comes from the homepage alone until deep is re-enabled with a throttled plan.
Email extraction is on-page only. Off-page (backlinks, organic rank, GBP, reviews, social reach, ad spend) stays in the InSites Digital Footprint audit (submit_insites_df).

FCR Dashboard documentation · generated from docs/ · keep counts verified, not guessed.