Website Crawl — Presence-Fingerprint Fields

For: whoever's extending the website crawl. From: dashboard team (Cathal). Date: 2026-05-24. Code: worker/src/handlers/crawl-website.js (the crawlWebsite() return object). Context: EARLY_STAGE_OPPORTUNITY_SCAN_SPEC_2026-05-24.md + docs/opportunity-scan-class-benchmark-input.md.

Why we're looking at this

We're building an early-stage opportunity scan: for a prospect we haven't spoken to yet, look at their real digital footprint and recommend the right FCR product to lead with. The guiding rule is competitive-relative — a prospect needs to be better than the businesses that dominate their category locally, not the best in the world. So we recommend the minimum product tier that clears the local/metro benchmark for their class.

That model only works if we can answer, per category and area: "what does the dominant set actually have, and where does this prospect fall short?" The website crawl is the fact source for both sides of that comparison — we crawl the prospect, and we crawl the competitors who dominate their patch. So the crawl needs to capture the same digital-presence dimensions, the same way, for everyone, so the comparison is apples-to-apples and we can compute a per-class benchmark to beat.

This brief lists the fields to add and the one architectural rule that makes the comparison valid.

The architectural rule (this is the important bit)

Emit one canonical "presence fingerprint", identical for prospect and competitor, shallow and deep. Deep-only fields come back null on a shallow crawl — never absent. Then computing "the dominant set's average" is just an aggregation over the same keys.

Same extractor for the prospect crawl AND competitor-enrich. competitor-enrich must call the same code, not roll its own — otherwise the two sides aren't comparable.
Emit raw counts, not just booleans, wherever we need a class median to beat: page_count, service_page_count, location_page_count, on_site_review_count.
Add crawl_schema_version (bump on any shape change) so re-crawls and backfills stay comparable, and the brief knows when a re-crawl is due.
Respect the budget split: shallow crawls run in the ~15s compose path, so shallow = homepage-HTML-only (regex/DOM, no extra fetches). Anything needing per-page or extra requests goes to the deep/full path.

Fields to add

Field(s)	Why — which decision / benchmark it feeds	Depth
`service_page_count`, `location_page_count` on shallow (today deep-only)	The core "BizSite vs SitePro Multi vs SP20, sized to service range + locality" call. The book-wide scan runs shallow, so these must be available without a deep crawl.	shallow (sitemap/nav)
Landing-page-quality composite: `https`, `mobile_friendly` (viewport), trust pages (`about`/`contact`/`privacy`/`terms`), content depth	Operationalises the principle that Google weighs a site to judge ad-appropriateness. Becomes the ads gate — don't recommend ads until the site clears the bar.	shallow
Ad-readiness tags: `ga4`, `gtm`, `google_ads_tag`, `conversion_tracking` (Meta Pixel already detected)	"Already invests / is ad-ready" → SEA/Optimiser fit. Absence = tracking-gap pitch.	shallow
Lead-capture surface: `contact_form`, `booking_vendor`, `click_to_call` (tel: links), `whatsapp`, `live_chat_vendor`, `quote_cta`	"Has a site but it can't convert" opportunities. We have has_forms/has_booking/has_chat today but not which tool or whether it actually captures.	shallow
Content freshness: `latest_post_date`, `post_count`, `copyright_year`	A stale site ≠ a maintained one. Gates "personal development → active blog" and flags abandoned sites.	shallow
On-site reputation: `testimonials`, `google_reviews_widget`, `aggregate_rating_schema`, `map_embed`, on-site `nap` (name/address/phone)	SayMore signals + NAP-consistency check against the GBP. Prefer JSON-LD `LocalBusiness` as the NAP source where present (cleaner than page-text scraping).	shallow
Brand / identity assets: `logo_url`, `favicon_url`, `brand_colors` (`primary`/`secondary` hex)	The deep-crawl-mimics-InSites gap — InSites DF returns logo + `colour_scheme`. Feeds the auto-built proposal/deck (`client_brand` block) so the pipeline stops depending on the separate `/prospect` WebFetch path for branding. Logo: `og:image` → `apple-touch-icon` → header/nav logo `<img>`. Colour: best-effort (`<meta theme-color>`, CSS custom props, header/CTA inline styles) — approximate, InSites samples rendered pixels; flag low confidence.	shallow
Page identity (PERSIST these): `title`, `meta_description`, `meta_keywords`, `h1`	Extracted today but thrown away on the prospect crawl write — no column, not in `crawl_raw`. Persist for advisor/UI display, on-page keyword seeds, and title/NAP checks.	shallow
Ecommerce depth: `has_checkout`, `platform`, `product_count`	Calibrates GetLocal CSS vs SayMore Retail vs eCommerce add-on.	deep
Performance proxy: `ttfb_ms`, `html_bytes`, `image_count`	Speed is part of Google's quality bar; full Core Web Vitals is too heavy, this is the cheap stand-in.	deep

(We already capture: title/meta, features has_booking/has_ecommerce/has_blog/has_gallery/has_chat/has_forms/has_phone/detected_cms, wordCount, pageCount, contentPageCount, sitemap, technologies, structured-data/JSON-LD, payment processors, AI-readiness, social links. Keep all of those.)

"Capture" ≠ "persist". The items above are returned by crawlWebsite(), but the prospect crawl (prospect-crawl-enrich.js) only writes what's in SHALLOW_COLS / crawl_raw — so title / meta / h1 are captured live yet stored nowhere on PROSPECT_LISTINGS today. Every new field below must land in a typed crawl_* column or rawBlob, or it's lost on the book write (see Where it plugs in).

Suggested fingerprint shape

{
  "crawl_schema_version": 2,
  "presence": {
    "https": true,
    "mobile_friendly": true,
    "page_count": 7,
    "content_page_count": 5,
    "service_page_count": 4,
    "location_page_count": 1,
    "trust_pages": { "about": true, "contact": true, "privacy": true, "terms": false },
    "lead_capture": { "contact_form": true, "booking_vendor": "Calendly", "click_to_call": true, "whatsapp": false, "live_chat_vendor": null, "quote_cta": true },
    "ad_readiness": { "ga4": true, "gtm": false, "google_ads_tag": false, "meta_pixel": true, "conversion_tracking": false },
    "freshness": { "latest_post_date": "2025-11-02", "post_count": 12, "copyright_year": 2026 },
    "reputation_onsite": { "testimonials": true, "google_reviews_widget": false, "aggregate_rating_schema": true, "map_embed": true, "nap": { "name": "...", "address": "...", "phone": "..." } },
    "identity": { "title": "...", "meta_description": "...", "h1": "...", "logo_url": "https://site.ie/logo.png", "favicon_url": "https://site.ie/favicon.ico", "brand_colors": { "primary": "#0b1e3f", "secondary": null, "confidence": "low" } },
    "ecommerce": { "has_checkout": false, "platform": null, "product_count": null },
    "performance": { "ttfb_ms": null, "html_bytes": null, "image_count": null },
    "landing_page_quality": 0.72
  }
}

landing_page_quality = a derived 0–1 composite of https + mobile + trust pages + content depth (the ad-appropriateness signal). It can be computed downstream if you'd rather the crawl emit only the raw components.

Where it plugs in

crawl-website.js — add presence (and crawl_schema_version) to the crawlWebsite() return object, alongside features / technical / aiReadiness / social.
compose-deal-brief.js — already passes the crawl response through as brief.website; the new fields flow in with no further change.
competitor-enrich — route through the same extractor so the dominators' fingerprints match the prospect's exactly.
prospect-crawl-enrich.js (persistence) — the book sweep writes only SHALLOW_COLS + crawl_raw. To make any presence/identity field survive on PROSPECT_LISTINGS (not just the live return), give it either a typed crawl_* column or an entry in rawBlob(). Suggested: typed columns for the high-value descriptive fields (crawl_title, crawl_meta_description, crawl_h1, crawl_logo_url, crawl_brand_primary) since the advisor/UI read them directly; the rest of the presence block folds into crawl_raw. All identity fields are homepage-derived → set on the shallow write, no deep pass needed. Bump crawl_schema_version when these land.

FCR Dashboard documentation · generated from docs/ · keep counts verified, not guessed.