Website Crawl — Presence-Fingerprint Fields
For: whoever's extending the website crawl. From: dashboard team (Cathal). Date: 2026-05-24.
Code: worker/src/handlers/crawl-website.js (the crawlWebsite() return object).
Context: EARLY_STAGE_OPPORTUNITY_SCAN_SPEC_2026-05-24.md + docs/opportunity-scan-class-benchmark-input.md.
Why we're looking at this
We're building an early-stage opportunity scan: for a prospect we haven't spoken to yet, look at their real digital footprint and recommend the right FCR product to lead with. The guiding rule is competitive-relative — a prospect needs to be better than the businesses that dominate their category locally, not the best in the world. So we recommend the minimum product tier that clears the local/metro benchmark for their class.
That model only works if we can answer, per category and area: "what does the dominant set actually have, and where does this prospect fall short?" The website crawl is the fact source for both sides of that comparison — we crawl the prospect, and we crawl the competitors who dominate their patch. So the crawl needs to capture the same digital-presence dimensions, the same way, for everyone, so the comparison is apples-to-apples and we can compute a per-class benchmark to beat.
This brief lists the fields to add and the one architectural rule that makes the comparison valid.
The architectural rule (this is the important bit)
Emit one canonical "presence fingerprint", identical for prospect and competitor, shallow and deep. Deep-only fields come back null on a shallow crawl — never absent. Then computing "the dominant set's average" is just an aggregation over the same keys.
- Same extractor for the prospect crawl AND competitor-enrich. competitor-enrich must call the same code, not roll its own — otherwise the two sides aren't comparable.
- Emit raw counts, not just booleans, wherever we need a class median to beat:
page_count,service_page_count,location_page_count,on_site_review_count. - Add
crawl_schema_version(bump on any shape change) so re-crawls and backfills stay comparable, and the brief knows when a re-crawl is due. - Respect the budget split: shallow crawls run in the ~15s compose path, so shallow = homepage-HTML-only (regex/DOM, no extra fetches). Anything needing per-page or extra requests goes to the deep/full path.
Fields to add
| Field(s) | Why — which decision / benchmark it feeds | Depth |
|---|---|---|
service_page_count, location_page_count on shallow (today deep-only) |
The core "BizSite vs SitePro Multi vs SP20, sized to service range + locality" call. The book-wide scan runs shallow, so these must be available without a deep crawl. | shallow (sitemap/nav) |
Landing-page-quality composite: https, mobile_friendly (viewport), trust pages (about/contact/privacy/terms), content depth |
Operationalises the principle that Google weighs a site to judge ad-appropriateness. Becomes the ads gate — don't recommend ads until the site clears the bar. | shallow |
Ad-readiness tags: ga4, gtm, google_ads_tag, conversion_tracking (Meta Pixel already detected) |
"Already invests / is ad-ready" → SEA/Optimiser fit. Absence = tracking-gap pitch. | shallow |
Lead-capture surface: contact_form, booking_vendor, click_to_call (tel: links), whatsapp, live_chat_vendor, quote_cta |
"Has a site but it can't convert" opportunities. We have has_forms/has_booking/has_chat today but not which tool or whether it actually captures. | shallow |
Content freshness: latest_post_date, post_count, copyright_year |
A stale site ≠ a maintained one. Gates "personal development → active blog" and flags abandoned sites. | shallow |
On-site reputation: testimonials, google_reviews_widget, aggregate_rating_schema, map_embed, on-site nap (name/address/phone) |
SayMore signals + NAP-consistency check against the GBP. Prefer JSON-LD LocalBusiness as the NAP source where present (cleaner than page-text scraping). |
shallow |
Brand / identity assets: logo_url, favicon_url, brand_colors (primary/secondary hex) |
The deep-crawl-mimics-InSites gap — InSites DF returns logo + colour_scheme. Feeds the auto-built proposal/deck (client_brand block) so the pipeline stops depending on the separate /prospect WebFetch path for branding. Logo: og:image → apple-touch-icon → header/nav logo <img>. Colour: best-effort (<meta theme-color>, CSS custom props, header/CTA inline styles) — approximate, InSites samples rendered pixels; flag low confidence. |
shallow |
Page identity (PERSIST these): title, meta_description, meta_keywords, h1 |
Extracted today but thrown away on the prospect crawl write — no column, not in crawl_raw. Persist for advisor/UI display, on-page keyword seeds, and title/NAP checks. |
shallow |
Ecommerce depth: has_checkout, platform, product_count |
Calibrates GetLocal CSS vs SayMore Retail vs eCommerce add-on. | deep |
Performance proxy: ttfb_ms, html_bytes, image_count |
Speed is part of Google's quality bar; full Core Web Vitals is too heavy, this is the cheap stand-in. | deep |
(We already capture: title/meta, features has_booking/has_ecommerce/has_blog/has_gallery/has_chat/has_forms/has_phone/detected_cms, wordCount, pageCount, contentPageCount, sitemap, technologies, structured-data/JSON-LD, payment processors, AI-readiness, social links. Keep all of those.)
"Capture" ≠ "persist". The items above are returned by crawlWebsite(), but the prospect crawl (prospect-crawl-enrich.js) only writes what's in SHALLOW_COLS / crawl_raw — so title / meta / h1 are captured live yet stored nowhere on PROSPECT_LISTINGS today. Every new field below must land in a typed crawl_* column or rawBlob, or it's lost on the book write (see Where it plugs in).
Suggested fingerprint shape
{
"crawl_schema_version": 2,
"presence": {
"https": true,
"mobile_friendly": true,
"page_count": 7,
"content_page_count": 5,
"service_page_count": 4,
"location_page_count": 1,
"trust_pages": { "about": true, "contact": true, "privacy": true, "terms": false },
"lead_capture": { "contact_form": true, "booking_vendor": "Calendly", "click_to_call": true, "whatsapp": false, "live_chat_vendor": null, "quote_cta": true },
"ad_readiness": { "ga4": true, "gtm": false, "google_ads_tag": false, "meta_pixel": true, "conversion_tracking": false },
"freshness": { "latest_post_date": "2025-11-02", "post_count": 12, "copyright_year": 2026 },
"reputation_onsite": { "testimonials": true, "google_reviews_widget": false, "aggregate_rating_schema": true, "map_embed": true, "nap": { "name": "...", "address": "...", "phone": "..." } },
"identity": { "title": "...", "meta_description": "...", "h1": "...", "logo_url": "https://site.ie/logo.png", "favicon_url": "https://site.ie/favicon.ico", "brand_colors": { "primary": "#0b1e3f", "secondary": null, "confidence": "low" } },
"ecommerce": { "has_checkout": false, "platform": null, "product_count": null },
"performance": { "ttfb_ms": null, "html_bytes": null, "image_count": null },
"landing_page_quality": 0.72
}
}
landing_page_quality = a derived 0–1 composite of https + mobile + trust pages + content depth (the ad-appropriateness signal). It can be computed downstream if you'd rather the crawl emit only the raw components.
Where it plugs in
crawl-website.js— addpresence(andcrawl_schema_version) to thecrawlWebsite()return object, alongsidefeatures/technical/aiReadiness/social.compose-deal-brief.js— already passes the crawl response through asbrief.website; the new fields flow in with no further change.- competitor-enrich — route through the same extractor so the dominators' fingerprints match the prospect's exactly.
prospect-crawl-enrich.js(persistence) — the book sweep writes onlySHALLOW_COLS+crawl_raw. To make anypresence/identityfield survive onPROSPECT_LISTINGS(not just the live return), give it either a typedcrawl_*column or an entry inrawBlob(). Suggested: typed columns for the high-value descriptive fields (crawl_title,crawl_meta_description,crawl_h1,crawl_logo_url,crawl_brand_primary) since the advisor/UI read them directly; the rest of thepresenceblock folds intocrawl_raw. Allidentityfields are homepage-derived → set on the shallow write, no deep pass needed. Bumpcrawl_schema_versionwhen these land.
FCR Dashboard documentation · generated from docs/ · keep counts verified, not guessed.