Prospect Enrichment & Opportunity Index — as built

How a raw prospect listing is turned into a scored, searchable opportunity: the crawl → host-grain web dimension → paid enrichment (SerpAPI / Pleper / Ahrefs) → review mining → vector index → opportunity search. This documents what is live as of 2026-06-01 (counts are dated snapshots, not live).

Code: worker/src/handlers/prospect-*.js, scripts/crawl-tail/*, scripts/prospect-resolve/*. Related: Prospect Hub — Target Data Model, Prospect DB — Composition, Website Crawl, Website Crawl — Presence Fingerprint (proposed).

The pipeline at a glance

 PROSPECT_LISTINGS (spine, ~210K in-universe, listing grain)
        │
        ├─ place_id resolve (SerpAPI)  ── no-place_id rows → real Google place_id
        │
        ├─ crawl (GoldenPagesBot, free) ──▶ dim_website (host grain, 1/domain)
        │        │                              ├─ crawl-state + reachability + web attrs
        │        │                              └─ dns_status (DoH) → dead-domain detection
        │        │                                       └─▶ website_action_queue (flag, never delete)
        │
        ├─ Pleper (paid)   ── GBP reviews / rating / photos / services
        │
        ├─ Ahrefs (paid)   ──▶ dim_seo  (DR / org keywords / traffic for blocked-but-live sites)
        │
        └─ review mining (Haiku) ──▶ prospect_review_insights (unlisted services / hours / themes)
                 │
                 ▼
        fcr-prospect-opportunities  (Vectorize, 1 vector / listing)
                 │  query-time enrichment: dim_seo + review_gaps
                 ▼
        prospect-opportunity-search  →  find_prospect_opportunities (advisor tool)

Two cost tiers: the crawl is free and runs on every row with a real website; SerpAPI / Pleper / Ahrefs are paid and gated by enrich_enabled.

1. `dim_website` — host-grain web dimension

The crawler selects work by host (crawl_root), but crawl-state lived per-listing on the spine — a structural mismatch. dim_website (scripts/crawl-tail/build-dim-website.sql) is one row per domain, deduped freshest-crawl-wins, CLUSTER BY (crawl_next_at, crawl_root). A shared website is judged once and every listing on it inherits the verdict (e.g. the dead-domain prune), with no per-listing fan-out.

It carries crawl-state (crawl_next_at, crawl_slow_attempts), reachability (crawl_reachable, crawl_error, has_crawl_data), the full web-attribute set (CMS, SSL, pixels, schema, page counts, AI-bot-blocked flags, TTFB), and dns_status / dns_checked_at (filled by the DoH pass below). Richer presence-fingerprint attributes (trust-page composite, lead-capture vendor, freshness, performance) are proposed, not built — see Website Crawl — Presence Fingerprint (proposed).

2. Dead-domain detection (DoH) → `website_action_queue`

A 5xx/530 HTTP error does not mean a domain is dead — a Cloudflare 530/526 is often a live origin throwing errors or broken SSL. The authoritative signal is DNS. scripts/crawl-tail/doh-fill-dim-website.cjs runs a DNS-over-HTTPS pass over the error cohort and sets dns_status:

`dns_status`	Meaning
`has_A`	resolves — the HTTP error was a block / SSL / CF issue, not dead
`nxdomain`	confirmed dead — domain does not resolve
`registered_noA`	registered, no A record (parked)
`doh_error`	DoH inconclusive

Confirmed-dead (nxdomain) domains flow into fcr_operations.website_action_queue (build-website-action-queue.sql) — a flag/worklist, never a delete. Each dead URL is routed to the right publishing-platform action by where it is published:

`action_type`	Trigger	Action
`flag_gmb_dead_site`	dead URL is the GBP profile's website	flag GMB — live Google listing points at a dead site (owner-fix + sales hook)
`remove_or_replace_dead_url_gp`	dead URL is the goldenpages.ie listing's website	remove/replace the dead URL on the GP listing (directory quality)
`both`	published on both	both of the above
`review_gp_listing`	on GP file, neither surface matches	manual review
`flag_only_no_platform`	net-new prospect, no publishing surface	flag only

escalate_active_client flips any row to escalate when a paying client has a dead site. concede-nxdomain.sql then pushes crawl_next_at +180d so the cron stops re-hammering confirmed-dead hosts while still re-checking twice a year (a lapsed domain can be re-registered).

Crawl status (2026-06-01): complete across all live sites — of ~80,270 unique domains, 62,630 (78%) cleanly crawled. The ~17.6K failures break out by HTTP response:

Bucket	Domains	Dominant codes
5xx	8,675	530 (5,147 — CF origin dead) · 526 (2,450 — broken TLS) · 503/500 (~~550) · 52x (~~500)
4xx	5,976	403 (4,348 — bot-block on a live site) · 404 (1,344) · 429 (85)
Timeout	2,352	slow / unresponsive hosts
robots_blocked	573	site disallows crawlers

DoH-checked ~14.5K of the error cohort → 3,775 confirmed NXDOMAIN (the genuinely-dead subset of the 5xx) routed into the action queue (incl. ~43 active-client escalations). The 403 (live-but-blocked) cohort is where Ahrefs steps in (§5) — we can't fetch them, but they're real businesses worth an SEO read._

3. SerpAPI — place_id resolution

Many on-file listings carry no real Google place_id (just an LST: stub). scripts/prospect-resolve/* resolves them via SerpAPI under a name + location + category-class gate (phone-alone is dangerous — call centres, shared lines, chain dial-trees), scored on confidence / phone-match / website-match / geo-distance / category-match. CATEGORY_MAPPING (GP→Google) is the authoritative source-side classifier. Archive re-fetch (/searches/{id}.json) is free — always persist search_metadata.id. Resolved place_ids attach real Google data and unlock Pleper/review enrichment.

4. Pleper — GBP enrichment (paid)

worker/src/handlers/prospect-pleper-refresh.js pulls GBP reviews, rating, photo count and services for resolved place_ids. This is what populates GBP_REVIEWS / GBP_SERVICES (the inputs to review mining) and the review/photo signals used in the embedding and opportunity scoring. Gated by enrich_enabled.

GBP basics are held on the spine (review_count / rating / has_serp_data), not just in the GBP_INFORMATION sub-table (~13.5K, descriptions only). Coverage as of 2026-06-01: 84,841 of 118,607 real-place_id rows carry has_serp_data=TRUE (review_count 86,114, rating 82,567) — effectively all listings and place_id NOFs that have been enriched. The ~33K real-place_id rows still without it are recent SERP-archive harvest held behind the enrich_enabled gate.

5. Ahrefs — `dim_seo` for blocked-but-live sites (paid)

worker/src/handlers/prospect-seo-enrich.js fills fcr_operations.dim_seo via Ahrefs batch-analysis (100 domains/call). Ahrefs has its own crawl, so it reads the SEO profile of sites our crawler is 403'd/blocked on. Cohort: dim_website not-reachable AND dns-alive (403/forbidden error OR dns_status='has_A') AND not already in dim_seo — gated to live so units are never spent on dead domains.

Fields (per domain): domain_rating, url_rating, org_keywords, org_keywords_1_3 (top-3 visibility), org_traffic, org_cost, paid_keywords (already on Google Ads = warm lead), refdomains, backlinks. Dual purpose: enrichment for an unreachable site and an opportunity signal — low DR + zero organic keywords on a live site is a clean SEO pitch.

Snapshot 2026-06-01: 10,570 domains enriched (source='ahrefs_batch_live_blocked'); the defined live-blocked cohort is fully drained. ~59% rank for zero organic keywords; 392 already run Google Ads.

6. Review mining — `prospect_review_insights`

worker/src/handlers/prospect-review-mine.js reads deduped GBP_REVIEWS text (longest ~15 per place) and a Haiku model extracts, strictly from review text (invent nothing): services_mentioned, unlisted_services (mentioned but NOT on the GBP profile — the gap SayMore can correct), hours_signals, and themes (for outreach copy). Same verified-only standard as the NOF publish gate.

Snapshot 2026-06-01: 26,956 places mined (0 remaining); 93,642 unlisted-service gaps across 22,844 places.

7. `fcr-prospect-opportunities` — the vector index

worker/src/handlers/prospect-vector-reindex.js embeds one vector per in-universe listing (text-embedding-3-small, 1536-dim, cosine) into the fcr-prospect-opportunities Vectorize index. Driven offset-based over the ~210K spine (scripts/crawl-tail/resume-vector-reindex*.cjs).

The embedding doc leads with business_summary (the semantic core — see the data-model doc; generated separately and still being backfilled), then structured signals: business / category / segment / locality / listing status / web_state (no_website · dead_domain · unreachable · ai_blocked · thin_site · live) / GBP basics / derived opportunity hooks (digital greenfield, dead-site rebuild, AI-invisible, thin-site, no-reviews→SayMore). Metadata indexes the filter dims (b_segment, county, gate_band, web_state, is_active_client, has_listing) so retrieval can scope ("no-website Home Services in Cork we can auto-create").

dim_website joins by crawl_root to supply web_state and dns_status. The reindex deletes-by-id before insert, so an offset-overlap re-run is idempotent.

8. `prospect-opportunity-search` → `find_prospect_opportunities`

worker/src/handlers/prospect-opportunity-search.js is the query surface (advisor tool find_prospect_opportunities). It embeds the query, runs a metadata-filtered Vectorize search, then enriches the matched set at query time (not baked into the vector) with:

dim_seo — DR / organic keywords / traffic for each hit,
prospect_review_insights — review_gaps (unlisted_services + hours_signals) per place.

So each hit returns who + the opportunity (web gap, SEO gap, review gap) ready to action.

Operational notes

Self-draining drivers. Each enrichment handler selects its own "not-yet-done" cohort and reports remaining; the scripts/crawl-tail/drive-*.cjs drivers paginate until remaining=0. The reindex resumes by offset.
Cost discipline. Crawl is free; SerpAPI/Pleper/Ahrefs are unit-metered and gated. Rollup tables (dim_website, dim_seo) are built once and clustered — never scanned per-row in a loop (see BigQuery & Sync §7).
Verified-only. Every client-facing/embedded field is tool-fetched, never inferred — competitor names and external URLs only from a tool result (the CA-Recruitment no-fabrication rule).
Reindex after docs change is unrelated to this index — that's the docs-search bot; this is the prospect-opportunity index, re-embedded on spine/enrichment change.

FCR Dashboard documentation · generated from docs/ · keep counts verified, not guessed.

Prospect Enrichment & Opportunity Index — as built

The pipeline at a glance

1. dim_website — host-grain web dimension

2. Dead-domain detection (DoH) → website_action_queue

3. SerpAPI — place_id resolution

4. Pleper — GBP enrichment (paid)

5. Ahrefs — dim_seo for blocked-but-live sites (paid)

6. Review mining — prospect_review_insights

7. fcr-prospect-opportunities — the vector index