Prospect Enrichment & Opportunity Index — as built
How a raw prospect listing is turned into a scored, searchable opportunity: the crawl → host-grain web dimension → paid enrichment (SerpAPI / Pleper / Ahrefs) → review mining → vector index → opportunity search. This documents what is live as of 2026-06-01 (counts are dated snapshots, not live).
Code: worker/src/handlers/prospect-*.js, scripts/crawl-tail/*, scripts/prospect-resolve/*. Related: Prospect Hub — Target Data Model, Prospect DB — Composition, Website Crawl, Website Crawl — Presence Fingerprint (proposed).
The pipeline at a glance
PROSPECT_LISTINGS (spine, ~210K in-universe, listing grain)
│
├─ place_id resolve (SerpAPI) ── no-place_id rows → real Google place_id
│
├─ crawl (GoldenPagesBot, free) ──▶ dim_website (host grain, 1/domain)
│ │ ├─ crawl-state + reachability + web attrs
│ │ └─ dns_status (DoH) → dead-domain detection
│ │ └─▶ website_action_queue (flag, never delete)
│
├─ Pleper (paid) ── GBP reviews / rating / photos / services
│
├─ Ahrefs (paid) ──▶ dim_seo (DR / org keywords / traffic for blocked-but-live sites)
│
└─ review mining (Haiku) ──▶ prospect_review_insights (unlisted services / hours / themes)
│
▼
fcr-prospect-opportunities (Vectorize, 1 vector / listing)
│ query-time enrichment: dim_seo + review_gaps
▼
prospect-opportunity-search → find_prospect_opportunities (advisor tool)
Two cost tiers: the crawl is free and runs on every row with a real website; SerpAPI / Pleper / Ahrefs are paid and gated by enrich_enabled.
1. dim_website — host-grain web dimension
The crawler selects work by host (crawl_root), but crawl-state lived per-listing on the spine — a structural mismatch. dim_website (scripts/crawl-tail/build-dim-website.sql) is one row per domain, deduped freshest-crawl-wins, CLUSTER BY (crawl_next_at, crawl_root). A shared website is judged once and every listing on it inherits the verdict (e.g. the dead-domain prune), with no per-listing fan-out.
It carries crawl-state (crawl_next_at, crawl_slow_attempts), reachability (crawl_reachable, crawl_error, has_crawl_data), the full web-attribute set (CMS, SSL, pixels, schema, page counts, AI-bot-blocked flags, TTFB), and dns_status / dns_checked_at (filled by the DoH pass below). Richer presence-fingerprint attributes (trust-page composite, lead-capture vendor, freshness, performance) are proposed, not built — see Website Crawl — Presence Fingerprint (proposed).
2. Dead-domain detection (DoH) → website_action_queue
A 5xx/530 HTTP error does not mean a domain is dead — a Cloudflare 530/526 is often a live origin throwing errors or broken SSL. The authoritative signal is DNS. scripts/crawl-tail/doh-fill-dim-website.cjs runs a DNS-over-HTTPS pass over the error cohort and sets dns_status:
dns_status |
Meaning |
|---|---|
has_A |
resolves — the HTTP error was a block / SSL / CF issue, not dead |
nxdomain |
confirmed dead — domain does not resolve |
registered_noA |
registered, no A record (parked) |
doh_error |
DoH inconclusive |
Confirmed-dead (nxdomain) domains flow into fcr_operations.website_action_queue (build-website-action-queue.sql) — a flag/worklist, never a delete. Each dead URL is routed to the right publishing-platform action by where it is published:
action_type |
Trigger | Action |
|---|---|---|
flag_gmb_dead_site |
dead URL is the GBP profile's website | flag GMB — live Google listing points at a dead site (owner-fix + sales hook) |
remove_or_replace_dead_url_gp |
dead URL is the goldenpages.ie listing's website | remove/replace the dead URL on the GP listing (directory quality) |
both |
published on both | both of the above |
review_gp_listing |
on GP file, neither surface matches | manual review |
flag_only_no_platform |
net-new prospect, no publishing surface | flag only |
escalate_active_client flips any row to escalate when a paying client has a dead site. concede-nxdomain.sql then pushes crawl_next_at +180d so the cron stops re-hammering confirmed-dead hosts while still re-checking twice a year (a lapsed domain can be re-registered).
Crawl status (2026-06-01): complete across all live sites — of ~80,270 unique domains, 62,630 (78%) cleanly crawled. The ~17.6K failures break out by HTTP response:
| Bucket | Domains | Dominant codes |
|---|---|---|
| 5xx | 8,675 | 530 (5,147 — CF origin dead) · 526 (2,450 — broken TLS) · 503/500 ( |
| 4xx | 5,976 | 403 (4,348 — bot-block on a live site) · 404 (1,344) · 429 (85) |
| Timeout | 2,352 | slow / unresponsive hosts |
| robots_blocked | 573 | site disallows crawlers |
DoH-checked ~14.5K of the error cohort → 3,775 confirmed NXDOMAIN (the genuinely-dead subset of the 5xx) routed into the action queue (incl. ~43 active-client escalations). The 403 (live-but-blocked) cohort is where Ahrefs steps in (§5) — we can't fetch them, but they're real businesses worth an SEO read._
3. SerpAPI — place_id resolution
Many on-file listings carry no real Google place_id (just an LST: stub). scripts/prospect-resolve/* resolves them via SerpAPI under a name + location + category-class gate (phone-alone is dangerous — call centres, shared lines, chain dial-trees), scored on confidence / phone-match / website-match / geo-distance / category-match. CATEGORY_MAPPING (GP→Google) is the authoritative source-side classifier. Archive re-fetch (/searches/{id}.json) is free — always persist search_metadata.id. Resolved place_ids attach real Google data and unlock Pleper/review enrichment.
4. Pleper — GBP enrichment (paid)
worker/src/handlers/prospect-pleper-refresh.js pulls GBP reviews, rating, photo count and services for resolved place_ids. This is what populates GBP_REVIEWS / GBP_SERVICES (the inputs to review mining) and the review/photo signals used in the embedding and opportunity scoring. Gated by enrich_enabled.
GBP basics are held on the spine (review_count / rating / has_serp_data), not just in the GBP_INFORMATION sub-table (~13.5K, descriptions only). Coverage as of 2026-06-01: 84,841 of 118,607 real-place_id rows carry has_serp_data=TRUE (review_count 86,114, rating 82,567) — effectively all listings and place_id NOFs that have been enriched. The ~33K real-place_id rows still without it are recent SERP-archive harvest held behind the enrich_enabled gate.
5. Ahrefs — dim_seo for blocked-but-live sites (paid)
worker/src/handlers/prospect-seo-enrich.js fills fcr_operations.dim_seo via Ahrefs batch-analysis (100 domains/call). Ahrefs has its own crawl, so it reads the SEO profile of sites our crawler is 403'd/blocked on. Cohort: dim_website not-reachable AND dns-alive (403/forbidden error OR dns_status='has_A') AND not already in dim_seo — gated to live so units are never spent on dead domains.
Fields (per domain): domain_rating, url_rating, org_keywords, org_keywords_1_3 (top-3 visibility), org_traffic, org_cost, paid_keywords (already on Google Ads = warm lead), refdomains, backlinks. Dual purpose: enrichment for an unreachable site and an opportunity signal — low DR + zero organic keywords on a live site is a clean SEO pitch.
Snapshot 2026-06-01: 10,570 domains enriched (source='ahrefs_batch_live_blocked'); the defined live-blocked cohort is fully drained. ~59% rank for zero organic keywords; 392 already run Google Ads.
6. Review mining — prospect_review_insights
worker/src/handlers/prospect-review-mine.js reads deduped GBP_REVIEWS text (longest ~15 per place) and a Haiku model extracts, strictly from review text (invent nothing): services_mentioned, unlisted_services (mentioned but NOT on the GBP profile — the gap SayMore can correct), hours_signals, and themes (for outreach copy). Same verified-only standard as the NOF publish gate.
Snapshot 2026-06-01: 26,956 places mined (0 remaining); 93,642 unlisted-service gaps across 22,844 places.
7. fcr-prospect-opportunities — the vector index
worker/src/handlers/prospect-vector-reindex.js embeds one vector per in-universe listing (text-embedding-3-small, 1536-dim, cosine) into the fcr-prospect-opportunities Vectorize index. Driven offset-based over the ~210K spine (scripts/crawl-tail/resume-vector-reindex*.cjs).
The embedding doc leads with business_summary (the semantic core — see the data-model doc; generated separately and still being backfilled), then structured signals: business / category / segment / locality / listing status / web_state (no_website · dead_domain · unreachable · ai_blocked · thin_site · live) / GBP basics / derived opportunity hooks (digital greenfield, dead-site rebuild, AI-invisible, thin-site, no-reviews→SayMore). Metadata indexes the filter dims (b_segment, county, gate_band, web_state, is_active_client, has_listing) so retrieval can scope ("no-website Home Services in Cork we can auto-create").
dim_website joins by crawl_root to supply web_state and dns_status. The reindex deletes-by-id before insert, so an offset-overlap re-run is idempotent.
8. prospect-opportunity-search → find_prospect_opportunities
worker/src/handlers/prospect-opportunity-search.js is the query surface (advisor tool find_prospect_opportunities). It embeds the query, runs a metadata-filtered Vectorize search, then enriches the matched set at query time (not baked into the vector) with:
dim_seo— DR / organic keywords / traffic for each hit,prospect_review_insights—review_gaps(unlisted_services + hours_signals) per place.
So each hit returns who + the opportunity (web gap, SEO gap, review gap) ready to action.
Operational notes
- Self-draining drivers. Each enrichment handler selects its own "not-yet-done" cohort and reports
remaining; thescripts/crawl-tail/drive-*.cjsdrivers paginate untilremaining=0. The reindex resumes by offset. - Cost discipline. Crawl is free; SerpAPI/Pleper/Ahrefs are unit-metered and gated. Rollup tables (
dim_website,dim_seo) are built once and clustered — never scanned per-row in a loop (see BigQuery & Sync §7). - Verified-only. Every client-facing/embedded field is tool-fetched, never inferred — competitor names and external URLs only from a tool result (the CA-Recruitment no-fabrication rule).
- Reindex after docs change is unrelated to this index — that's the docs-search bot; this is the prospect-opportunity index, re-embedded on spine/enrichment change.
FCR Dashboard documentation · generated from docs/ · keep counts verified, not guessed.