Prospect Hub — target data model (listing-grain spine, place / website dims)

Design doc for refactoring the prospect data estate into a clean hub-and-dimension model. Today PROSPECT_LISTINGS is a wide table that mixes spine, place attributes, website attributes, crawl-state and provenance in one place.

Core principle (set by Cathal): everything hangs off a listing_id. No listing, no outreach. NOFs are not a parallel population — they are pre-listings that must be converted into listings before they can be worked. A prospect must have a subscriber_id and should have a listing_id. place_id and website are dimensions hung off the listing.

Status: DRAFT — for review. Author: Cathal + Claude, 2026-05-31. Dataset: listingmanager-1529856313699.fcr_operations (+ SUPERSET_MASTER reads).

1. Why refactor

Current pain (observed during the 2026-05 prospect-DB + crawl-tail work):

One wide listing-grain table does five jobs. PROSPECT_LISTINGS holds spine, place attributes, website attributes, per-listing crawl-state and provenance. Crawl-state is per-listing but the crawler selects by crawl_root (host) — a structural mismatch that forced host-group estimation hacks and per-listing fan-out.
No URL-source provenance. GP-vs-GMB URL source is not a flag; it had to be derived by joining four GBP/SerpAPI tables + dim_listings (scripts/crawl-tail/prune-candidates.sql).
Place attributes scattered across GBP_INFORMATION, serp_gbp_snapshots, serp_archive_extracts_*, v_gmb_locations, Pleper — no single resolved place record.
Fragile links. The place_id↔listing link lives inside the pipe-delimited urls_urls string; parsing it produced the 329 wrong-place_id linkages and malformed crawl_roots.
NOFs have no path to becoming workable. They sit as net-new rows with no listing_id/subscriber_id, so they can't enter the outreach funnel.

2. Grain — DECIDED: listing_id

The spine is keyed by listing_id. Hard rules:

Everything hangs off a listing — no outreach without a listing. Creating the listing is the prerequisite action of the funnel.
Every prospect carries a subscriber_id (the on-GP-file key) and should carry a listing_id. A NOF gets both minted when it is converted to a listing — that is precisely what "on file" means.
NOF = a pre-listing staging state, not a separate table of permanent residents. Unmatched NOFs are promoted into listings (noindex/nofollow) to enter the funnel. So effectively prospect = listings, where NOFs have been converted in.
place_id and website are dimensions hung off the listing (M:N), carrying the Google-side and web-side attributes respectively.

(An earlier draft proposed a synthetic business-grain key — rejected: the funnel, the inbound lead, and CRM are all listing-anchored.)

3. Target model

  SUPERSET listings (GP/GetLocal, on-file) ┐
                                            ├──▶  PROSPECT SPINE  ───────────┐
  NOF (converted into listings)  ───────────┘   1 row / listing_id           │
                                                 listing_id (PK)             │
                                                 subscriber_id (required)    │
                                                 status (lifecycle, §5)      │
                                                 segment / category / county │
                                                 is_active_client            │
                                                        │                    │
                        ┌────────────────────────────────┼────────────────────┘
                        │ xref (M:N)                      │ xref (M:N)
                        ▼                                 ▼
                  dim_place (1/place_id)            dim_website (1/domain)
                   ├ name/addr/cat/hours/phone        ├ crawl_root (PK)
                   ├ website (resolved)               ├ crawl-state: next_at, attempts, backoff
                   ├ reviews/rating (Pleper)          ├ reachability: has_A / nxdomain / live / 530
                   ├ source + observed_at /attr       ├ attrs: tech, pixels, schema, AI-blocked,
                   ├ verified, superseded_by          │        booking/ecom/blog, TTFB, pages
                   └ coverage: serpapi/pleper/         └ source + observed_at
                     managed_gmb flags

  FACTS (time-series — NOT in the dims):
    fact_gmb_monthly   (place_id, month, impressions, calls, clicks, directions…)  ← managed GMB API
    fact_web_monthly   (crawl_root, month, GSC clicks/impr, GA4 sessions…)

3.1 Entities

prospect spine — listing_id PK, subscriber_id (required), status (§5), segment/category/county, is_active_client. Holds nothing that belongs to a place or website — those are dims.
dim_place — one resolved row per place_id, merging SerpAPI + Pleper + managed-GMB attributes (current, slowly-changing). Per-source source + observed_at; verified; superseded_by (place_id churn); coverage flags has_serpapi/has_pleper/has_managed_gmb.
dim_website — one row per domain (crawl_root PK). Crawl-state moves here (crawl_next_at, crawl_slow_attempts, backoff) — host grain matches the crawler's selection. Reachability (has_A, nxdomain, live, last DoH check) + crawl attributes.
xref tables — xref_listing_place(listing_id, place_id, link_type, confidence, source, verified) and xref_listing_website(listing_id, crawl_root, link_type, confidence, source). Relationships are M:N (one website ↔ many listings; one place_id ↔ many listings — the 3,609 multi-listing + 329 wrong-link cohorts live here as low-confidence/unverified rows). urls_urls keeps the place_id link as the GP-facing artifact but is no longer the join mechanism.
facts — monthly metrics keyed by place_id (GMB) / crawl_root (web). Never folded into the dims.
dim_ad_presence (future — see §11) — one row per listing's ad footprint: which platforms it advertises on (Google / Meta), active flag, first/last seen, ad count, sample creatives + landing URLs.

4. NOF → listing lifecycle (the funnel)

NOF conversion is listing creation. State machine on the spine status:

 nof_unmatched
     │  create listing: mint subscriber_id + listing_id, set NOINDEX/NOFOLLOW
     ▼
 listing_draft ──find email──▶ outreach_confirm ────auto-publish after set time────▶ published
   (noindex)                   (ask to confirm/                                    (indexed,
                                correct DATA, not sell)                             unless deleted)
                                        │
                  ┌─────────────────────┼──────────────────────┐
              "delete"               "amend"                 silence
                  │                      │                       │
                delete            nurture  (CHANGED)           nurture
                                  was: sales outreach.
                                  Too early — only the
                                  confirm action belonged
                                  at this stage.

Confirm stage sells nothing — it asks the business to confirm/correct their listing data.
amend → nurture (changed). Previously an amend triggered sales outreach; that proved too early. Only the confirm action was appropriate at that point, so amend now routes to nurture, same as silence. (Both amend and silence → nurture; only delete removes.)
noindex/nofollow during the confirm window; auto-publish (index) after the set time unless the prospect asked to delete.
A listing that points at a dead domain (from dim_website.dns_status='nxdomain') is flagged for a publishing-platform action, never deleted — routed by surface via website_action_queue (remove/replace the dead URL on the goldenpages.ie listing, or flag the GMB profile + sales hook). A dead URL on a live GP/GMB surface is a defect to action on that platform, not a row to remove.

5. The #1-inbound-lead risk (LIVE DEBATE — decide before mass auto-creation)

FCR's top inbound sales lead is listing creation — a business creating/claiming a listing is a high-intent signal we currently catch and convert. Auto-creating noindex listings for the whole NOF pool risks cannibalising that inbound: if a listing already exists, the prospect may no longer perform the high-intent create action.

Points to weigh (unresolved):

noindex/nofollow keeps the auto-listing out of Google search, so a self-googling prospect still finds nothing of theirs there → may still create one (inbound preserved). But it is visible directly on goldenpages.ie, which could short-circuit the create.
Consider excluding high-intent / high-value categories from auto-creation, or holding them in listing_draft without publishing.
Instrument it: run a holdout and compare the inbound listing-creation rate with/without auto-creation before rolling out at scale.

This is the central trade-off of the whole NOF programme — the data model should support measuring it (status history + source of each listing: auto-created vs self-created).

6. Vector index off the spine (for outreach)

Grain = the listing (= prospect spine). Embed one doc per listing.
Embed the gaps, lead with the hooks. Doc text leads with outreach levers — no website / dead site / AI-invisible (GPTBot-403) / no reviews / weak tech (no booking, no schema) / not on Google — then name, category, county, services, review summary.
Reuse existing Vectorize infra (deal-vector / similar-sites pattern); re-embed on spine change. Interrogate it to design outreach copy per cohort.

7. Cross-cutting

Provenance everywhere. Every dim attribute carries source + observed_at (+ confidence where blended). Resolution = source priority (canonical GBP > serp_gbp_snapshots > managed GMB > archive harvest), materialised — not an opaque COALESCE. Repeatedly burned by stale blends (Yext cache, human-reviewed GMB→PL place_id gap).
place_id is not eternal. Google merges/closes/renumbers. verified + superseded_by + periodic re-resolve, or the place hub rots.
Verified-data publish gate. A published listing must render only tool-verified place/website data we actually fetched — never inferred competitor names or constructed URLs (the CA-Recruitment fabrication rule). Also protects GP domain quality from thin auto-pages (noindex until published).
Clustering / cost. dim_website CLUSTER BY (crawl_next_at, crawl_root); dim_place CLUSTER BY place_id; spine CLUSTER BY (status, segment). Removes the unclustered-full-scan-per-cron cost.
Writes stay governed. Spine/dims are the read surface for advisor + card + dashboard + marketing; production writes via worker/cron, marketing via n8n (per CLAUDE.md). Dims read-only to marketing.

8. Column → dim mapping (reshape, don't rebuild)

PROSPECT_LISTINGS (88K+ pids) is re-shaped, not discarded:

Today (PROSPECT_LISTINGS / source)	Target
`subscriber_id`, `segment`, classification, `is_active_client`	spine (+ mint `listing_id` for NOFs)
`place_id`, `place_id_source`	spine→place via `xref_listing_place` (`.source`)
`crawl_root`, `crawl_next_at`, `crawl_slow_attempts`, `crawl_error`, `crawl_reachable`	dim_website (crawl-state + reachability)
`crawl_has_*` pixels, `has_schema`, `crawl_ai_bots_blocked`, `crawl_ai_edge_blocked`, `crawl_page_count`, `crawl_response_ms`	dim_website (attributes)
`GBP_INFORMATION`/`serp_gbp_snapshots`/`serp_archive_extracts_*`/`v_gmb_locations`	dim_place (COALESCE w/ source priority + observed_at)
`GBP_REVIEWS` / Pleper	dim_place (reviews/rating/attrs)
`2_GMB_Business_Metrics`/`v_gmb_*` monthly	fact_gmb_monthly
`dim_listings_with_subscriber_id.urls_urls/urls_types/gl_crawlDomain`	listing GP website → xref + kept as GP artifact

The 530/NXDOMAIN dead-domain work plugs in: the verdict is one dim_website.dns_status value that every listing on that domain inherits (no per-listing loop), and website_action_queue flags each for the right publishing-platform action — never a delete.

9. Open decisions

Inbound-lead trade-off (§5) — auto-create scope / holdout — blocks mass NOF creation.
Publish semantics — does "auto-publish after set time" mean index on goldenpages, and does it ever stay noindex?
Identity for multi-location — one subscriber_id with many listing_ids, or many subscribers? (drives xref + CRM rollup.)
dim_website key — registered domain vs crawl_root host (www vs apex). Recommend registered domain, normalise variants.
Facts — reuse SUPERSET_MASTER (2_GMB_Business_Metrics etc.) via view, or copy into fcr_operations. Recommend reuse.
Ads-transparency join key (§11) — advertiser identity → listing, via landing domain (Google) / Facebook Page (Meta). Needs fuzzy identity resolution.

10. Suggested build order

dim_website first (clear win): migrate crawl-state off PROSPECT_LISTINGS, wire DoH reachability, repoint crawler to host grain — feeds the dead-domain flag + website_action_queue.
dim_place with source-priority + provenance (formalise prune-candidates.sql's COALESCE).
Spine + xref, backfill from PROSPECT_LISTINGS; mint subscriber_id/listing_id for converted NOFs.
Lifecycle status + NOF→listing funnel (incl. auto-create vs self-create source flag for the §5 measurement).
Vector index off the spine.

11. Future data sources — ads transparency (Google + Meta)

To be added later: Google Ads Transparency Center + Meta Ads Library / Transparency Center. These reveal whether a prospect is already advertising and what they run.

Signal value (high for sales):
- Already advertising = has budget + is marketing-active → warm for FCR paid services, or a competitor-displacement target.
- Not advertising while category peers are → a clean gap pitch.
- Ad creatives + landing pages + messaging feed the outreach vector (what they push, where they send traffic).
Lands as dim_ad_presence (attributes per listing) + optionally a fact_ad_activity time-series (ads run over date ranges, active/paused). Provenance + observed_at like every other dim.
The hard part is the join key (open decision): the transparency centres key on advertiser identity, not place_id —
- Google Ads Transparency → verified advertiser, usually domain-anchored → join via landing/display domain → dim_website → listing.
- Meta Ads Library → Facebook Page identity (has an official API) → join via the Page URL we already capture in urls_urls (FACEBOOK type), or page-name match.
- Both need an identity-resolution step + confidence, same pattern as place_id resolution; budget for fuzzy matching, not a clean key.
Access/ToS note: Meta Ad Library has an official API; Google's Transparency Center is more scrape-bound — confirm sanctioned access before building, and treat creatives as public-but-attributed data.

12. Business summary generation (embedding core + listing content + outreach)

A generated business summary / long-description per prospect is the semantic core of the vector index (drives look-alike retrieval), AND a reusable asset: it doubles as the NOF listing's long-description (a published noindex listing needs body content) and the outreach personalisation raw material. Generate once, reuse three ways. Stored on the spine: PROSPECT_LISTINGS.business_summary (+ business_summary_at, business_summary_source) — added 2026-05-31. The reindex handler already leads its embedding doc with it (falls back to structured signals until generated).

Inputs (ALL tool-fetched — verified-only, never invented):

GBP_INFORMATION.description — the business's own GBP blurb (gold). Sparse now (~6,729) — grows with Pleper.
GBP_SERVICES (service_name/service_description/service_price) — what they offer.
PROSPECT_LISTINGS.crawl_raw (homepage title/meta/text) — site content; ~56,688 have usable text (crawl_word_count>50).
primary_category/gp_category, city/county, review_count/rating.

Prioritisation tiers (of ~210K universe):

Tier 1 — LLM summary (~58,510 = has GBP description OR crawl text): the addressable/outreachable set. Haiku-class model.
Tier 2 — templated, no LLM (~150K thin): deterministic "{name} is a {category} in {town}, {county}." — free, no fabrication risk.
Defer dead-domain/junk.

Model + guardrail: Haiku-class. Prompt: "Summarise THIS business in ~80–150 words, third-person, factual, from the PROVIDED GBP/website/services text ONLY. Do NOT invent services, claims, awards, or competitors. If inputs are thin, output only what's verifiable." Enforces the CA-Recruitment no-fabrication rule — same standard required to publish the NOF listing. Record business_summary_source (which inputs were present) for auditability.

Cost: Tier-1 ~58.5K × ~1.2K tokens ≈ ~$90 on Haiku (one-off; regenerate only on input change). Tier-2 free.

Pipeline: a batch handler (dashboard-prospect-summary-generate) — pull N prospects WHERE business_summary IS NULL AND has inputs, assemble GBP desc + services + crawl text, call model, write back; paginated/driver like the crawl + reindex drivers. Idempotent; regenerate when crawl/GBP refreshes (deal-brief-style staleness). Feeds embedding (re-embed incrementally) + listing content + outreach.

Related: docs/hubspot-data-architecture.md, docs/bigquery-and-sync.md, docs/prospect-listings-composition-2026-05-28.md, scripts/crawl-tail/.

FCR Dashboard documentation · generated from docs/ · keep counts verified, not guessed.