Prospect Hub — target data model (listing-grain spine, place / website dims)
Design doc for refactoring the prospect data estate into a clean hub-and-dimension model. Today PROSPECT_LISTINGS is a wide table that mixes spine, place attributes, website attributes, crawl-state and provenance in one place.
Core principle (set by Cathal): everything hangs off a listing_id. No listing, no outreach. NOFs are not a parallel population — they are pre-listings that must be converted into listings before they can be worked. A prospect must have a
subscriber_idand should have alisting_id. place_id and website are dimensions hung off the listing.Status: DRAFT — for review. Author: Cathal + Claude, 2026-05-31. Dataset:
listingmanager-1529856313699.fcr_operations(+SUPERSET_MASTERreads).
1. Why refactor
Current pain (observed during the 2026-05 prospect-DB + crawl-tail work):
- One wide listing-grain table does five jobs. PROSPECT_LISTINGS holds spine, place attributes, website attributes, per-listing crawl-state and provenance. Crawl-state is per-listing but the crawler selects by
crawl_root(host) — a structural mismatch that forced host-group estimation hacks and per-listing fan-out. - No URL-source provenance. GP-vs-GMB URL source is not a flag; it had to be derived by joining four GBP/SerpAPI tables + dim_listings (
scripts/crawl-tail/prune-candidates.sql). - Place attributes scattered across
GBP_INFORMATION,serp_gbp_snapshots,serp_archive_extracts_*,v_gmb_locations, Pleper — no single resolved place record. - Fragile links. The place_id↔listing link lives inside the pipe-delimited
urls_urlsstring; parsing it produced the 329 wrong-place_id linkages and malformed crawl_roots. - NOFs have no path to becoming workable. They sit as net-new rows with no listing_id/subscriber_id, so they can't enter the outreach funnel.
2. Grain — DECIDED: listing_id
The spine is keyed by listing_id. Hard rules:
- Everything hangs off a listing — no outreach without a listing. Creating the listing is the prerequisite action of the funnel.
- Every prospect carries a
subscriber_id(the on-GP-file key) and should carry alisting_id. A NOF gets both minted when it is converted to a listing — that is precisely what "on file" means. - NOF = a pre-listing staging state, not a separate table of permanent residents. Unmatched NOFs are promoted into listings (noindex/nofollow) to enter the funnel. So effectively prospect = listings, where NOFs have been converted in.
- place_id and website are dimensions hung off the listing (M:N), carrying the Google-side and web-side attributes respectively.
(An earlier draft proposed a synthetic business-grain key — rejected: the funnel, the inbound lead, and CRM are all listing-anchored.)
3. Target model
SUPERSET listings (GP/GetLocal, on-file) ┐
├──▶ PROSPECT SPINE ───────────┐
NOF (converted into listings) ───────────┘ 1 row / listing_id │
listing_id (PK) │
subscriber_id (required) │
status (lifecycle, §5) │
segment / category / county │
is_active_client │
│ │
┌────────────────────────────────┼────────────────────┘
│ xref (M:N) │ xref (M:N)
▼ ▼
dim_place (1/place_id) dim_website (1/domain)
├ name/addr/cat/hours/phone ├ crawl_root (PK)
├ website (resolved) ├ crawl-state: next_at, attempts, backoff
├ reviews/rating (Pleper) ├ reachability: has_A / nxdomain / live / 530
├ source + observed_at /attr ├ attrs: tech, pixels, schema, AI-blocked,
├ verified, superseded_by │ booking/ecom/blog, TTFB, pages
└ coverage: serpapi/pleper/ └ source + observed_at
managed_gmb flags
FACTS (time-series — NOT in the dims):
fact_gmb_monthly (place_id, month, impressions, calls, clicks, directions…) ← managed GMB API
fact_web_monthly (crawl_root, month, GSC clicks/impr, GA4 sessions…)
3.1 Entities
- prospect spine —
listing_idPK,subscriber_id(required),status(§5),segment/category/county,is_active_client. Holds nothing that belongs to a place or website — those are dims. - dim_place — one resolved row per
place_id, merging SerpAPI + Pleper + managed-GMB attributes (current, slowly-changing). Per-sourcesource+observed_at;verified;superseded_by(place_id churn); coverage flagshas_serpapi/has_pleper/has_managed_gmb. - dim_website — one row per domain (
crawl_rootPK). Crawl-state moves here (crawl_next_at,crawl_slow_attempts, backoff) — host grain matches the crawler's selection. Reachability (has_A,nxdomain,live, last DoH check) + crawl attributes. - xref tables —
xref_listing_place(listing_id, place_id, link_type, confidence, source, verified)andxref_listing_website(listing_id, crawl_root, link_type, confidence, source). Relationships are M:N (one website ↔ many listings; one place_id ↔ many listings — the 3,609 multi-listing + 329 wrong-link cohorts live here as low-confidence/unverified rows).urls_urlskeeps the place_id link as the GP-facing artifact but is no longer the join mechanism. - facts — monthly metrics keyed by place_id (GMB) / crawl_root (web). Never folded into the dims.
- dim_ad_presence (future — see §11) — one row per listing's ad footprint: which platforms it advertises on (Google / Meta), active flag, first/last seen, ad count, sample creatives + landing URLs.
4. NOF → listing lifecycle (the funnel)
NOF conversion is listing creation. State machine on the spine status:
nof_unmatched
│ create listing: mint subscriber_id + listing_id, set NOINDEX/NOFOLLOW
▼
listing_draft ──find email──▶ outreach_confirm ────auto-publish after set time────▶ published
(noindex) (ask to confirm/ (indexed,
correct DATA, not sell) unless deleted)
│
┌─────────────────────┼──────────────────────┐
"delete" "amend" silence
│ │ │
delete nurture (CHANGED) nurture
was: sales outreach.
Too early — only the
confirm action belonged
at this stage.
- Confirm stage sells nothing — it asks the business to confirm/correct their listing data.
- amend → nurture (changed). Previously an amend triggered sales outreach; that proved too early. Only the confirm action was appropriate at that point, so amend now routes to nurture, same as silence. (Both amend and silence → nurture; only delete removes.)
- noindex/nofollow during the confirm window; auto-publish (index) after the set time unless the prospect asked to delete.
- A listing that points at a dead domain (from
dim_website.dns_status='nxdomain') is flagged for a publishing-platform action, never deleted — routed by surface viawebsite_action_queue(remove/replace the dead URL on the goldenpages.ie listing, or flag the GMB profile + sales hook). A dead URL on a live GP/GMB surface is a defect to action on that platform, not a row to remove.
5. The #1-inbound-lead risk (LIVE DEBATE — decide before mass auto-creation)
FCR's top inbound sales lead is listing creation — a business creating/claiming a listing is a high-intent signal we currently catch and convert. Auto-creating noindex listings for the whole NOF pool risks cannibalising that inbound: if a listing already exists, the prospect may no longer perform the high-intent create action.
Points to weigh (unresolved):
- noindex/nofollow keeps the auto-listing out of Google search, so a self-googling prospect still finds nothing of theirs there → may still create one (inbound preserved). But it is visible directly on goldenpages.ie, which could short-circuit the create.
- Consider excluding high-intent / high-value categories from auto-creation, or holding them in
listing_draftwithout publishing. - Instrument it: run a holdout and compare the inbound listing-creation rate with/without auto-creation before rolling out at scale.
This is the central trade-off of the whole NOF programme — the data model should support measuring it (status history + source of each listing: auto-created vs self-created).
6. Vector index off the spine (for outreach)
- Grain = the listing (= prospect spine). Embed one doc per listing.
- Embed the gaps, lead with the hooks. Doc text leads with outreach levers — no website / dead site / AI-invisible (GPTBot-403) / no reviews / weak tech (no booking, no schema) / not on Google — then name, category, county, services, review summary.
- Reuse existing Vectorize infra (deal-vector / similar-sites pattern); re-embed on spine change. Interrogate it to design outreach copy per cohort.
7. Cross-cutting
- Provenance everywhere. Every dim attribute carries
source+observed_at(+confidencewhere blended). Resolution = source priority (canonical GBP > serp_gbp_snapshots > managed GMB > archive harvest), materialised — not an opaque COALESCE. Repeatedly burned by stale blends (Yext cache, human-reviewed GMB→PL place_id gap). - place_id is not eternal. Google merges/closes/renumbers.
verified+superseded_by+ periodic re-resolve, or the place hub rots. - Verified-data publish gate. A
publishedlisting must render only tool-verified place/website data we actually fetched — never inferred competitor names or constructed URLs (the CA-Recruitment fabrication rule). Also protects GP domain quality from thin auto-pages (noindex until published). - Clustering / cost. dim_website CLUSTER BY (crawl_next_at, crawl_root); dim_place CLUSTER BY place_id; spine CLUSTER BY (status, segment). Removes the unclustered-full-scan-per-cron cost.
- Writes stay governed. Spine/dims are the read surface for advisor + card + dashboard + marketing; production writes via worker/cron, marketing via n8n (per CLAUDE.md). Dims read-only to marketing.
8. Column → dim mapping (reshape, don't rebuild)
PROSPECT_LISTINGS (88K+ pids) is re-shaped, not discarded:
| Today (PROSPECT_LISTINGS / source) | Target |
|---|---|
subscriber_id, segment, classification, is_active_client |
spine (+ mint listing_id for NOFs) |
place_id, place_id_source |
spine→place via xref_listing_place (.source) |
crawl_root, crawl_next_at, crawl_slow_attempts, crawl_error, crawl_reachable |
dim_website (crawl-state + reachability) |
crawl_has_* pixels, has_schema, crawl_ai_bots_blocked, crawl_ai_edge_blocked, crawl_page_count, crawl_response_ms |
dim_website (attributes) |
GBP_INFORMATION/serp_gbp_snapshots/serp_archive_extracts_*/v_gmb_locations |
dim_place (COALESCE w/ source priority + observed_at) |
GBP_REVIEWS / Pleper |
dim_place (reviews/rating/attrs) |
2_GMB_Business_Metrics/v_gmb_* monthly |
fact_gmb_monthly |
dim_listings_with_subscriber_id.urls_urls/urls_types/gl_crawlDomain |
listing GP website → xref + kept as GP artifact |
The 530/NXDOMAIN dead-domain work plugs in: the verdict is one dim_website.dns_status value that every listing on that domain inherits (no per-listing loop), and website_action_queue flags each for the right publishing-platform action — never a delete.
9. Open decisions
- Inbound-lead trade-off (§5) — auto-create scope / holdout — blocks mass NOF creation.
- Publish semantics — does "auto-publish after set time" mean index on goldenpages, and does it ever stay noindex?
- Identity for multi-location — one subscriber_id with many listing_ids, or many subscribers? (drives xref + CRM rollup.)
- dim_website key — registered domain vs
crawl_roothost (www vs apex). Recommend registered domain, normalise variants. - Facts — reuse
SUPERSET_MASTER(2_GMB_Business_Metrics etc.) via view, or copy into fcr_operations. Recommend reuse. - Ads-transparency join key (§11) — advertiser identity → listing, via landing domain (Google) / Facebook Page (Meta). Needs fuzzy identity resolution.
10. Suggested build order
- dim_website first (clear win): migrate crawl-state off PROSPECT_LISTINGS, wire DoH reachability, repoint crawler to host grain — feeds the dead-domain flag +
website_action_queue. - dim_place with source-priority + provenance (formalise
prune-candidates.sql's COALESCE). - Spine + xref, backfill from PROSPECT_LISTINGS; mint subscriber_id/listing_id for converted NOFs.
- Lifecycle status + NOF→listing funnel (incl. auto-create vs self-create source flag for the §5 measurement).
- Vector index off the spine.
11. Future data sources — ads transparency (Google + Meta)
To be added later: Google Ads Transparency Center + Meta Ads Library / Transparency Center. These reveal whether a prospect is already advertising and what they run.
- Signal value (high for sales):
- Already advertising = has budget + is marketing-active → warm for FCR paid services, or a competitor-displacement target.
- Not advertising while category peers are → a clean gap pitch.
- Ad creatives + landing pages + messaging feed the outreach vector (what they push, where they send traffic).
- Lands as
dim_ad_presence(attributes per listing) + optionally afact_ad_activitytime-series (ads run over date ranges, active/paused). Provenance +observed_atlike every other dim. - The hard part is the join key (open decision): the transparency centres key on advertiser identity, not place_id —
- Google Ads Transparency → verified advertiser, usually domain-anchored → join via landing/display domain →
dim_website→ listing. - Meta Ads Library → Facebook Page identity (has an official API) → join via the Page URL we already capture in
urls_urls(FACEBOOK type), or page-name match. - Both need an identity-resolution step + confidence, same pattern as place_id resolution; budget for fuzzy matching, not a clean key.
- Google Ads Transparency → verified advertiser, usually domain-anchored → join via landing/display domain →
- Access/ToS note: Meta Ad Library has an official API; Google's Transparency Center is more scrape-bound — confirm sanctioned access before building, and treat creatives as public-but-attributed data.
12. Business summary generation (embedding core + listing content + outreach)
A generated business summary / long-description per prospect is the semantic
core of the vector index (drives look-alike retrieval), AND a reusable asset:
it doubles as the NOF listing's long-description (a published noindex listing
needs body content) and the outreach personalisation raw material. Generate
once, reuse three ways. Stored on the spine: PROSPECT_LISTINGS.business_summary
(+ business_summary_at, business_summary_source) — added 2026-05-31. The
reindex handler already leads its embedding doc with it (falls back to structured
signals until generated).
Inputs (ALL tool-fetched — verified-only, never invented):
GBP_INFORMATION.description— the business's own GBP blurb (gold). Sparse now (~6,729) — grows with Pleper.GBP_SERVICES(service_name/service_description/service_price) — what they offer.PROSPECT_LISTINGS.crawl_raw(homepage title/meta/text) — site content; ~56,688 have usable text (crawl_word_count>50).primary_category/gp_category,city/county,review_count/rating.
Prioritisation tiers (of ~210K universe):
- Tier 1 — LLM summary (~58,510 = has GBP description OR crawl text): the addressable/outreachable set. Haiku-class model.
- Tier 2 — templated, no LLM (~150K thin): deterministic "{name} is a {category} in {town}, {county}." — free, no fabrication risk.
- Defer dead-domain/junk.
Model + guardrail: Haiku-class. Prompt: "Summarise THIS business in ~80–150 words, third-person, factual, from the PROVIDED GBP/website/services text ONLY. Do NOT invent services, claims, awards, or competitors. If inputs are thin, output only what's verifiable." Enforces the CA-Recruitment no-fabrication rule — same standard required to publish the NOF listing. Record business_summary_source (which inputs were present) for auditability.
Cost: Tier-1 ~58.5K × ~1.2K tokens ≈ ~$90 on Haiku (one-off; regenerate only on input change). Tier-2 free.
Pipeline: a batch handler (dashboard-prospect-summary-generate) — pull N prospects WHERE business_summary IS NULL AND has inputs, assemble GBP desc + services + crawl text, call model, write back; paginated/driver like the crawl + reindex drivers. Idempotent; regenerate when crawl/GBP refreshes (deal-brief-style staleness). Feeds embedding (re-embed incrementally) + listing content + outreach.
Related: docs/hubspot-data-architecture.md, docs/bigquery-and-sync.md, docs/prospect-listings-composition-2026-05-28.md, scripts/crawl-tail/.
FCR Dashboard documentation · generated from docs/ · keep counts verified, not guessed.