PROSPECT_LISTINGS — composition, provenance & resolve-queue state (2026-05-28)

Snapshot audit of what's actually in listingmanager-1529856313699.fcr_operations.PROSPECT_LISTINGS (108,071 rows) — how listings got there, how they reconcile to the master book, what GBP/crawl data we hold, and what SerpAPI resolution work is sitting un-promoted.

Numbers are a point-in-time snapshot (2026-05-28); re-run the queries in §7 to refresh.

Update 2026-06-01 — three figures below have moved materially; corrections:

NOFs (net-new, no subscriber_id): now 42,555 total / 29,916 in-universe (was 12,242). The SERP-archive harvest (serp_archive_harvest, 30,313 rows) landed after this snapshot, roughly tripling the net-new/NOF layer.

GBP data coverage: now 84,841 place_id rows carry has_serp_data=TRUE (review_count 86,114, rating 82,567) — effectively all listings and place_id NOFs that have been enriched, NOT the 15,360 in §1/§5. SerpAPI + Pleper enrichment ran broadly after 2026-05-28; the GBP basics live on the spine (review_count/rating/has_serp_data), not mainly in GBP_INFORMATION (~13.5K, the description sub-table only). The ~33K real-place_id rows still without it are recent harvest held behind the enrich_enabled gate.

The crawl is complete across all live sites, with HTTP response breakouts confirmed: of ~80,270 unique domains, 78% cleanly crawled; failures split 5xx ≈ 8,675 / 4xx ≈ 5,976 / timeout ≈ 2,352 / robots ≈ 573 unique domains. 530/NXDOMAIN dead domains were DoH-confirmed (3,775) and routed to website_action_queue; where a live site returned 403, Ahrefs batch-analysis extracted SEO insights into dim_seo (10,570 domains). Full as-built detail: Prospect Enrichment & Opportunity Index.

§1 below is refreshed to 2026-06-01; §2–§7 remain the original 2026-05-28 snapshot (reconciliation arithmetic, provenance, resolve-queue) — re-run the §7 queries to refresh those.

1. Population matrix (file-status × place_id × web)

Distinct (refreshed 2026-06-01): 152,091 subscribers · 118,606 Google place_ids · 104,226 LST: synthetic · 80,270 web domains.

Cohort	place_id / web	Listings	has GBP data	crawled	site live
Active client (paying)	google place_id + web	3,122	2,961	3,116	2,834
	google place_id + no web	138	112	—	—
On-file, not active (activation pool)	google place_id + web	51,723	41,436	50,791	41,175
	google place_id + no web	21,069	10,218	—	—
	`LST:` synthetic + web	29,785	0	29,694	22,278
	`LST:` synthetic + no web	74,441	0	—	—
NOF (no subscriber_id, net-new)	google place_id + web	30,781	20,926	29,013	24,271
	google place_id + no web	11,774	9,188	—	—

On-file = 180,278 · NOF = 42,555 · active = 3,260 · total = 222,833. "has GBP data" = has_serp_data on the spine (LST: synthetic rows have no real place_id → never GBP-enrichable); "no web" rows are never crawled. Crawl is complete across all live sites; HTTP-response and 403→Ahrefs detail is in Prospect Enrichment & Opportunity Index.

2. Identity keys

Primary key = place_id. Real Google place_ids are ChIJ… (88,285 of 88,293).
LST: synthetic pid = "LST:" || dim.listing_id (e.g. LST:IE_32671122_468295_13141). Minted for on-file listings with a website but no resolved Google place_id — crawlable (crawl keys on crawl_root), but never GBP-enrichable (no real place_id → has_serp_data=FALSE for all).
PROSPECT_LISTINGS has no listing_id column — the only listing_id linkage is via the LST: prefix. Maps-resolved rows link to the book only through the place_id embedded in dim.urls_urls.

3. Master book ↔ prospect

Master = SUPERSET_MASTER.dim_listings_with_subscriber_id (213,835 rows / 180,343 distinct listing_id). The Google place_id lives inside urls_urls (pipe-delimited |, aligned to urls_types; the GOOGLEMAPS value is the ChIJ… place_id — there is no place_id column).

	Listings
Distinct listings in master	180,343
→ in prospect via Google place_id	62,452
→ in prospect via `LST:` listing_id	19,777
In prospect	82,229
Not in prospect	98,114 (54%)

Book ChIJ place_ids: 57,524 distinct; 57,510 are in prospect, 14 are not (the "off-by-14").
The 98,114 not-in-prospect are overwhelmingly listings with neither a usable website nor a Google place_id — nothing to crawl or enrich, so never ingested.

4. Provenance of prospect's 88,284 Google place_ids (reconciliation)

88,284  =  57,510  book-sourced (in dim.urls_urls; book total 57,524 − 14 not-in-prospect)
        +  19,805  resolve-promoted (distinct ChIJ, prospect_resolve_queue.promoted=TRUE)
        −   1,167  overlap (in BOTH book and promoted — don't double-count)
        +  12,136  net-new harvest (SayMore SERP + competitor enrich — neither book nor resolve)

Provenance	place_ids	with subscriber	NOF (no sub)
Book only	56,343	56,226	117
Promoted only	18,638	18,638	0
Book + promoted (overlap)	1,167	1,167	0
Net-new harvest	12,136	12	12,124

The 12,136 net-new are ~99.9% NOF (no subscriber). So "subscribers (on-file) with a place_id" ≈ book + resolve-promoted − overlap; the net-new harvest is a separate NOF layer. prospect is NOT a strict subset of the book — it's the book's web/maps-resolvable slice plus the net-new harvest.

5. GBP / SerpAPI / Pleper (where we have a place_id)

place_id ≠ GBP data. Only 15,360 of ~88k place_id rows carry has_serp_data=TRUE — the rest have the place_id (the key) but enrichment was never run (enrich_enabled cost gate). What we have is dated and currently all <90 days old (SerpAPI rows span 2026-03-22 → 05-28). 4,688 Pleper errors logged (pleper_error/pleper_error_at).

Table (`fcr_operations`)	Source	Date column	Key	Holds
`serp_gbp_snapshots`	SerpAPI	`snapshot_date`, `synced_at`	place_id	place details, rating, review/photo/post counts, hours/desc flags, `raw_json`
`GBP_REVIEWS`	SerpAPI	`review_date`, `crawled_at`	place_id	individual reviews (≤20/pid), owner responses
`GBP_SERVICES`	Pleper	`scraped_date`	place_id	services list (name/desc/price/category) + photo/video counts

Split: SerpAPI = place details + reviews + hours flag (NOT a services list). Pleper = the services list + authoritative photo/video counts. PROSPECT_LISTINGS.last_updated = GBP refresh stamp (last writer of SerpAPI/Pleper paths wins).

6. Website crawls

One engine (crawlWebsite(), worker/src/handlers/crawl-website.js), two passes + one external audit:

(a) Shallow — LIVE. 34 typed crawl_* cols + 8KB crawl_raw JSON (technical/features/technologies/structuredData/sitemap/robots/aiReadiness/social/contact). Includes AI-block cols (crawl_ai_bots_blocked/_edge_blocked/_live_blocked) and crawl_slow_attempts. 65,308 crawled (2026-05-23→28).
(b) Deep — DISABLED since 2026-05-24 (tripped bot-detection). Adds crawl_deep_at, crawl_service_page_count, crawl_location_page_count. 8,973 rows got it before shutoff.
(c) InSites / Digital Footprint — external (n8n). Off-page (SEO/backlinks/rank/GBP-completeness/social/ads). Not persisted to BQ — full report lives on InSites, KV-cached 1hr; only audit metadata in fcr_operations.USG_INSITES_AUDIT_INVENTORY.

JS / tag-manager gap: the crawler is fetch-only — never executes JS, so tags injected at runtime via Google Tag Manager are missed (crawl_has_analytics/_google_ads_tag/_fb_pixel under-report). Fix is in-house: env.BROWSER (Cloudflare Browser Rendering / @cloudflare/puppeteer) already exists and is proven in serp-screenshot.js — just not wired to the crawler.

7. `prospect_resolve_queue` — SerpAPI resolution state (101,114 rows)

Working table for resolving on-file listings (keyed by listing_id/subscriber_id) to a Google place_id. A row only reaches PROSPECT_LISTINGS when promoted = TRUE. Key cols: cand_place_id, decision, promoted, status, match signals (match_website/match_phone/match_name_score/match_distance_m/match_category).

decision / status	promoted	rows	candidate pid not in prospect	note
`promote` / resolved	TRUE	19,353	—	upserted ✓
`promote_loose` / resolved	TRUE	1,245	—	upserted ✓
`promote` / resolved	FALSE	1,281	568 distinct	clean missed upsert — high conf (1,127 website matches, name 0.66); promote step never ran
`null` / resolved	FALSE	30,418	16,485 distinct	have a candidate but no decision applied; low conf (0 website matches, name 0.34) — need classifier, not blind upsert
`added_no_place_id` / resolved	FALSE	19,704	11,543 distinct	added as `LST:`; candidate not attached. 7,363 had a website match → re-review
`null` / `pending_phase2_nocat`	FALSE	28,785	0 (no candidate yet)	never SerpAPI-resolved — carved out for having NO primary_category (see `08-carve-no-category.cjs`)
`null` / `pending_clientfix`	FALSE	136	0	needs client data fix
no_match / error	FALSE	188	—

Action items:

Promote the 568 (promote/promoted=FALSE) — already-decided confident matches, just never written. Run 03-promote.cjs.
28,785 pending_phase2_nocat — the next SerpAPI run. They lack primary_category, which the category-class match gate needs → must promote the first category from the listing's category list into primary_category before/at resolve time. Multi-parallel SerpAPI run.
30,418 undecided + 7,363 website-matched LST adds — re-run classifier / re-review before promoting.

Toolchain

scripts/prospect-resolve/: 01-create-and-seed-queue → 02-resolve-batch (SerpAPI) → _classify / 09-reclassify (CATEGORY_MAPPING-driven) → 03-promote (upsert to PROSPECT_LISTINGS). 04-add-no-placeid-prospects (LST adds), 06-seed-phase2-nap, 08-carve-no-category, 13-export-pending-phase2-nocat.

FCR Dashboard documentation · generated from docs/ · keep counts verified, not guessed.