PROSPECT_LISTINGS — composition, provenance & resolve-queue state (2026-05-28)
Snapshot audit of what's actually in listingmanager-1529856313699.fcr_operations.PROSPECT_LISTINGS
(108,071 rows) — how listings got there, how they reconcile to the master book, what
GBP/crawl data we hold, and what SerpAPI resolution work is sitting un-promoted.
Numbers are a point-in-time snapshot (2026-05-28); re-run the queries in §7 to refresh.
Update 2026-06-01 — three figures below have moved materially; corrections:
- NOFs (net-new, no
subscriber_id): now 42,555 total / 29,916 in-universe (was 12,242). The SERP-archive harvest (serp_archive_harvest, 30,313 rows) landed after this snapshot, roughly tripling the net-new/NOF layer.- GBP data coverage: now 84,841 place_id rows carry
has_serp_data=TRUE(review_count 86,114, rating 82,567) — effectively all listings and place_id NOFs that have been enriched, NOT the 15,360 in §1/§5. SerpAPI + Pleper enrichment ran broadly after 2026-05-28; the GBP basics live on the spine (review_count/rating/has_serp_data), not mainly inGBP_INFORMATION(~13.5K, the description sub-table only). The ~33K real-place_id rows still without it are recent harvest held behind theenrich_enabledgate.- The crawl is complete across all live sites, with HTTP response breakouts confirmed: of ~80,270 unique domains, 78% cleanly crawled; failures split 5xx ≈ 8,675 / 4xx ≈ 5,976 / timeout ≈ 2,352 / robots ≈ 573 unique domains. 530/NXDOMAIN dead domains were DoH-confirmed (3,775) and routed to
website_action_queue; where a live site returned 403, Ahrefs batch-analysis extracted SEO insights intodim_seo(10,570 domains). Full as-built detail: Prospect Enrichment & Opportunity Index.§1 below is refreshed to 2026-06-01; §2–§7 remain the original 2026-05-28 snapshot (reconciliation arithmetic, provenance, resolve-queue) — re-run the §7 queries to refresh those.
1. Population matrix (file-status × place_id × web)
Distinct (refreshed 2026-06-01): 152,091 subscribers · 118,606 Google place_ids · 104,226 LST: synthetic · 80,270 web domains.
| Cohort | place_id / web | Listings | has GBP data | crawled | site live |
|---|---|---|---|---|---|
| Active client (paying) | google place_id + web | 3,122 | 2,961 | 3,116 | 2,834 |
| google place_id + no web | 138 | 112 | — | — | |
| On-file, not active (activation pool) | google place_id + web | 51,723 | 41,436 | 50,791 | 41,175 |
| google place_id + no web | 21,069 | 10,218 | — | — | |
LST: synthetic + web |
29,785 | 0 | 29,694 | 22,278 | |
LST: synthetic + no web |
74,441 | 0 | — | — | |
| NOF (no subscriber_id, net-new) | google place_id + web | 30,781 | 20,926 | 29,013 | 24,271 |
| google place_id + no web | 11,774 | 9,188 | — | — |
On-file = 180,278 · NOF = 42,555 · active = 3,260 · total = 222,833. "has GBP data" = has_serp_data on the spine (LST: synthetic rows have no real place_id → never GBP-enrichable); "no web" rows are never crawled. Crawl is complete across all live sites; HTTP-response and 403→Ahrefs detail is in Prospect Enrichment & Opportunity Index.
2. Identity keys
- Primary key =
place_id. Real Google place_ids areChIJ…(88,285 of 88,293). LST:synthetic pid ="LST:" || dim.listing_id(e.g.LST:IE_32671122_468295_13141). Minted for on-file listings with a website but no resolved Google place_id — crawlable (crawl keys oncrawl_root), but never GBP-enrichable (no real place_id →has_serp_data=FALSEfor all).- PROSPECT_LISTINGS has no
listing_idcolumn — the only listing_id linkage is via theLST:prefix. Maps-resolved rows link to the book only through the place_id embedded indim.urls_urls.
3. Master book ↔ prospect
Master = SUPERSET_MASTER.dim_listings_with_subscriber_id (213,835 rows / 180,343 distinct listing_id).
The Google place_id lives inside urls_urls (pipe-delimited |, aligned to urls_types; the GOOGLEMAPS value is the ChIJ… place_id — there is no place_id column).
| Listings | |
|---|---|
| Distinct listings in master | 180,343 |
| → in prospect via Google place_id | 62,452 |
→ in prospect via LST: listing_id |
19,777 |
| In prospect | 82,229 |
| Not in prospect | 98,114 (54%) |
- Book
ChIJplace_ids: 57,524 distinct; 57,510 are in prospect, 14 are not (the "off-by-14"). - The 98,114 not-in-prospect are overwhelmingly listings with neither a usable website nor a Google place_id — nothing to crawl or enrich, so never ingested.
4. Provenance of prospect's 88,284 Google place_ids (reconciliation)
88,284 = 57,510 book-sourced (in dim.urls_urls; book total 57,524 − 14 not-in-prospect)
+ 19,805 resolve-promoted (distinct ChIJ, prospect_resolve_queue.promoted=TRUE)
− 1,167 overlap (in BOTH book and promoted — don't double-count)
+ 12,136 net-new harvest (SayMore SERP + competitor enrich — neither book nor resolve)
| Provenance | place_ids | with subscriber | NOF (no sub) |
|---|---|---|---|
| Book only | 56,343 | 56,226 | 117 |
| Promoted only | 18,638 | 18,638 | 0 |
| Book + promoted (overlap) | 1,167 | 1,167 | 0 |
| Net-new harvest | 12,136 | 12 | 12,124 |
The 12,136 net-new are ~99.9% NOF (no subscriber). So "subscribers (on-file) with a place_id" ≈ book + resolve-promoted − overlap; the net-new harvest is a separate NOF layer. prospect is NOT a strict subset of the book — it's the book's web/maps-resolvable slice plus the net-new harvest.
5. GBP / SerpAPI / Pleper (where we have a place_id)
place_id ≠ GBP data. Only 15,360 of ~88k place_id rows carry has_serp_data=TRUE — the rest have the place_id (the key) but enrichment was never run (enrich_enabled cost gate). What we have is dated and currently all <90 days old (SerpAPI rows span 2026-03-22 → 05-28). 4,688 Pleper errors logged (pleper_error/pleper_error_at).
Table (fcr_operations) |
Source | Date column | Key | Holds |
|---|---|---|---|---|
serp_gbp_snapshots |
SerpAPI | snapshot_date, synced_at |
place_id | place details, rating, review/photo/post counts, hours/desc flags, raw_json |
GBP_REVIEWS |
SerpAPI | review_date, crawled_at |
place_id | individual reviews (≤20/pid), owner responses |
GBP_SERVICES |
Pleper | scraped_date |
place_id | services list (name/desc/price/category) + photo/video counts |
Split: SerpAPI = place details + reviews + hours flag (NOT a services list). Pleper = the services list + authoritative photo/video counts. PROSPECT_LISTINGS.last_updated = GBP refresh stamp (last writer of SerpAPI/Pleper paths wins).
6. Website crawls
One engine (crawlWebsite(), worker/src/handlers/crawl-website.js), two passes + one external audit:
- (a) Shallow — LIVE. 34 typed
crawl_*cols + 8KBcrawl_rawJSON (technical/features/technologies/structuredData/sitemap/robots/aiReadiness/social/contact). Includes AI-block cols (crawl_ai_bots_blocked/_edge_blocked/_live_blocked) andcrawl_slow_attempts. 65,308 crawled (2026-05-23→28). - (b) Deep — DISABLED since 2026-05-24 (tripped bot-detection). Adds
crawl_deep_at,crawl_service_page_count,crawl_location_page_count. 8,973 rows got it before shutoff. - (c) InSites / Digital Footprint — external (n8n). Off-page (SEO/backlinks/rank/GBP-completeness/social/ads). Not persisted to BQ — full report lives on InSites, KV-cached 1hr; only audit metadata in
fcr_operations.USG_INSITES_AUDIT_INVENTORY.
JS / tag-manager gap: the crawler is fetch-only — never executes JS, so tags injected at runtime via Google Tag Manager are missed (crawl_has_analytics/_google_ads_tag/_fb_pixel under-report). Fix is in-house: env.BROWSER (Cloudflare Browser Rendering / @cloudflare/puppeteer) already exists and is proven in serp-screenshot.js — just not wired to the crawler.
7. prospect_resolve_queue — SerpAPI resolution state (101,114 rows)
Working table for resolving on-file listings (keyed by listing_id/subscriber_id) to a Google place_id. A row only reaches PROSPECT_LISTINGS when promoted = TRUE. Key cols: cand_place_id, decision, promoted, status, match signals (match_website/match_phone/match_name_score/match_distance_m/match_category).
| decision / status | promoted | rows | candidate pid not in prospect | note |
|---|---|---|---|---|
promote / resolved |
TRUE | 19,353 | — | upserted ✓ |
promote_loose / resolved |
TRUE | 1,245 | — | upserted ✓ |
promote / resolved |
FALSE | 1,281 | 568 distinct | clean missed upsert — high conf (1,127 website matches, name 0.66); promote step never ran |
null / resolved |
FALSE | 30,418 | 16,485 distinct | have a candidate but no decision applied; low conf (0 website matches, name 0.34) — need classifier, not blind upsert |
added_no_place_id / resolved |
FALSE | 19,704 | 11,543 distinct | added as LST:; candidate not attached. 7,363 had a website match → re-review |
null / pending_phase2_nocat |
FALSE | 28,785 | 0 (no candidate yet) | never SerpAPI-resolved — carved out for having NO primary_category (see 08-carve-no-category.cjs) |
null / pending_clientfix |
FALSE | 136 | 0 | needs client data fix |
| no_match / error | FALSE | 188 | — |
Action items:
- Promote the 568 (
promote/promoted=FALSE) — already-decided confident matches, just never written. Run03-promote.cjs. - 28,785
pending_phase2_nocat— the next SerpAPI run. They lackprimary_category, which the category-class match gate needs → must promote the first category from the listing's category list intoprimary_categorybefore/at resolve time. Multi-parallel SerpAPI run. - 30,418 undecided + 7,363 website-matched LST adds — re-run classifier / re-review before promoting.
Toolchain
scripts/prospect-resolve/: 01-create-and-seed-queue → 02-resolve-batch (SerpAPI) → _classify / 09-reclassify (CATEGORY_MAPPING-driven) → 03-promote (upsert to PROSPECT_LISTINGS). 04-add-no-placeid-prospects (LST adds), 06-seed-phase2-nap, 08-carve-no-category, 13-export-pending-phase2-nocat.
FCR Dashboard documentation · generated from docs/ · keep counts verified, not guessed.