AI Visibility Index — Phase 2 Dataset: 2,729 Businesses Across 14 Verticals and 4 Metropolitan Markets, with Full Off-Page Signal Decomposition
Source: DataCite Commons. Updated 2026-05-06; indexed 2026-05-07.
Download link:
https://zenodo.org/doi/10.5281/zenodo.20048941
Description:
Phase 2 dataset for "Business Visibility in Generative AI Search" (paper DOI 10.5281/zenodo.20048424). Expands the Phase 1 audit (n=1,000, dataset DOI 10.5281/zenodo.20048614) to n=2,729 businesses across 14 verticals and 4 metropolitan markets (Los Angeles, New York City, Chicago, Sydney), and adds full off-page enrichment.
Contents:
- phase2-anonymized-dataset.csv — 2,729 rows × 74 columns. Per-business AI-visibility outcomes across ChatGPT, Anthropic Claude, Google Gemini, Perplexity, and Google AI Overviews (mention + recommendation flags per model, composite visibility score, share-of-model, prominence) plus on-page signals (schema markup, FAQ content, comparison content, indexed pages, blog count, llms.txt presence and length, citability composite), reputation signals (Domain Authority, Google review count and rating), off-page enrichment (SpyFu organic clicks/value/keywords/growth, Reddit and Quora mention counts, directory presence flags for Wikipedia / LinkedIn / Crunchbase / GBP / BBB / Yelp / Trustpilot / G2 / Capterra, YouTube channel and mention counts, review-platform breadth, off-page composite), partial Moz backlinks, and a 12-month press-coverage rollup. Several signal layers (on-page, llms.txt, Moz DA, SpyFu) were enriched on a sub-sample rather than the full n=2,729; per-column null rates and the analytic-n caveat are documented in detail in README.md so reviewers can compute statistics on the correct denominator. Identifiers anonymized as BIZ_P2_xxxxx (zero-padded; namespaced separately from Phase 1 BIZ_xxxx so the two datasets cannot collide). Brand names, URLs, addresses, phone numbers, and place IDs are dropped.
- README.md — complete column-by-column codebook, data-quality flags including per-cell sample sizes, missingness notes per signal source, and reproduction instructions for the headline results.
- extract-phase2-zenodo.ts — the extraction script. Reads Supabase credentials from environment variables only and never embeds secrets, so the extraction is reproducible from the source database without leaking access tokens.
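Because several signal layers cover only a sub-sample, a statistic on one of those columns should use the non-null count as its denominator rather than 2,729. A minimal sketch of that check, using only the Python standard library, might look like the following. The column names (`business_id`, `visibility_score`, `domain_authority`) and the inline sample rows are hypothetical stand-ins; the real names and null encodings are in README.md. The sketch also assumes that missing values appear as empty strings and that the `xxxxx` suffix means five digits.

```python
import csv
import io
import re

# Hypothetical sample standing in for phase2-anonymized-dataset.csv.
# Column names are assumptions; consult README.md for the real codebook.
SAMPLE = """business_id,visibility_score,domain_authority
BIZ_P2_00001,42,35
BIZ_P2_00002,0,
BIZ_P2_00003,17,51
BIZ_P2_00004,63,
"""

# Assumed ID shape: "BIZ_P2_" followed by five zero-padded digits.
# A Phase 1 style ID such as "BIZ_0001" does not match this pattern.
PHASE2_ID = re.compile(r"BIZ_P2_\d{5}")


def column_coverage(rows, column):
    """Return (analytic_n, null_rate) for one column, treating '' as missing."""
    total = len(rows)
    present = sum(1 for row in rows if row[column].strip() != "")
    null_rate = (total - present) / total if total else 0.0
    return present, null_rate


def mean_over_present(rows, column):
    """Mean of a numeric column, using only non-missing cells as the denominator."""
    values = [float(row[column]) for row in rows if row[column].strip() != ""]
    return sum(values) / len(values) if values else None


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
analytic_n, null_rate = column_coverage(rows, "domain_authority")
da_mean = mean_over_present(rows, "domain_authority")
all_ids_valid = all(PHASE2_ID.fullmatch(row["business_id"]) for row in rows)

print(analytic_n, null_rate, da_mean, all_ids_valid)
```

To run this against the real file, replace `io.StringIO(SAMPLE)` with an opened handle on the downloaded CSV and substitute the column names from the codebook.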
Each business was queried across five AI systems using a curated set of intent prompts per industry; mentions were extracted via canonical-name string matching with a Claude second-pass classification, and aggregated into a 0–100 visibility score with per-model mention and recommendation flags. The 14-vertical × 4-market design was filled by sourcing per-cell business lists with Perplexity sonar-pro and capping each cell at n=100; some local cells fell short of the n=100 target and are reported as-is rather than padded (full per-cell counts are listed in README.md).
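The first-pass matching step can be illustrated with a short sketch. This is not the author's extraction code: the normalization rules, the function names, and the example business name are all assumptions, and the Claude second-pass classification is omitted entirely. It only shows how a normalized whole-phrase match can serve as a first filter before a classifier decides whether a mention is also a recommendation.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents, and collapse punctuation runs to single spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.sub(r"[^a-z0-9]+", " ", without_accents.lower()).strip()


def mentions(canonical_name, response_text):
    """True if the normalized name occurs as a whole phrase in the response."""
    name = normalize(canonical_name)
    if not name:
        return False
    # Pad with spaces so "luna" does not match inside "lunar".
    haystack = f" {normalize(response_text)} "
    return f" {name} " in haystack


# Hypothetical business name and model response, for illustration only.
print(mentions("Café Luna", "We recommend Cafe Luna, on Main St."))
print(mentions("Luna", "The lunar eclipse is tonight."))
```

Padding both strings with spaces keeps the match at word boundaries, which avoids counting a short name that merely appears inside a longer word.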
Released under CC-BY-4.0. The author is the founder of MentionLayer, the generative engine optimization platform whose underlying methodology motivated this work; data and analysis outputs are released openly to enable independent verification.
Provider:
Zenodo
Created:
2026-05-06



