datapointai/vibe-landing-page-arena

Name: datapointai/vibe-landing-page-arena
Creator: datapointai
Published: 2026-03-27 17:44:44
License: none listed (the dataset card specifies CC-BY-4.0)

Source: Hugging Face · Updated: 2026-03-27 · Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/datapointai/vibe-landing-page-arena


Description:

---
license: cc-by-4.0
task_categories:
- image-classification
- visual-question-answering
language:
- en
tags:
- human-preference
- design
- vibe-coding
- pairwise-comparison
- bradley-terry
- web-design
- ai-code-generation
- landing-pages
pretty_name: "Vibe Landing Page Arena"
size_categories:
- 1K<n<10K
---

<img src="https://huggingface.co/datasets/datapointai/vibe-landing-page-arena/resolve/main/datapointlogo.png" alt="Datapoint AI" width="400">

# Vibe Landing Page Arena

A large-scale human preference dataset for evaluating AI-generated landing page design quality. It contains 36,000 pairwise judgments from 3,492 annotators comparing landing pages generated by Claude Code, Cursor, Lovable, and Replit across 100 prompts and 4 design dimensions.

## Overview

| Metric | Value |
|--------|-------|
| Total judgments | 36,000 |
| Unique annotators | 3,492 |
| Prompts | 100 |
| Business categories | 97 |
| Design tones | 82 |
| Tools compared | 4 (Claude Code, Cursor, Lovable, Replit) |
| Evaluation dimensions | 4 (aesthetic, typography, layout, completeness) |
| Judgments per matchup per dimension | 15 |
| Tool pairs per prompt | 6 (all C(4,2) combinations) |

## How the data was collected

1. **100 detailed prompts** were written, each specifying a business name, brand description, page sections (hero, features, pricing, testimonials, etc.), color palette, typography, and design tone.
2. Each prompt was sent to **4 AI code generation tools**: Claude Code (Sonnet 4.6), Cursor (Sonnet 4.6), Lovable, and Replit. Each tool generated a single-file HTML landing page.
3. Full-page **screenshots** were captured with Playwright at a 1440x900 viewport.
4. All 6 possible tool pairs per prompt were served as **pairwise image comparisons** on the [Datapoint](https://trydatapoint.com) annotation platform.
5. For each comparison, annotators evaluated **4 dimensions independently**: aesthetic appeal, typography, layout, and completeness.
6. **Display order was randomized** per serving to eliminate left/right position bias.
7. Each matchup received **15 independent judgments per dimension**.

## Dataset Structure

### `comparisons` (2,400 rows)

Each row is one aggregated comparison: one tool pair, one dimension, with screenshots, prompt text, and vote counts from 15 annotators. The row count follows from 100 prompts x 6 tool pairs x 4 dimensions = 2,400, and 2,400 rows x 15 judgments = 36,000 total judgments.

| Column | Type | Description |
|--------|------|-------------|
| `image_a` | image | Full-page screenshot of tool_a's generated landing page |
| `image_b` | image | Full-page screenshot of tool_b's generated landing page |
| `tool_a` | string | First tool in the pair |
| `tool_b` | string | Second tool in the pair |
| `prompt_id` | int | Prompt ID (1-100) |
| `brand` | string | Business name from the prompt |
| `category` | string | Business category (e.g., "SaaS", "fintech", "restaurant") |
| `tone` | string | Design tone (e.g., "minimalist", "bold", "luxury") |
| `prompt` | string | Full prompt text used to generate the landing page |
| `dimension` | string | Evaluation dimension (see questions below) |
| `dimension_question` | string | The exact question annotators answered |
| `votes_a` | int | Number of annotators who preferred tool_a (out of 15) |
| `votes_b` | int | Number of annotators who preferred tool_b (out of 15) |
| `winner` | string | "A" (tool_a majority), "B" (tool_b majority), or "tie" |

### Evaluation Dimensions

Each comparison was evaluated on 4 independent dimensions. Annotators answered one question per dimension:

| Dimension | Question |
|-----------|----------|
| **aesthetic** | "Which design looks better at first glance?" |
| **typography** | "Which has better font choices, sizing, and readability?" |
| **layout** | "Which has better spacing, alignment, and visual flow?" |
| **completeness** | "Which has more fully-built sections with no empty or broken areas?" |

### `prompts` (100 rows)

| Column | Type | Description |
|--------|------|-------------|
| `id` | int | Prompt ID (1-100) |
| `category` | string | Business category |
| `tone` | string | Design tone |
| `prompt` | string | Full prompt text |

### `screenshots` (400 images)

Full-page screenshots of all generated landing pages (100 prompts x 4 tools), captured at a 1440x900 viewport.

## Key Findings

### Overall Rankings (Bradley-Terry)

| Rank | Tool | Strength | 95% CI |
|------|------|----------|--------|
| 1 | Cursor | 0.271 | 0.265 - 0.277 |
| 2 | Claude | 0.269 | 0.263 - 0.274 |
| 3 | Lovable | 0.262 | 0.256 - 0.267 |
| 4 | Replit | 0.199 | 0.194 - 0.204 |

The top 3 tools are **statistically indistinguishable** (Cursor vs Claude: p = 1.0; Claude vs Lovable: p = 0.14). Replit is significantly behind (p < 0.001).

### Dimension Specialization

No single tool wins every dimension:

| Dimension | #1 | #2 | #3 | #4 |
|-----------|----|----|----|----|
| Aesthetic | Lovable | Cursor | Claude | Replit |
| Typography | Cursor | Claude | Lovable | Replit |
| Layout | Lovable | Claude | Cursor | Replit |
| Completeness | Claude | Cursor | Lovable | Replit |

### Category Specialization

- **Lovable** ranks #1 in 35/97 categories (consumer brands, lifestyle, ecommerce)
- **Claude** ranks #1 in 32/97 categories (professional services, enterprise, fintech)
- **Cursor** ranks #1 in 17/97 categories (SaaS, tech, agency)
- **Replit** ranks #1 in 13/97 categories (developer tools, compliance)

## Usage

```python
from datasets import load_dataset

# Load pairwise comparison judgments
comparisons = load_dataset("datapointai/vibe-landing-page-arena", "comparisons")

# Load prompts
prompts = load_dataset("datapointai/vibe-landing-page-arena", "prompts")

# Load screenshots
screenshots = load_dataset("datapointai/vibe-landing-page-arena", "screenshots")
```

### Reproduce the Bradley-Terry analysis

```python
import numpy as np
from scipy.optimize import minimize

# Count wins per ordered tool pair: wins[(x, y)] = votes for x over y.
# The `comparisons` schema above has no per-judgment `choice` column,
# so this sums the aggregated `votes_a` / `votes_b` counts instead.
wins = {}
for row in comparisons["train"]:
    a, b = row["tool_a"], row["tool_b"]
    wins[(a, b)] = wins.get((a, b), 0) + row["votes_a"]
    wins[(b, a)] = wins.get((b, a), 0) + row["votes_b"]

# Fit Bradley-Terry model (tool names are read from the data)
tools = sorted({t for pair in wins for t in pair})
idx = {t: i for i, t in enumerate(tools)}

def neg_log_likelihood(params):
    nll = 0.0
    for (a, b), count in wins.items():
        # P(a beats b) given log-strengths `params`
        p = 1.0 / (1.0 + np.exp(params[idx[b]] - params[idx[a]]))
        nll -= count * np.log(max(p, 1e-10))
    return nll

# Pin the first log-strength to 0 so the model is identifiable
n = len(tools)
result = minimize(neg_log_likelihood, np.zeros(n), method="L-BFGS-B",
                  bounds=[(0, 0)] + [(None, None)] * (n - 1))

# Normalize strengths so they sum to 1
strengths = np.exp(result.x) / np.exp(result.x).sum()
for tool, s in sorted(zip(tools, strengths), key=lambda x: -x[1]):
    print(f"{tool}: {s:.4f}")
```

## Methodology

- **Ranking model:** Bradley-Terry with 1,000 bootstrap iterations for 95% confidence intervals
- **Significance testing:** Likelihood ratio tests between adjacent-ranked tools
- **Position bias:** Verified negligible via BT model with position parameter (delta = -0.03, CI crosses zero). Display order randomized per serving.
- **Annotator quality:** Platform uses calibration tasks with known gold-standard answers to compute annotator trust scores. 60% of calibrated annotators achieved perfect trust scores (1.0).
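
The Methodology above mentions 1,000 bootstrap iterations for the 95% confidence intervals but does not say what was resampled. The following is a minimal sketch under stated assumptions, not the authors' procedure: it resamples judgments with replacement within each tool pair (equivalent to one binomial draw per pair), refits Bradley-Terry with a standard MM iteration, and takes percentile intervals. The function names `fit_bradley_terry` and `bootstrap_ci`, the tool labels, and the win counts are all hypothetical and are not taken from the dataset.

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=200, tol=1e-10):
    """Fit Bradley-Terry strengths with the MM (Zermelo) iteration.

    wins[i, j] = number of judgments preferring item i over item j.
    Returns strengths normalized to sum to 1.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = wins + wins.T            # total comparisons per pair
    total_wins = wins.sum(axis=1)    # total wins per item
    p = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        new_p = total_wins / denom
        new_p /= new_p.sum()
        converged = np.abs(new_p - p).max() < tol
        p = new_p
        if converged:
            break
    return p

def bootstrap_ci(wins, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for Bradley-Terry strengths.

    Assumption: judgments are resampled within each pair, and every
    pair has at least one comparison.
    """
    rng = np.random.default_rng(seed)
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = (wins + wins.T).astype(int)
    iu, ju = np.triu_indices(n, k=1)
    rate = wins[iu, ju] / games[iu, ju]   # observed win rate of i over j
    samples = np.empty((n_boot, n))
    for b in range(n_boot):
        w = rng.binomial(games[iu, ju], rate)
        resampled = np.zeros((n, n))
        resampled[iu, ju] = w
        resampled[ju, iu] = games[iu, ju] - w
        samples[b] = fit_bradley_terry(resampled)
    lo, hi = np.percentile(
        samples, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
    return lo, hi

# Hypothetical win counts for four placeholder tools (made-up numbers).
labels = ["tool_1", "tool_2", "tool_3", "tool_4"]
example_wins = np.array([
    [0,    3000, 3100, 3500],
    [3000, 0,    3050, 3450],
    [2900, 2950, 0,    3400],
    [2500, 2550, 2600, 0],
])

point = fit_bradley_terry(example_wins)
lo, hi = bootstrap_ci(example_wins, n_boot=1000)
for name, s, l, h in zip(labels, point, lo, hi):
    print(f"{name}: {s:.3f} [{l:.3f}, {h:.3f}]")
```

The interval width depends on the resampling unit. Resampling whole prompts instead of individual judgments is another reasonable choice and may give wider intervals if judgments on the same prompt are correlated.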
## Comparison to Related Work

| | This dataset | [Vibe Design Arena v1](https://huggingface.co/datasets/datapointai/vibe-design-arena) | Verita AI Study |
|---|---|---|---|
| Prompts | 100 (controlled) | 60 (real-world apps) | 80 (controlled) |
| Tools | 4 | 6 | 4 |
| Dimensions | 4 | 1 | 4 |
| Total judgments | 36,000 | ~53,000 | 1,260 |
| Annotators | 3,492 | unknown | 5 |
| Judgments per matchup | 15 per dimension | 30 | ~3 |
| Position randomization | Yes | Yes | Not reported |
| Statistical model | Bradley-Terry + bootstrap CI | Win rate | Bradley-Terry |

## License

CC-BY-4.0

## Citation

```bibtex
@dataset{vibe_landing_page_arena_2026,
  title={Vibe Landing Page Arena: Human Preference Evaluation of AI-Generated Landing Page Design},
  author={Datapoint AI},
  year={2026},
  url={https://huggingface.co/datasets/datapointai/vibe-landing-page-arena},
  note={36,000 pairwise judgments across 4 tools, 100 prompts, and 4 design dimensions}
}
```

## Contact

Built by [Datapoint AI](https://trydatapoint.com). Questions or feedback: sales@trydatapoint.com

Provider:

datapointai
