datapointai/vibe-landing-page-arena
Hugging Face · Updated 2026-03-27 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/datapointai/vibe-landing-page-arena
Resource description:
---
license: cc-by-4.0
task_categories:
- image-classification
- visual-question-answering
language:
- en
tags:
- human-preference
- design
- vibe-coding
- pairwise-comparison
- bradley-terry
- web-design
- ai-code-generation
- landing-pages
pretty_name: "Vibe Landing Page Arena"
size_categories:
- 1K<n<10K
---
<img src="https://huggingface.co/datasets/datapointai/vibe-landing-page-arena/resolve/main/datapointlogo.png" alt="Datapoint AI" width="400">
# Vibe Landing Page Arena
A large-scale human preference dataset for evaluating the design quality of AI-generated landing pages. It contains 36,000 pairwise judgments from 3,492 annotators, comparing landing pages generated by Claude Code, Cursor, Lovable, and Replit across 100 prompts and 4 design dimensions.
## Overview
| Metric | Value |
|--------|-------|
| Total judgments | 36,000 |
| Unique annotators | 3,492 |
| Prompts | 100 |
| Business categories | 97 |
| Design tones | 82 |
| Tools compared | 4 (Claude Code, Cursor, Lovable, Replit) |
| Evaluation dimensions | 4 (aesthetic, typography, layout, completeness) |
| Judgments per matchup per dimension | 15 |
| Tool pairs per prompt | 6 (all C(4,2) combinations) |
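
The totals in the table follow from the study design. A quick sanity check of the arithmetic, using only the counts stated above (the variable names are illustrative):

```python
from itertools import combinations

tools = ["Claude Code", "Cursor", "Lovable", "Replit"]
dimensions = ["aesthetic", "typography", "layout", "completeness"]
n_prompts = 100
judgments_per_matchup = 15

# All unordered tool pairs: C(4, 2) = 6
tool_pairs = list(combinations(tools, 2))

# One aggregated row per (prompt, tool pair, dimension)
n_rows = n_prompts * len(tool_pairs) * len(dimensions)

# Each aggregated row collects 15 independent judgments
n_judgments = n_rows * judgments_per_matchup

print(len(tool_pairs), n_rows, n_judgments)  # 6 2400 36000
```

The intermediate figure of 2,400 matches the row count of the `comparisons` table described below.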
## How the data was collected
1. **100 detailed prompts** were written, each specifying a business name, brand description, page sections (hero, features, pricing, testimonials, etc.), color palette, typography, and design tone.
2. Each prompt was sent to **4 AI code generation tools**: Claude Code (Sonnet 4.6), Cursor (Sonnet 4.6), Lovable, and Replit. Each tool generated a single-file HTML landing page.
3. Full-page **screenshots** were captured at a 1440x900 viewport using Playwright.
4. All 6 possible tool pairs per prompt were served as **pairwise image comparisons** on the [Datapoint](https://trydatapoint.com) annotation platform.
5. For each comparison, annotators evaluated **4 dimensions independently**: aesthetic appeal, typography, layout, and completeness.
6. **Display order was randomized** per serving to eliminate left/right position bias.
7. Each matchup received **15 independent judgments per dimension**.
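
Step 6 can be illustrated with a minimal sketch. The platform's actual implementation is not described in this card, so both functions are hypothetical: `serve_pair` flips a fair coin for left/right placement, and `record_choice` maps the annotator's on-screen pick back to the canonical `tool_a`/`tool_b` order.

```python
import random

def serve_pair(tool_a, tool_b, rng):
    """Return (left, right) with the order chosen by a fair coin flip."""
    if rng.random() < 0.5:
        return tool_a, tool_b
    return tool_b, tool_a

def record_choice(tool_a, left, right, picked_side):
    """Map an on-screen pick ("left" or "right") to "A" or "B"."""
    picked_tool = left if picked_side == "left" else right
    return "A" if picked_tool == tool_a else "B"

rng = random.Random(0)  # seeded only to make the sketch reproducible
left_counts = {"Cursor": 0, "Lovable": 0}
for _ in range(10_000):
    left, right = serve_pair("Cursor", "Lovable", rng)
    left_counts[left] += 1

# Each tool should appear on the left roughly half the time
print(left_counts)
```

Because the stored label refers to the tool rather than the screen side, any residual preference for one side averages out across servings instead of favoring a particular tool.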
## Dataset Structure
### `comparisons` (2,400 rows)
Each row is one aggregated comparison: one prompt, one tool pair, and one dimension, with both screenshots, the prompt text, and vote counts from 15 annotators.
| Column | Type | Description |
|--------|------|-------------|
| `image_a` | image | Full-page screenshot of tool_a's generated landing page |
| `image_b` | image | Full-page screenshot of tool_b's generated landing page |
| `tool_a` | string | First tool in the pair |
| `tool_b` | string | Second tool in the pair |
| `prompt_id` | int | Prompt ID (1-100) |
| `brand` | string | Business name from the prompt |
| `category` | string | Business category (e.g., "SaaS", "fintech", "restaurant") |
| `tone` | string | Design tone (e.g., "minimalist", "bold", "luxury") |
| `prompt` | string | Full prompt text used to generate the landing page |
| `dimension` | string | Evaluation dimension (see questions below) |
| `dimension_question` | string | The exact question annotators answered |
| `votes_a` | int | Number of annotators who preferred tool_a (out of 15) |
| `votes_b` | int | Number of annotators who preferred tool_b (out of 15) |
| `winner` | string | "A" (tool_a majority), "B" (tool_b majority), or "tie" |
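
The `winner` column can be recomputed from the vote counts. The sketch below assumes a simple majority rule with `"tie"` for equal counts. This matches the column description but is not confirmed as the exact aggregation used, and `majority_winner` is a hypothetical helper.

```python
def majority_winner(votes_a: int, votes_b: int) -> str:
    """Return "A", "B", or "tie" from two vote counts (assumed rule)."""
    if votes_a > votes_b:
        return "A"
    if votes_b > votes_a:
        return "B"
    return "tie"

# Illustrative rows, not taken from the dataset.
# A tie needs an even total, so it cannot occur when all 15 votes are present.
rows = [
    {"votes_a": 11, "votes_b": 4},
    {"votes_a": 6, "votes_b": 9},
    {"votes_a": 7, "votes_b": 7},
]
labels = [majority_winner(r["votes_a"], r["votes_b"]) for r in rows]
print(labels)  # ['A', 'B', 'tie']
```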
### Evaluation Dimensions
Each comparison was evaluated on 4 independent dimensions. Annotators answered one question per dimension:
| Dimension | Question |
|-----------|----------|
| **aesthetic** | "Which design looks better at first glance?" |
| **typography** | "Which has better font choices, sizing, and readability?" |
| **layout** | "Which has better spacing, alignment, and visual flow?" |
| **completeness** | "Which has more fully-built sections with no empty or broken areas?" |
### `prompts` (100 rows)
| Column | Type | Description |
|--------|------|-------------|
| `id` | int | Prompt ID (1-100) |
| `category` | string | Business category |
| `tone` | string | Design tone |
| `prompt` | string | Full prompt text |
### `screenshots` (400 images)
Full-page screenshots of all generated landing pages (100 prompts x 4 tools), captured at 1440x900 viewport.
## Key Findings
### Overall Rankings (Bradley-Terry)
| Rank | Tool | Strength | 95% CI |
|------|------|----------|--------|
| 1 | Cursor | 0.271 | 0.265 - 0.277 |
| 2 | Claude | 0.269 | 0.263 - 0.274 |
| 3 | Lovable | 0.262 | 0.256 - 0.267 |
| 4 | Replit | 0.199 | 0.194 - 0.204 |
The top 3 tools are **statistically indistinguishable** (Cursor vs Claude: p = 1.0; Claude vs Lovable: p = 0.14). Replit is significantly behind (p < 0.001).
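
Bradley-Terry strengths translate directly into predicted head-to-head preference rates: under the model, tool i is preferred over tool j with probability s_i / (s_i + s_j). A minimal sketch using the strengths from the table above (`win_probability` is an illustrative helper, not part of the dataset):

```python
strengths = {"Cursor": 0.271, "Claude": 0.269, "Lovable": 0.262, "Replit": 0.199}

def win_probability(tool_i: str, tool_j: str) -> float:
    """Bradley-Terry probability that tool_i is preferred over tool_j."""
    s_i, s_j = strengths[tool_i], strengths[tool_j]
    return s_i / (s_i + s_j)

print(f"{win_probability('Cursor', 'Claude'):.3f}")  # 0.502
print(f"{win_probability('Cursor', 'Replit'):.3f}")  # 0.577
```

In other words, the gap between the top two tools corresponds to a near coin flip, while Cursor is predicted to be preferred over Replit in roughly 58% of comparisons.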
### Dimension Specialization
No single tool wins every dimension:
| Dimension | #1 | #2 | #3 | #4 |
|-----------|----|----|----|----|
| Aesthetic | Lovable | Cursor | Claude | Replit |
| Typography | Cursor | Claude | Lovable | Replit |
| Layout | Lovable | Claude | Cursor | Replit |
| Completeness | Claude | Cursor | Lovable | Replit |
### Category Specialization
- **Lovable** ranks #1 in 35/97 categories (consumer brands, lifestyle, ecommerce)
- **Claude** ranks #1 in 32/97 categories (professional services, enterprise, fintech)
- **Cursor** ranks #1 in 17/97 categories (SaaS, tech, agency)
- **Replit** ranks #1 in 13/97 categories (developer tools, compliance)
## Usage
```python
from datasets import load_dataset
# Load pairwise comparison judgments
comparisons = load_dataset("datapointai/vibe-landing-page-arena", "comparisons")
# Load prompts
prompts = load_dataset("datapointai/vibe-landing-page-arena", "prompts")
# Load screenshots
screenshots = load_dataset("datapointai/vibe-landing-page-arena", "screenshots")
```
### Reproduce the Bradley-Terry analysis
```python
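import numpy as np
from scipy.optimize import minimize

# `comparisons` is the DatasetDict loaded in the Usage snippet above.
# Drop the image columns so iteration does not decode every screenshot.
rows = comparisons["train"].remove_columns(["image_a", "image_b"])

# Count wins per ordered tool pair from the aggregated vote counts:
# wins[(x, y)] = number of judgments that preferred x over y.
wins = {}
for row in rows:
    a, b = row["tool_a"], row["tool_b"]
    wins[(a, b)] = wins.get((a, b), 0) + row["votes_a"]
    wins[(b, a)] = wins.get((b, a), 0) + row["votes_b"]

# Fit Bradley-Terry model (tool names are read from the data)
tools = sorted({t for pair in wins for t in pair})
idx = {t: i for i, t in enumerate(tools)}

def neg_log_likelihood(params):
    nll = 0.0
    for (a, b), count in wins.items():
        # P(a preferred over b) given log-strengths `params`
        p = 1.0 / (1.0 + np.exp(params[idx[b]] - params[idx[a]]))
        nll -= count * np.log(max(p, 1e-10))
    return nll

# Pin the first log-strength at 0 so the model is identifiable
result = minimize(neg_log_likelihood, np.zeros(len(tools)),
                  method="L-BFGS-B",
                  bounds=[(0, 0)] + [(None, None)] * (len(tools) - 1))

# Normalize so the strengths sum to 1
strengths = np.exp(result.x) / np.exp(result.x).sum()
for tool, s in sorted(zip(tools, strengths), key=lambda x: -x[1]):
    print(f"{tool}: {s:.4f}")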
```
## Methodology
- **Ranking model:** Bradley-Terry with 1,000 bootstrap iterations for 95% confidence intervals
- **Significance testing:** Likelihood ratio tests between adjacent-ranked tools
- **Position bias:** Verified negligible via BT model with position parameter (delta = -0.03, CI crosses zero). Display order randomized per serving.
- **Annotator quality:** Platform uses calibration tasks with known gold-standard answers to compute annotator trust scores. 60% of calibrated annotators achieved perfect trust scores (1.0).
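
The bootstrap confidence intervals can be sketched as follows. The card does not state the resampling unit or the optimizer, so this example makes two assumptions: judgments are resampled within each tool pair (a stratified bootstrap, drawn here as a binomial with the observed win rate), and the model is fitted with the standard MM update for Bradley-Terry. The win counts are made up for illustration.

```python
import numpy as np

def fit_bt(wins, n_iter=100):
    """Fit Bradley-Terry strengths with the MM algorithm.

    wins[i, j] = number of times item i was preferred over item j.
    Returns strengths normalized to sum to 1.
    """
    k = wins.shape[0]
    totals = wins + wins.T            # comparisons per pair
    total_wins = wins.sum(axis=1)     # wins per item
    p = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        denom = totals / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p = total_wins / denom.sum(axis=1)
        p /= p.sum()
    return p

def bootstrap_ci(wins, n_boot=1000, seed=0):
    """Percentile 95% intervals from a per-pair (stratified) bootstrap."""
    rng = np.random.default_rng(seed)
    k = wins.shape[0]
    totals = wins + wins.T
    samples = np.empty((n_boot, k))
    for b in range(n_boot):
        resampled = np.zeros_like(wins)
        for i in range(k):
            for j in range(i + 1, k):
                n = totals[i, j]
                w = rng.binomial(n, wins[i, j] / n)
                resampled[i, j] = w
                resampled[j, i] = n - w
        samples[b] = fit_bt(resampled)
    return np.percentile(samples, [2.5, 97.5], axis=0)

# Made-up win counts for 4 tools. Every pair has 6,000 judgments,
# matching 100 prompts x 4 dimensions x 15 judgments.
wins = np.array([
    [0, 3050, 3100, 3500],
    [2950, 0, 3060, 3480],
    [2900, 2940, 0, 3400],
    [2500, 2520, 2600, 0],
])
point = fit_bt(wins)
lower, upper = bootstrap_ci(wins)
for i in range(len(point)):
    print(f"tool {i}: {point[i]:.3f} ({lower[i]:.3f} - {upper[i]:.3f})")
```

Resampling whole prompts or whole annotators instead of individual judgments would generally give wider intervals, so the choice of unit matters when comparing against the intervals reported above.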
## Comparison to Related Work
| | This dataset | [Vibe Design Arena v1](https://huggingface.co/datasets/datapointai/vibe-design-arena) | Verita AI Study |
|---|---|---|---|
| Prompts | 100 (controlled) | 60 (real-world apps) | 80 (controlled) |
| Tools | 4 | 6 | 4 |
| Dimensions | 4 | 1 | 4 |
| Total judgments | 36,000 | ~53,000 | 1,260 |
| Annotators | 3,492 | unknown | 5 |
| Judgments per matchup | 15 per dimension | 30 | ~3 |
| Position randomization | Yes | Yes | Not reported |
| Statistical model | Bradley-Terry + bootstrap CI | Win rate | Bradley-Terry |
## License
CC-BY-4.0
## Citation
```bibtex
@dataset{vibe_landing_page_arena_2026,
title={Vibe Landing Page Arena: Human Preference Evaluation of AI-Generated Landing Page Design},
author={Datapoint AI},
year={2026},
url={https://huggingface.co/datasets/datapointai/vibe-landing-page-arena},
note={36,000 pairwise judgments across 4 tools, 100 prompts, and 4 design dimensions}
}
```
## Contact
Built by [Datapoint AI](https://trydatapoint.com). Questions or feedback: sales@trydatapoint.com
Provider:
datapointai



