DJLougen/wittgensite
Hugging Face | Updated 2026-04-10 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/DJLougen/wittgensite
Resource description:
---
license: cc-by-nc-sa-4.0
task_categories:
- text-generation
language:
- en
tags:
- benchmark
- code-generation
- prompt-consistency
- ai-agents
- coding-agents
- evaluation
- wittgenstein
- semantic-invariance
- caduceus
pretty_name: "WittgenSite: Prompt Consistency Benchmark"
size_categories:
- n<1K
configs:
- config_name: prompts
  data_files:
  - split: test
    path: prompts.jsonl
---
# WittgenSite: A Prompt Consistency Benchmark for AI Coding Agents
**Created by [Daniel Lougen](https://huggingface.co/DJLougen)**
Most benchmarks test whether an agent *can* complete a task. WittgenSite tests whether an agent produces **the same output regardless of how you ask**.
Inspired by Wittgenstein's insight that meaning is determined by use, this benchmark measures whether AI coding agents extract the same meaning from semantically equivalent prompts.
## Benchmark Design
One locked specification. 100 differently worded, semantically equivalent prompts. Same task every time. Score = consistency across runs.
The agent builds a 5-page SaaS website from `GOLDEN-SPEC.md` (vanilla HTML + Tailwind CDN). Each run uses a different prompt from `PROMPTS.md`. The outputs are compared for structural, textual, behavioral, and stylistic consistency.
## Prompt Categories
| Category | Prompts | Tests |
|----------|---------|-------|
| Direct & Minimal | 1-25 | Baseline consistency with simple instructions |
| Role / Persona Based | 26-50 | Whether persona framing ("you are a senior dev") causes drift |
| Verbose / Detailed | 51-75 | Whether extra detail causes additions or changes |
| Casual & Red Herring | 76-100 | Whether suggestive language ("make it premium") causes deviation |
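
The ID ranges in the table can be turned into a small lookup, which is useful when grouping results by category. This is a minimal sketch, assuming prompts are numbered 1 to 100 in the order shown above; `category_for` is a hypothetical helper and is not part of the benchmark's scripts.

```python
# Category ranges copied from the table above (inclusive prompt IDs).
CATEGORY_RANGES = [
    (1, 25, "Direct & Minimal"),
    (26, 50, "Role / Persona Based"),
    (51, 75, "Verbose / Detailed"),
    (76, 100, "Casual & Red Herring"),
]


def category_for(prompt_id: int) -> str:
    """Return the category name for a 1-based prompt ID (hypothetical helper)."""
    for low, high, name in CATEGORY_RANGES:
        if low <= prompt_id <= high:
            return name
    raise ValueError(f"prompt_id must be in 1..100, got {prompt_id}")


if __name__ == "__main__":
    for pid in (1, 26, 75, 100):
        print(pid, category_for(pid))
```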
## Scoring
### Per-Run: Spec Fidelity (7 dimensions)
| Dimension | Weight |
|-----------|--------|
| Structure & Files | 20% |
| Copy Fidelity | 15% |
| Theme System | 15% |
| Accessibility | 15% |
| Responsive Layout | 10% |
| Interactivity | 15% |
| Code Quality | 10% |
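
One plausible way to combine the seven dimensions is a weighted sum. The sketch below assumes each dimension is scored on a 0-100 scale and that the weights are applied linearly; the source does not show how `scoring/evaluate.py` computes the total, so the function name and the input format are assumptions.

```python
# Weights copied from the per-run table above; they sum to 1.0.
FIDELITY_WEIGHTS = {
    "Structure & Files": 0.20,
    "Copy Fidelity": 0.15,
    "Theme System": 0.15,
    "Accessibility": 0.15,
    "Responsive Layout": 0.10,
    "Interactivity": 0.15,
    "Code Quality": 0.10,
}


def weighted_score(dimension_scores: dict[str, float],
                   weights: dict[str, float] = FIDELITY_WEIGHTS) -> float:
    """Combine per-dimension scores (each assumed 0-100) into one 0-100 score."""
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    return sum(weights[name] * dimension_scores[name] for name in weights)


if __name__ == "__main__":
    # Hypothetical run: perfect everywhere except Accessibility at 80.
    scores = {name: 100.0 for name in FIDELITY_WEIGHTS}
    scores["Accessibility"] = 80.0
    print(round(weighted_score(scores), 1))
```

The same function would work for the cross-run dimensions by passing a different `weights` dictionary.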
### Cross-Run: Consistency (5 dimensions)
| Dimension | Weight |
|-----------|--------|
| Structural Consistency | 30% |
| Copy Consistency | 25% |
| Behavioral Consistency | 20% |
| Style Consistency | 15% |
| Exact Match Rate | 10% |
### Interpretation
| Score | Meaning |
|-------|---------|
| 90-100 | Excellent — near-deterministic output |
| 75-89 | Good — minor cosmetic drift |
| 50-74 | Moderate — prompt wording affects output |
| < 50 | Poor — output depends heavily on phrasing |
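
The bands above map directly onto a small helper. This is a sketch only: the table lists integer ranges, so treating each lower bound as inclusive for fractional scores (for example, 89.5 counts as "Good") is an assumption, and `interpret` is a hypothetical name.

```python
def interpret(score: float) -> str:
    """Map a 0-100 consistency score to the band labels in the table above.

    Assumption: lower bounds are inclusive, so fractional scores fall into
    the band whose lower bound they reach.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in 0..100, got {score}")
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 50:
        return "Moderate"
    return "Poor"


if __name__ == "__main__":
    for value in (95, 80, 60, 31.9):
        print(value, interpret(value))
```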
## Preliminary Results
Two runs were tested with the same model (Prompt #1, direct, vs. Prompt #26, role-based):
- Per-run spec fidelity: **100/100** and **99.5/100**
- Cross-run consistency: **31.9/100 (Poor)**
Both runs built functional websites that matched the spec individually. But the implementations were structurally different — different Tailwind classes, different JS patterns, different HTML nesting. Adding "You are a senior frontend developer" to the prompt changed the output significantly.
## Statistical Power Analysis
How many runs do you need for reliable results? The tables below come from a formal power analysis that assumes SD ≈ 15 on the 0–100 consistency scale, an assumption informed by limited pilot data.
### Estimating Overall Consistency
| Desired Precision (95% CI) | Runs Needed |
|----------------------------|-------------|
| ±10 points | 9 |
| ±7 points | 18 |
| ±5 points | 35 |
| ±3 points | 97 |
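
The run counts in this table are consistent with the standard normal-approximation formula for the half-width of a confidence interval, n = ceil((z * SD / E)^2). The source does not state which formula was used, so the sketch below is an assumption that happens to reproduce the table with SD = 15; `runs_needed` is a hypothetical name.

```python
import math
from statistics import NormalDist


def runs_needed(margin: float, sd: float = 15.0, confidence: float = 0.95) -> int:
    """Runs needed so the CI half-width is at most `margin` points.

    Uses the normal approximation n = ceil((z * sd / margin) ** 2).
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return math.ceil((z * sd / margin) ** 2)


if __name__ == "__main__":
    for margin in (10, 7, 5, 3):
        print(f"±{margin} points: {runs_needed(margin)} runs")
```

Because the required count scales with the square of SD, a revised variance estimate would change these numbers noticeably; passing a different `sd` shows the effect.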
### Detecting Category Effects (ANOVA)
Does prompt style (direct vs. persona vs. verbose vs. casual) cause consistency drift?
| Runs per Category | Power (medium effect) | Power (large effect) |
|-------------------|----------------------|---------------------|
| 10 | 21% | 50% |
| 15 | 32% | 71% |
| 20 | 42% | 85% |
| 25 (all) | 52% | 92% |
### Comparing Two Models
| Runs per Model | Power (medium effect, d=0.5) | Power (large effect, d=0.8) |
|----------------|------------------------------|----------------------------|
| 20 | 34% | 69% |
| 30 | 48% | 86% |
| 50 | 70% | 98% |
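
For a rough cross-check of the two-model table, power can be approximated with a normal distribution instead of the t distribution. This sketch assumes a two-sided test at alpha = 0.05 with equal group sizes; it slightly overestimates power for small samples, so its output lands within a few percentage points of the table rather than matching it exactly. `approx_power` is a hypothetical name, and the source does not say which tool produced the table.

```python
import math
from statistics import NormalDist


def approx_power(n_per_model: int, d: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample comparison.

    Normal approximation: the test statistic is treated as N(shift, 1)
    with shift = d * sqrt(n / 2) for equal group sizes.
    """
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n_per_model / 2)
    # Probability of landing in either rejection region.
    return std.cdf(shift - z_crit) + std.cdf(-shift - z_crit)


if __name__ == "__main__":
    for n in (20, 30, 50):
        medium = approx_power(n, 0.5)
        large = approx_power(n, 0.8)
        print(f"n={n}: medium={medium:.2f}, large={large:.2f}")
```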
### Recommended Run Counts
| Goal | Minimum | Recommended |
|------|---------|-------------|
| Quick estimate of overall consistency | 9 runs (random sample) | 35 runs |
| Per-category breakdown | 10 per category (40 total) | All 25 per category (100 total) |
| Model-vs-model comparison | 30 runs each | 50 runs each |
| Full benchmark (leaderboard submission) | All 100 prompts | All 100 prompts |
> **Note:** Pilot data (2 runs, consistency = 31.9/100) suggests high variance across implementations. With only 25 prompts per category, **running all 100 is strongly recommended** for reliable per-category claims.
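
For exploratory runs below the full 100 prompts, a stratified random sample keeps the four categories balanced. This is a sketch under the assumption that prompt IDs follow the 1-25, 26-50, 51-75, 76-100 layout described earlier; `sample_prompt_ids` is a hypothetical helper, and such subsets are not valid for leaderboard submissions.

```python
import random


def sample_prompt_ids(per_category: int, seed: int = 0) -> list[int]:
    """Pick `per_category` prompt IDs at random from each block of 25."""
    if not 1 <= per_category <= 25:
        raise ValueError("per_category must be between 1 and 25")
    rng = random.Random(seed)  # fixed seed so the sample can be reported
    picked: list[int] = []
    for start in (1, 26, 51, 76):
        block = range(start, start + 25)
        picked.extend(sorted(rng.sample(block, per_category)))
    return picked


if __name__ == "__main__":
    ids = sample_prompt_ids(10)
    print(len(ids), ids)
```

Recording the seed alongside the results makes it possible for others to rerun the same subset.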
## Submission Requirements
To submit results to the [WittgenSite Leaderboard](https://huggingface.co/spaces/DJLougen/Wittgensite-leaderboard), your submission must include:
### Required
- **All 100 prompts executed** — partial runs are not accepted for leaderboard ranking
- **Fresh context per run** — each prompt must start a new agent session with no memory of prior runs
- **5 HTML output files per run** saved to numbered directories (`runs/001/` through `runs/100/`)
- **Model identification** — exact model name, version, and provider (e.g., `claude-sonnet-4-20250514`, `gpt-4o-2024-08-06`)
- **Agent framework** — tool/framework used (e.g., Claude Code, Cursor, Aider, custom harness)
- **Scoring outputs** — results from both `scoring/evaluate.py` and `scoring/consistency.py`
- **Temperature setting** — must be reported; the agent's default is recommended
### Submission Format
```
submission/
├── meta.json # Model, agent, temperature, date, submitter
├── runs/
│ ├── 001/ # Prompt 1 output
│ │ ├── home.html
│ │ ├── features.html
│ │ ├── pricing.html
│ │ ├── about.html
│ │ └── app.html
│ ├── 002/ # Prompt 2 output
│ │ └── ...
│ └── 100/
│ └── ...
├── scores/
│ ├── fidelity.json # Per-run spec fidelity scores
│ └── consistency.json # Cross-run consistency scores
└── logs/ # Optional: raw agent logs for reproducibility
```
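Before submitting, it may help to check the directory layout programmatically. The sketch below walks the tree shown above and lists anything missing; it treats `logs/` as optional, as the tree indicates. `find_layout_problems` is a hypothetical helper and not part of the official scoring scripts.

```python
from pathlib import Path

# Page names taken from the submission tree above.
PAGES = ("home.html", "features.html", "pricing.html", "about.html", "app.html")


def find_layout_problems(submission: Path, n_runs: int = 100) -> list[str]:
    """Return a list of missing files; an empty list means the layout looks complete."""
    problems: list[str] = []
    if not (submission / "meta.json").is_file():
        problems.append("missing meta.json")
    for name in ("fidelity.json", "consistency.json"):
        if not (submission / "scores" / name).is_file():
            problems.append(f"missing scores/{name}")
    for i in range(1, n_runs + 1):
        run_name = f"{i:03d}"  # 001, 002, ..., 100
        for page in PAGES:
            if not (submission / "runs" / run_name / page).is_file():
                problems.append(f"missing runs/{run_name}/{page}")
    return problems


if __name__ == "__main__":
    for problem in find_layout_problems(Path("submission")):
        print(problem)
```

The check only confirms that files exist; it does not validate their contents.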
### meta.json Example
```json
{
"model": "claude-sonnet-4-20250514",
"agent": "Claude Code v1.0.0",
"temperature": "default",
"date": "2026-04-10",
"submitter": "DJLougen",
"notes": "Run on macOS, no custom system prompt"
}
```
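A quick sanity check of `meta.json` can catch omissions early. The required keys below are inferred from the comment in the submission tree (model, agent, temperature, date, submitter), with `notes` treated as optional; that split is an assumption, and `missing_meta_keys` is a hypothetical helper.

```python
import json

# Assumed required keys, inferred from the submission tree comment.
REQUIRED_KEYS = ("model", "agent", "temperature", "date", "submitter")


def missing_meta_keys(meta_text: str) -> list[str]:
    """Return required keys that are absent or empty in a meta.json string."""
    meta = json.loads(meta_text)
    if not isinstance(meta, dict):
        raise ValueError("meta.json must contain a JSON object")
    return [key for key in REQUIRED_KEYS if meta.get(key) in ("", None)]


if __name__ == "__main__":
    example = '{"model": "claude-sonnet-4-20250514", "agent": "Claude Code v1.0.0"}'
    print(missing_meta_keys(example))
```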
### Disqualification Criteria
- Runs sharing context or memory between prompts
- Manual editing of output files
- Cherry-picking a subset of prompts
- Using the scoring scripts to iteratively improve outputs
## Limitations
- **Subsampling is insufficient for category-level conclusions.** With only 25 prompts per category and an assumed SD of about 15 in consistency scores, running fewer than all 25 prompts in a category does not provide enough statistical power to make reliable claims about whether a specific prompt style (e.g., persona-based vs. direct) causes meaningful drift. Even with all 25 prompts per category, the estimated power to detect a medium effect is only about 52% (92% for a large effect), so full 100-prompt runs are the minimum for per-category analysis.
- **Pilot data is limited.** Current power estimates are based on 2 pilot runs. As more submissions arrive, the assumed variance (and therefore sample size recommendations) may be revised.
- **Single-spec benchmark.** Results reflect consistency on one specific website specification. Generalization to other tasks or domains is not established.
- **Scoring is automated, not human-judged.** The fidelity and consistency scorers use structural/textual comparison, which may miss semantic equivalences (e.g., two visually identical implementations using different CSS approaches).
## Files
| File | Description |
|------|-------------|
| `GOLDEN-SPEC.md` | Locked website specification (source of truth) |
| `PROMPTS.md` | 100 differently worded, semantically equivalent prompts across 4 categories |
| `scoring/evaluate.py` | Per-run spec fidelity scorer |
| `scoring/consistency.py` | Cross-run consistency scorer (primary metric) |
| `task.json` | Caduceus benchmark metadata |
## Usage
```bash
# 1. Run agent with fresh context + golden spec + one prompt
# 2. Save output to runs/001/, runs/002/, etc.
# Score individual run
python scoring/evaluate.py runs/001/
# Score consistency across all runs
python scoring/consistency.py runs/
```
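With 100 run directories, the per-run scoring step is easier to drive from a script. The sketch below only builds on the two command lines shown above; it assumes the scripts accept a directory path as their single argument and that it is launched from the repository root. `scoring_commands` and `score_all` are hypothetical names.

```python
import subprocess
import sys
from pathlib import Path


def scoring_commands(runs_dir: Path) -> list[list[str]]:
    """Build one evaluate.py command per run directory, then one consistency.py command."""
    run_dirs = sorted(p for p in runs_dir.iterdir() if p.is_dir())
    commands = [[sys.executable, "scoring/evaluate.py", f"{p}/"] for p in run_dirs]
    commands.append([sys.executable, "scoring/consistency.py", f"{runs_dir}/"])
    return commands


def score_all(runs_dir: Path) -> None:
    """Run every scoring command in order, stopping at the first failure."""
    for command in scoring_commands(runs_dir):
        subprocess.run(command, check=True)
```

Calling `score_all(Path("runs"))` would then score each run before computing cross-run consistency. How the scripts write their results is not described in the source, so collecting `fidelity.json` and `consistency.json` is left out here.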
## Citation
```bibtex
@misc{lougen2026wittgensite,
title={WittgenSite: A Prompt Consistency Benchmark for AI Coding Agents},
author={Lougen, Daniel},
year={2026},
url={https://github.com/DJLougen/wittgensite},
note={Inspired by Wittgenstein's philosophy of language}
}
```
## License
**CC BY-NC-SA 4.0** — Attribution required, non-commercial, share-alike.
Commercial use requires explicit written permission from [Daniel Lougen](https://x.com/DJLougen).
## Links
- **Leaderboard**: [DJLougen/Wittgensite-leaderboard](https://huggingface.co/spaces/DJLougen/Wittgensite-leaderboard)
- **GitHub**: [DJLougen/wittgensite](https://github.com/DJLougen/wittgensite)
- **Caduceus Task Page**: [djlougen.github.io/caduceus/tasks/T014](https://djlougen.github.io/caduceus/tasks/T014)
- **Author**: [@DJLougen](https://x.com/DJLougen)
Provided by:
DJLougen



