DJLougen/wittgensite

Name: DJLougen/wittgensite
Creator: DJLougen
Published: 2026-04-10 13:21:15
License: CC BY-NC-SA 4.0

Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/DJLougen/wittgensite


Description:

---
license: cc-by-nc-sa-4.0
task_categories:
- text-generation
language:
- en
tags:
- benchmark
- code-generation
- prompt-consistency
- ai-agents
- coding-agents
- evaluation
- wittgenstein
- semantic-invariance
- caduceus
pretty_name: "WittgenSite: Prompt Consistency Benchmark"
size_categories:
- n<1K
configs:
- config_name: prompts
  data_files:
  - split: test
    path: prompts.jsonl
---

# WittgenSite: A Prompt Consistency Benchmark for AI Coding Agents

**Created by [Daniel Lougen](https://huggingface.co/DJLougen)**

Most benchmarks test whether an agent *can* complete a task. WittgenSite tests whether an agent produces **the same output regardless of how you ask**.

Inspired by Wittgenstein's insight that meaning is determined by use, this benchmark measures whether AI coding agents extract the same meaning from semantically equivalent prompts.

## Benchmark Design

One locked specification. 100 differently worded prompts. Same task every time. Score = consistency across runs.

The agent builds a 5-page SaaS website from `GOLDEN-SPEC.md` (vanilla HTML + Tailwind CDN). Each run uses a different prompt from `PROMPTS.md`. The outputs are compared for structural, textual, behavioral, and stylistic consistency.
## Prompt Categories

| Category | Prompts | Tests |
|----------|---------|-------|
| Direct & Minimal | 1-25 | Baseline consistency with simple instructions |
| Role / Persona Based | 26-50 | Whether persona framing ("you are a senior dev") causes drift |
| Verbose / Detailed | 51-75 | Whether extra detail causes additions or changes |
| Casual & Red Herring | 76-100 | Whether suggestive language ("make it premium") causes deviation |

## Scoring

### Per-Run: Spec Fidelity (7 dimensions)

| Dimension | Weight |
|-----------|--------|
| Structure & Files | 20% |
| Copy Fidelity | 15% |
| Theme System | 15% |
| Accessibility | 15% |
| Responsive Layout | 10% |
| Interactivity | 15% |
| Code Quality | 10% |

### Cross-Run: Consistency (5 dimensions)

| Dimension | Weight |
|-----------|--------|
| Structural Consistency | 30% |
| Copy Consistency | 25% |
| Behavioral Consistency | 20% |
| Style Consistency | 15% |
| Exact Match Rate | 10% |

### Interpretation

| Score | Meaning |
|-------|---------|
| 90-100 | Excellent: near-deterministic output |
| 75-89 | Good: minor cosmetic drift |
| 50-74 | Moderate: prompt wording affects output |
| < 50 | Poor: output depends heavily on phrasing |

## Preliminary Results

2 runs tested (Prompt #1 direct vs. Prompt #26 role-based), same model:

- Per-run spec fidelity: **100/100** and **99.5/100**
- Cross-run consistency: **31.9/100 (Poor)**

Both runs built functional websites that matched the spec individually. But the implementations were structurally different: different Tailwind classes, different JS patterns, different HTML nesting. Adding "You are a senior frontend developer" to the prompt changed the output significantly.

## Statistical Power Analysis

How many runs do you need for reliable results?
A formal power analysis (assuming SD ≈ 15 on the 0–100 consistency scale, based on pilot data):

### Estimating Overall Consistency

| Desired Precision (95% CI) | Runs Needed |
|----------------------------|-------------|
| ±10 points | 9 |
| ±7 points | 18 |
| ±5 points | 35 |
| ±3 points | 97 |

### Detecting Category Effects (ANOVA)

Can prompt style (direct vs. persona vs. verbose vs. casual) cause consistency drift?

| Runs per Category | Power (medium effect) | Power (large effect) |
|-------------------|----------------------|---------------------|
| 10 | 21% | 50% |
| 15 | 32% | 71% |
| 20 | 42% | 85% |
| 25 (all) | 52% | 92% |

### Comparing Two Models

| Runs per Model | Power (medium effect, d=0.5) | Power (large effect, d=0.8) |
|----------------|------------------------------|----------------------------|
| 20 | 34% | 69% |
| 30 | 48% | 86% |
| 50 | 70% | 98% |

### Recommended Run Counts

| Goal | Minimum | Recommended |
|------|---------|-------------|
| Quick estimate of overall consistency | 9 runs (random sample) | 35 runs |
| Per-category breakdown | 10 per category (40 total) | All 25 per category (100 total) |
| Model-vs-model comparison | 30 runs each | 50 runs each |
| Full benchmark (leaderboard submission) | All 100 prompts | All 100 prompts |

> **Note:** Pilot data (2 runs, consistency = 31.9/100) suggests high variance across implementations. With only 25 prompts per category, **running all 100 is strongly recommended** for reliable per-category claims.
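The run counts in the "Estimating Overall Consistency" table can be reproduced with the standard normal-approximation formula for a confidence interval on a mean, n = ceil((z · SD / E)²), where E is the desired half-width. The card does not state which formula was used, so the sketch below is an assumption that happens to match the published numbers; `runs_needed` is a hypothetical helper name, not part of the benchmark's scoring scripts.

```python
import math

# Two-sided 95% quantile of the standard normal distribution.
Z_95 = 1.959964


def runs_needed(sd: float, half_width: float, z: float = Z_95) -> int:
    """Smallest n such that z * sd / sqrt(n) <= half_width.

    Assumes scores are roughly normal and the SD is known in advance
    (here, the SD of about 15 that the card takes from pilot data).
    """
    return math.ceil((z * sd / half_width) ** 2)


# Reproduce the table rows for SD = 15 on the 0-100 consistency scale.
for half_width in (10, 7, 5, 3):
    print(f"+/-{half_width} points: {runs_needed(15.0, half_width)} runs")
# +/-10 points: 9 runs
# +/-7 points: 18 runs
# +/-5 points: 35 runs
# +/-3 points: 97 runs
```

Because n grows with the square of SD / E, a revised variance estimate changes these counts quickly: under the same formula, an SD of 20 instead of 15 would raise the ±5 requirement from 35 runs to 62.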
## Submission Requirements

To submit results to the [WittgenSite Leaderboard](https://huggingface.co/spaces/DJLougen/Wittgensite-leaderboard), your submission must include:

### Required

- **All 100 prompts executed**: partial runs are not accepted for leaderboard ranking
- **Fresh context per run**: each prompt must start a new agent session with no memory of prior runs
- **5 HTML output files per run** saved to numbered directories (`runs/001/` through `runs/100/`)
- **Model identification**: exact model name, version, and provider (e.g., `claude-sonnet-4-20250514`, `gpt-4o-2024-08-06`)
- **Agent framework**: tool/framework used (e.g., Claude Code, Cursor, Aider, custom harness)
- **Scoring outputs**: results from both `scoring/evaluate.py` and `scoring/consistency.py`
- **Temperature setting**: must be reported; the agent's default is recommended

### Submission Format

```
submission/
├── meta.json              # Model, agent, temperature, date, submitter
├── runs/
│   ├── 001/               # Prompt 1 output
│   │   ├── home.html
│   │   ├── features.html
│   │   ├── pricing.html
│   │   ├── about.html
│   │   └── app.html
│   ├── 002/               # Prompt 2 output
│   │   └── ...
│   └── 100/
│       └── ...
├── scores/
│   ├── fidelity.json      # Per-run spec fidelity scores
│   └── consistency.json   # Cross-run consistency scores
└── logs/                  # Optional: raw agent logs for reproducibility
```

### meta.json Example

```json
{
  "model": "claude-sonnet-4-20250514",
  "agent": "Claude Code v1.0.0",
  "temperature": "default",
  "date": "2026-04-10",
  "submitter": "DJLougen",
  "notes": "Run on macOS, no custom system prompt"
}
```

### Disqualification Criteria

- Runs sharing context or memory between prompts
- Manual editing of output files
- Cherry-picking a subset of prompts
- Using the scoring scripts to iteratively improve outputs

## Limitations

- **Subsampling is insufficient for category-level conclusions.** With only 25 prompts per category and expected high variance in consistency scores (SD ≈ 15), running fewer than all 25 prompts in a category does not provide enough statistical power to make reliable claims about whether a specific prompt style (e.g., persona-based vs. direct) causes meaningful drift. Full 100-prompt runs are required for per-category analysis.
- **Pilot data is limited.** Current power estimates are based on 2 pilot runs. As more submissions arrive, the assumed variance (and therefore sample size recommendations) may be revised.
- **Single-spec benchmark.** Results reflect consistency on one specific website specification. Generalization to other tasks or domains is not established.
- **Scoring is automated, not human-judged.** The fidelity and consistency scorers use structural/textual comparison, which may miss semantic equivalences (e.g., two visually identical implementations using different CSS approaches).
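The directory layout under "Submission Format" lends itself to a quick pre-flight check before uploading. The following is a minimal sketch, not the leaderboard's actual validation: `check_submission` is a hypothetical helper, the five page names are taken from the `runs/001/` example, and the required `meta.json` keys are inferred from the example file, so the official checks may differ.

```python
import json
from pathlib import Path

# Page names as shown in the runs/001/ example of the submission tree.
EXPECTED_PAGES = ("home.html", "features.html", "pricing.html", "about.html", "app.html")
# Keys inferred from the meta.json example (assumption: "notes" is optional).
META_KEYS = ("model", "agent", "temperature", "date", "submitter")
SCORE_FILES = ("fidelity.json", "consistency.json")


def check_submission(root: Path, n_prompts: int = 100) -> list[str]:
    """Return a list of human-readable problems; an empty list means the layout looks complete."""
    problems: list[str] = []

    meta_path = root / "meta.json"
    if not meta_path.is_file():
        problems.append("missing meta.json")
    else:
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        for key in META_KEYS:
            if key not in meta:
                problems.append(f"meta.json: missing key {key!r}")

    # Every prompt needs its own zero-padded directory with all five pages.
    for i in range(1, n_prompts + 1):
        run_dir = root / "runs" / f"{i:03d}"
        for page in EXPECTED_PAGES:
            if not (run_dir / page).is_file():
                problems.append(f"runs/{i:03d}: missing {page}")

    for name in SCORE_FILES:
        if not (root / "scores" / name).is_file():
            problems.append(f"missing scores/{name}")

    return problems
```

Calling `check_submission(Path("submission"))` and printing the returned list shows which runs are incomplete. The check only covers file presence; it cannot detect shared context, manual edits, or the other disqualification criteria.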
## Files

| File | Description |
|------|-------------|
| `GOLDEN-SPEC.md` | Locked website specification (source of truth) |
| `PROMPTS.md` | 100 semantically diverse prompts across 4 categories |
| `scoring/evaluate.py` | Per-run spec fidelity scorer |
| `scoring/consistency.py` | Cross-run consistency scorer (primary metric) |
| `task.json` | Caduceus benchmark metadata |

## Usage

```bash
# 1. Run agent with fresh context + golden spec + one prompt
# 2. Save output to runs/001/, runs/002/, etc.

# Score individual run
python scoring/evaluate.py runs/001/

# Score consistency across all runs
python scoring/consistency.py runs/
```

## Citation

```bibtex
@misc{lougen2026wittgensite,
  title={WittgenSite: A Prompt Consistency Benchmark for AI Coding Agents},
  author={Lougen, Daniel},
  year={2026},
  url={https://github.com/DJLougen/wittgensite},
  note={Inspired by Wittgenstein's philosophy of language}
}
```

## License

**CC BY-NC-SA 4.0**: attribution required, non-commercial, share-alike. Commercial use requires explicit written permission from [Daniel Lougen](https://x.com/DJLougen).

## Links

- **Leaderboard**: [DJLougen/Wittgensite-leaderboard](https://huggingface.co/spaces/DJLougen/Wittgensite-leaderboard)
- **GitHub**: [DJLougen/wittgensite](https://github.com/DJLougen/wittgensite)
- **Caduceus Task Page**: [djlougen.github.io/caduceus/tasks/T014](https://djlougen.github.io/caduceus/tasks/T014)
- **Author**: [@DJLougen](https://x.com/DJLougen)
