banyaaiofficial/prototypebench-v1

Source: Hugging Face. Updated 2026-04-21. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/banyaaiofficial/prototypebench-v1
Description:
---
license: mit
language:
- en
task_categories:
- text-generation
- text2text-generation
pretty_name: PrototypeBench v0.1
tags:
- benchmark
- llm-evaluation
- llm-benchmark
- coding-agent
- agent-evaluation
- swe-bench
- fastapi
- python
- software-engineering
- execution-based-evaluation
- rlvr
size_categories:
- n<1K
source_datasets:
- original
configs:
- config_name: default
  data_files:
  - split: test
    path: "instances.jsonl"
dataset_info:
  features:
  - name: instance_id
    dtype: string
  - name: repo
    dtype: string
  - name: pr_number
    dtype: int32
  - name: pr_url
    dtype: string
  - name: pr_title
    dtype: string
  - name: base_commit
    dtype: string
  - name: head_commit
    dtype: string
  - name: problem_statement
    dtype: string
  - name: patch
    dtype: string
  - name: test_patch
    dtype: string
  - name: stack_domain
    dtype: string
  - name: contamination_tier
    dtype: string
  - name: created_at
    dtype: string
  - name: schema_version
    dtype: string
---

# PrototypeBench v0.1

> **Can your agent ship a full-stack AI-native prototype?**

PrototypeBench is an open benchmark for evaluating AI coding agents on **full-stack feature shipping**. Where SWE-Bench measures bug-fixing in mature Python libraries, PrototypeBench measures *"can the agent ship a full-stack feature on a modern AI-native stack?"*

- **Project home**: https://github.com/prototypebench/prototypebench
- **Website**: https://prototypebench.org
- **License**: MIT
- **Version**: v0.1 (initial corpus)
- **Language**: English (problem statements), Python (backend code), TypeScript/JavaScript (frontend code, future)

## Dataset Summary

The corpus contains 71 **PR-mined task instances** from active open-source repositories, each shaped for SWE-Bench-compatible execution-based scoring:

| Stat | Value |
|---|---:|
| Total instances | **71** |
| Sources | 2 (`fastapi/full-stack-fastapi-template`, `IBM/mcp-context-forge`) |
| `FAIL_TO_PASS` tests | 689 |
| `PASS_TO_PASS` regression-guard tests | 31,644 |
| Total test cases per full eval | **32,333** |
| stack_domain | 71 backend_only (v0.1); frontend & fullstack in later versions |
| contamination_tier | 71 held_out (all post-2026-01-01) |
| Schema version | 0.1 |

**Comparison**: SWE-Bench Verified has 500 instances, SWE-Bench Lite has 300, and HumanEval has 164. The v1 public beta targets 200–300 instances.

## Scoring

Scoring is execution-based and binary (no LLM-as-judge):

```
score(instance) = 1  iff  FAIL_TO_PASS ⊆ passing_tests
                     AND  PASS_TO_PASS ⊆ passing_tests   (no regression)
                  0  otherwise
```

**Judge**: `pytest` (backend) and `Playwright` (frontend, future).
**Ground truth** = the actual merged PR diff (hidden from the agent).

See the [methodology notes](https://github.com/prototypebench/prototypebench/blob/main/PLAN.md#52-오염-대응-contamination-mitigation).

## Usage

```python
from datasets import load_dataset

ds = load_dataset("banyaaiofficial/prototypebench-v1", split="test")

for item in ds:
    print(item["instance_id"])        # e.g. "IBM__mcp-context-forge-4270"
    print(item["problem_statement"])  # NL task spec (PR body or closing issue)
    base_sha = item["base_commit"]    # pre-PR commit; the agent starts here

    # Agent produces a non-test unified diff against base_sha.
    # Score it with the companion harness:
    #   pbench score --source <short> --pr <N> --patch-file agent_patch.diff
```

Each instance extends the SWE-Bench `instances.jsonl` schema with dual-test fields (`fail_to_pass.backend` / `.frontend`, `test_patch_backend` / `.frontend`) for future Playwright integration.

Full schema: https://github.com/prototypebench/prototypebench/blob/main/schemas/task_instance.schema.json

## Source Composition

| Source | Stars | License | Instances | F2P | P2P |
|---|---:|---|---:|---:|---:|
| [`fastapi/full-stack-fastapi-template`](https://github.com/fastapi/full-stack-fastapi-template) | 42.7k | MIT | 3 | 7 | 77 |
| [`IBM/mcp-context-forge`](https://github.com/IBM/mcp-context-forge) | 3.6k | Apache-2 | 68 | 682 | 31,567 |

All PRs are **merged PRs with maintainer-reviewed tests**. Task instances mine the natural atomic unit of change (one feature or fix at a time).

## Data Fields

See the task-schema doc for full field-by-field semantics. Highlights:

- `instance_id`: stable unique ID (`owner__repo-<pr_number>`)
- `base_commit` / `head_commit`: SHAs bounding the reference change
- `problem_statement`: natural-language task spec (from the closing issue body, else the PR description)
- `patch`: reference solution (non-test diff). **Hidden from the agent at evaluation time.**
- `test_patch`: test-only diff that the harness applies before running the agent's patch
- `fail_to_pass`: `{backend: [...], frontend: [...]}`, the tests the agent must make pass
- `pass_to_pass`: `{backend: [...], frontend: [...]}`, the regression-guard tests (must not break)
- `stack_domain`: `backend_only` | `frontend_only` | `fullstack`
- `environment`: python_version, node_version, uv_lock_sha, etc. for reproducible builds
- `contamination_tier`: `public` | `held_out` | `internal_only`

## Contamination & Fairness

- **Held-out by construction**: all v0.1 instances were merged after 2026-01-01 (Claude Opus 4.7 cutoff). Submitters must disclose their model cutoff for point-count adjustment.
- **Rotation**: the held-out tier is rotated per leaderboard season (Phase 5).
- **No vendor branding**: the benchmark carries no vendor name. It is hosted on `banyaaiofficial` for convenience only; the benchmark is project-neutral.

## Limitations

- v0.1 is backend-only. There is no Playwright scoring yet: the harness supports it, but frontend-kind PRs are planned for v1+.
- The 68 `mcp-context-forge` instances dominate the corpus; more diverse workload coverage is a v1+ priority.
- Benchmark quality depends on test strength. PRs with weak tests are filtered out, but the filter is imperfect, so curator review is recommended.
- Execution-based scoring requires running tests, so it is not instantaneous. See the harness for Docker-based reproducible runs.

## Related Benchmarks

- [SWE-Bench](https://www.swebench.com/): Python library bug-fixes (2,294 instances). PrototypeBench extends the pattern to modern AI-native full-stack apps.
- [SWE-Bench Lite / Verified](https://www.swebench.com/lite.html): curated subsets.
- [Terminal-Bench](https://www.tbench.ai/): CLI tasks.
- [BigCodeBench](https://bigcode-bench.github.io/): library-usage function-level tasks.

## Citation

The citation format will be fixed at the Phase 4 public launch. For now:

```
@misc{prototypebench_v01,
  title = {PrototypeBench v0.1: An AI-native Full-Stack Coding Agent Benchmark},
  year  = {2026},
  url   = {https://github.com/prototypebench/prototypebench},
  note  = {71 instances across 2 source repos; execution-based scoring}
}
```

## Changelog

- **v0.1** (2026-04-20): initial corpus. 71 backend_only instances, all held_out. Schema v0.1.
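
The binary scoring rule described under Scoring can be sketched in a few lines of Python. This is a minimal illustration, not the companion harness: the names `score_instance` and `resolved_rate` are hypothetical, the test IDs are made up, and the sketch assumes the caller has already collected the passing test IDs and flattened the `fail_to_pass` / `pass_to_pass` fields into plain lists.

```python
from typing import Iterable, Mapping


def score_instance(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    passing_tests: Iterable[str],
) -> int:
    """Return 1 only if every required test is among the passing tests."""
    passing = set(passing_tests)
    # Both subset checks must hold: the target tests now pass,
    # and no regression-guard test has been broken.
    resolved = set(fail_to_pass) <= passing and set(pass_to_pass) <= passing
    return int(resolved)


def resolved_rate(scores: Mapping[str, int]) -> float:
    """Fraction of instances scored 1 (hypothetical aggregate, not from the card)."""
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical test IDs, for illustration only.
f2p = ["tests/test_items.py::test_create_item"]
p2p = ["tests/test_users.py::test_read_user"]

print(score_instance(f2p, p2p, f2p + p2p))  # 1: both sets pass
print(score_instance(f2p, p2p, f2p))        # 0: a regression-guard test is missing
```

Using set inclusion mirrors the `⊆` notation in the scoring rule, so extra passing tests outside the two lists never affect the score.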
Provider:
banyaaiofficial