aligncast/speccast-benchmark

Name: aligncast/speccast-benchmark
Creator: aligncast
Published: 2026-04-09 03:07:23
License: not specified on the listing page (the dataset card metadata declares cc-by-4.0)

Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/aligncast/speccast-benchmark


Description:

---
language:
- en
license: cc-by-4.0
task_categories:
- text-classification
tags:
- calibration
- code-analysis
- alignment
- synthetic
size_categories:
- 10K<n<100K
---

# AlignCast SpecCast Benchmark

A synthetic benchmark for calibrated code-correctness forecasting. Each example contains a specification, an implementation, and a test suite. The task is to predict the probability that the implementation passes the test suite **without executing the code**.

Ground-truth labels are produced by an offline oracle that executes each test suite in a sandboxed environment.

## Purpose

This dataset supports research on:

- **Calibration** of LLM-based code judgements
- Detection of subtle behavioral deviations (off-by-one errors, predicate inversions, boundary swaps, etc.)
- Evaluation under distribution shift (held-out templates and fault families)

## Columns

| Column             | Type        | Description                                                                                               |
| ------------------ | ----------- | --------------------------------------------------------------------------------------------------------- |
| `id`               | string      | Unique example identifier (`ex00001` … `exN`)                                                             |
| `template_id`      | ClassLabel  | Generator template (10 templates, 22 template×fault pairs in v1)                                          |
| `fault_family`     | ClassLabel  | Category of injected fault (9 families)                                                                   |
| `spec`             | string      | Natural-language specification                                                                            |
| `code`             | string      | Python implementation (may be correct or subtly faulty)                                                   |
| `tests`            | string      | Python test suite                                                                                         |
| `declared_correct` | int         | Generator intent: `1` = correct variant, `0` = faulty                                                     |
| `fault_manifested` | bool/null   | `null` for correct variants; `true` if tests catch the fault; `false` if fault is present but tests pass  |
| `seed`             | int         | Global generator seed                                                                                     |
| `label`            | ClassLabel  | 0 = fail, 1 = pass (ground truth from oracle execution)                                                   |
| `num_failures`     | int32       | Number of failing tests                                                                                   |
| `runtime_ms`       | float32     | Oracle execution time in milliseconds                                                                     |
| `oracle_error`     | string/null | Error message from oracle execution, if any                                                               |

### Template IDs

`clamp`, `count_vowels`, `find_first_gt`, `majority_vote`, `median3`, `parse_int_list`, `rotate_left`, `second_largest`, `sum_even`, `unique_sorted`

### Fault Families

`boundary_swap`, `case_handling`, `comparison_weakening`, `missing_dedup`, `missing_tie_check`, `off_by_one`, `predicate_inversion`, `silent_error_masking`, `wrong_selection`

## Canonical Files

| File                       | Examples |
| -------------------------- | -------- |
| `bench_10_seed42.jsonl`    | 10       |
| `bench_100_seed42.jsonl`   | 100      |
| `bench_1000_seed42.jsonl`  | 1,000    |
| `bench_10000_seed42.jsonl` | 10,000   |

All files are generated deterministically with `seed=42`. Train/validation/test splits are not pre-assigned; researchers should create splits stratified by `template_id` and `fault_family` to support IID, template-held-out OOD, and fault-family-held-out OOD evaluation.

See `benchgen_spec_canonical.md` for the normative benchmark definition and split guidance.

## Generation

Examples are produced by a deterministic generator (`seed=42`). The v1.2 generator uses a sparse matrix of 22 (template, fault\_family) pairs across 10 templates and 9 fault families. The generator cycles through all 22 pairs, guaranteeing proportional coverage before shuffling. Whether each example is correct or faulty is determined by the per-example seeded RNG (≈50% each), so every (template, fault\_family) pair appears with both labels.

See `benchgen_spec_canonical.md` for the normative benchmark definition and complete reproducibility specification.

The oracle labels each example by executing the test suite with a 500ms timeout.

## Citation

If you use this dataset, please cite the AlignCast project:

```
@misc{aligncast2026,
  title={AlignCast: Calibrated Forecasting of Code Correctness},
  url={https://github.com/aligncast/aligncast},
  year={2026}
}
```
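
## Example: held-out splits and scoring forecasts

The dataset card asks researchers to build their own splits for template-held-out and fault-family-held-out evaluation, and frames the task as probability forecasting. The following is a minimal sketch of both steps, not the project's own tooling. The helper names (`load_jsonl`, `held_out_split`, `brier_score`) are hypothetical, and the toy rows are illustrative rather than taken from the dataset. It assumes `template_id` and `fault_family` are stored as string names in the JSONL files; if they are stored as integer ClassLabel indices, pass integers instead. The Brier score is used here as one common calibration-sensitive metric; the card itself does not prescribe a metric.

```python
import json


def load_jsonl(path):
    """Read a JSONL file (one JSON object per line) into a list of dicts."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def held_out_split(rows, column, held_out_values):
    """Split rows into (in_distribution, held_out) by the value of `column`.

    Use column="template_id" for template-held-out OOD evaluation, or
    column="fault_family" for fault-family-held-out OOD evaluation.
    """
    held = set(held_out_values)
    in_dist = [r for r in rows if r[column] not in held]
    ood = [r for r in rows if r[column] in held]
    return in_dist, ood


def brier_score(probs, labels):
    """Mean squared error between predicted pass probabilities and 0/1 labels."""
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and equal in length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# Toy rows standing in for records from a file such as bench_100_seed42.jsonl.
# The (template, fault family, label) combinations are illustrative only.
rows = [
    {"id": "ex00001", "template_id": "clamp", "fault_family": "off_by_one", "label": 1},
    {"id": "ex00002", "template_id": "median3", "fault_family": "wrong_selection", "label": 0},
    {"id": "ex00003", "template_id": "clamp", "fault_family": "boundary_swap", "label": 0},
    {"id": "ex00004", "template_id": "sum_even", "fault_family": "predicate_inversion", "label": 1},
]

# Hold out every example generated from the `clamp` template.
in_dist, ood = held_out_split(rows, "template_id", {"clamp"})

# Score a constant 0.5 forecast on the held-out rows.
ood_labels = [r["label"] for r in ood]
print(brier_score([0.5] * len(ood_labels), ood_labels))
```

With a real file, `rows = load_jsonl("bench_100_seed42.jsonl")` would replace the toy list. Holding out whole templates or fault families, rather than sampling rows at random, keeps the held-out set free of any template or fault family seen during training or prompt tuning.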

Provider:

aligncast
