aligncast/speccast-benchmark
Source: Hugging Face. Updated 2026-04-09. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/aligncast/speccast-benchmark
Resource description:
---
language:
- en
license: cc-by-4.0
task_categories:
- text-classification
tags:
- calibration
- code-analysis
- alignment
- synthetic
size_categories:
- 10K<n<100K
---
# AlignCast SpecCast Benchmark
A synthetic benchmark for calibrated code-correctness forecasting. Each example
contains a specification, an implementation, and a test suite. The task is to
predict the probability that the implementation passes the test suite **without
executing the code**.
Ground-truth labels are produced by an offline oracle that executes each test
suite in a sandboxed environment.
## Purpose
This dataset supports research on:
- **Calibration** of LLM-based code judgements
- Detection of subtle behavioral deviations (off-by-one errors, predicate
inversions, boundary swaps, etc.)
- Evaluation under distribution shift (held-out templates and fault families)
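Calibration of predicted pass probabilities is commonly summarized with the Brier score and the expected calibration error (ECE). The following is a minimal sketch of both, not part of the benchmark's own tooling; the function names and the choice of 10 equal-width bins are assumptions made for illustration.

```python
from typing import Sequence


def _validate(probs: Sequence[float], labels: Sequence[int]) -> None:
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and equal length")
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("probabilities must lie in [0, 1]")


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between predicted pass probability and 0/1 label."""
    _validate(probs, labels)
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """ECE over equal-width bins (bin count is an assumed default).

    For each bin, compare the observed pass rate with the mean predicted
    probability, then average the gaps weighted by bin size.
    """
    _validate(probs, labels)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        pass_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / len(probs)) * abs(pass_rate - mean_prob)
    return ece
```

Here `labels` corresponds to the `label` column (1 = pass) and `probs` to a model's predicted probability of passing.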
## Columns
| Column | Type | Description |
| ------------------- | ---------- | -------------------------------------------------------- |
| `id` | string | Unique example identifier (`ex00001` … `exN`) |
| `template_id` | ClassLabel | Generator template (10 templates, 22 template×fault pairs in v1) |
| `fault_family` | ClassLabel | Category of injected fault (9 families) |
| `spec` | string | Natural-language specification |
| `code` | string | Python implementation (may be correct or subtly faulty) |
| `tests` | string | Python test suite |
| `declared_correct` | int | Generator intent: `1` = correct variant, `0` = faulty |
| `fault_manifested` | bool/null | `null` for correct variants; `true` if tests catch the fault; `false` if fault is present but tests pass |
| `seed` | int | Global generator seed |
| `label` | ClassLabel | `0` = fail, `1` = pass (ground truth from oracle execution) |
| `num_failures` | int32 | Number of failing tests |
| `runtime_ms` | float32 | Oracle execution time in milliseconds |
| `oracle_error` | string/null | Error message from oracle execution, if any |
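The column descriptions imply some relationships between `declared_correct`, `fault_manifested`, `label`, and `num_failures`. The sketch below checks them for a single record. It rests on assumptions the card does not state as guarantees: that `label` is stored as the integer 0 or 1, that a manifested fault means `label` is 0 and an unmanifested fault means `label` is 1, and that a passing label means zero failing tests. The helper name `check_record` is hypothetical.

```python
from __future__ import annotations


def check_record(rec: dict) -> list[str]:
    """Return a list of inconsistencies found in one example (empty if none)."""
    problems: list[str] = []
    declared = rec["declared_correct"]
    manifested = rec["fault_manifested"]
    label = rec["label"]

    if declared == 1 and manifested is not None:
        problems.append("correct variant should have fault_manifested = null")
    if declared == 0:
        if manifested is None:
            problems.append("faulty variant should have fault_manifested set")
        elif manifested and label != 0:
            # Assumed: a fault caught by the tests yields a failing label.
            problems.append("fault manifested but label is not 0")
        elif not manifested and label != 1:
            # Assumed: a fault the tests miss yields a passing label.
            problems.append("fault not manifested but label is not 1")
    if label == 1 and rec["num_failures"] != 0:
        # Assumed: a passing suite has no failing tests.
        problems.append("label is 1 but num_failures is non-zero")
    return problems
```

Examples with `fault_manifested = false` are faulty implementations that the test suite does not catch, so their `label` and `declared_correct` values disagree by design.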
### Template IDs
`clamp`, `count_vowels`, `find_first_gt`, `majority_vote`, `median3`,
`parse_int_list`, `rotate_left`, `second_largest`, `sum_even`, `unique_sorted`
### Fault Families
`boundary_swap`, `case_handling`, `comparison_weakening`, `missing_dedup`,
`missing_tie_check`, `off_by_one`, `predicate_inversion`,
`silent_error_masking`, `wrong_selection`
## Canonical Files
| File | Examples |
| --------------------------- | -------- |
| `bench_10_seed42.jsonl` | 10 |
| `bench_100_seed42.jsonl` | 100 |
| `bench_1000_seed42.jsonl` | 1,000 |
| `bench_10000_seed42.jsonl` | 10,000 |
All files are generated deterministically with `seed=42`. Train/validation/test
splits are not pre-assigned; researchers should create splits stratified by
`template_id` and `fault_family` to support IID, template-held-out OOD, and
fault-family-held-out OOD evaluation. See `benchgen_spec_canonical.md` for the
normative benchmark definition and split guidance.
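The split guidance above can be sketched as follows. This is an illustration only, not the procedure from `benchgen_spec_canonical.md`. It assumes each JSONL line is one JSON object carrying the columns listed earlier, with `template_id` and `fault_family` stored as strings; the helper names, the 20% test fraction, and the split seed are hypothetical choices.

```python
import json
import random
from collections import defaultdict


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def stratified_split(records, test_fraction=0.2, seed=0):
    """IID split that keeps each (template_id, fault_family) group proportional."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["template_id"], rec["fault_family"])].append(rec)
    rng = random.Random(seed)
    train, test = [], []
    for key in sorted(groups):  # sorted for a reproducible group order
        group = list(groups[key])
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def holdout_split(records, key, held_out):
    """OOD split: every record whose `key` value is in `held_out` goes to OOD."""
    held = set(held_out)
    in_dist = [rec for rec in records if rec[key] not in held]
    ood = [rec for rec in records if rec[key] in held]
    return in_dist, ood
```

For example, `holdout_split(load_jsonl("bench_1000_seed42.jsonl"), "template_id", {"median3"})` would reserve one template for OOD evaluation, and passing `"fault_family"` as the key gives the fault-family-held-out variant.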
## Generation
Examples are produced by a deterministic generator (`seed=42`). The v1.2 generator
uses a sparse matrix of 22 (template, fault\_family) pairs across 10 templates and
9 fault families. It cycles through all 22 pairs before shuffling, so each pair
receives proportional coverage. Whether each example is correct or faulty is
decided by a per-example seeded RNG (≈50% each), so in files with enough examples
every (template, fault\_family) pair appears with both labels. See
`benchgen_spec_canonical.md` for the normative benchmark definition and complete
reproducibility specification.
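The coverage scheme described above (cycle through the pairs, flip a per-example seeded coin for correctness, then shuffle) can be sketched as below. This is not the actual generator. The three pairs are hypothetical placeholders for the 22 real ones, and the way the per-example seed is derived from the global seed is an assumption.

```python
import random

# Hypothetical placeholder pairs; the real sparse matrix has 22 entries.
PAIRS = [
    ("clamp", "boundary_swap"),
    ("sum_even", "predicate_inversion"),
    ("rotate_left", "off_by_one"),
]


def plan_examples(n, seed=42, pairs=PAIRS):
    """Assign a (template, fault family) pair and a correctness flag to n examples."""
    plan = []
    for i in range(n):
        template_id, fault_family = pairs[i % len(pairs)]  # round-robin coverage
        # Assumed derivation of a per-example RNG from the global seed.
        example_rng = random.Random(seed * 1_000_003 + i)
        declared_correct = 1 if example_rng.random() < 0.5 else 0
        plan.append(
            {
                "template_id": template_id,
                "fault_family": fault_family,
                "declared_correct": declared_correct,
                "seed": seed,
            }
        )
    random.Random(seed).shuffle(plan)  # shuffle only after coverage is fixed
    return plan
```

Because the pair is assigned before shuffling, pair counts differ by at most one for any `n`, while the correct/faulty balance within a pair is only approximately even.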
The oracle labels each example by executing the test suite with a 500 ms timeout.
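A labelling step of this kind can be approximated with a subprocess and a timeout. The sketch below is a simplified stand-in, not the oracle's implementation. It assumes the test suite is a plain Python script of `assert` statements that can be appended to the code, it treats a timeout as a failure, and it provides no real sandboxing. The function name `run_oracle` and the error strings are hypothetical.

```python
import subprocess
import sys


def run_oracle(code: str, tests: str, timeout_s: float = 0.5) -> dict:
    """Run code plus tests in a child interpreter and derive a pass/fail label."""
    program = code + "\n\n" + tests + "\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", program],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,  # note: includes interpreter start-up time
        )
    except subprocess.TimeoutExpired:
        # Assumed convention: a timeout counts as a failing example.
        return {"label": 0, "oracle_error": "timeout"}
    if proc.returncode == 0:
        return {"label": 1, "oracle_error": None}
    stderr_lines = proc.stderr.strip().splitlines()
    message = stderr_lines[-1] if stderr_lines else "non-zero exit"
    return {"label": 0, "oracle_error": message}
```

A real sandbox would additionally restrict file system, network, and memory access, which a bare subprocess does not do.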
## Citation
If you use this dataset, please cite the AlignCast project:
```
@misc{aligncast2026,
title={AlignCast: Calibrated Forecasting of Code Correctness},
url={https://github.com/aligncast/aligncast},
year={2026}
}
```
Provider:
aligncast



