henggg/paradigm-bench
Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/henggg/paradigm-bench
Resource description:
---
license: mit
task_categories:
- question-answering
- text-generation
language:
- en
pretty_name: PARADIGM Bench
size_categories:
- n<1K
---
# PARADIGM Benchmark Suite
This dataset contains the sampled subsets of the 10 benchmarks used in the paper
**"Select-then-Solve: Paradigm Routing as Inference-Time Optimization for Language Agents"**
(under review at CoLM 2026).
## Overview
We evaluate six inference-time reasoning paradigms (Direct, CoT, ReAct, Plan-Execute,
Reflection, ReCode) on this fixed sampled set across four frontier LLMs, yielding
roughly 18,000 task-paradigm-model combinations (761 tasks × 6 paradigms × 4 models = 18,264).
## Datasets
| Dataset | Domain | # Examples | Notes |
|---------|--------|-----------|-------|
| humaneval | — | 100 | seed=42 |
| math500 | — | 100 | seed=42 |
| aime | — | 60 | seed=42 |
| hotpotqa | — | 100 | seed=42 |
| nq | — | 100 | seed=42 |
| mmlu | — | 100 | seed=42 |
| hle | — | 50 | seed=42 |
| gaia | — | 50 | seed=42 |
| tau_bench | — | 51 | seed=42 |
| seal | — | 50 | seed=42 |
**Total examples per model-paradigm pair: 761**
## Sampling Protocol
- For large legacy benchmarks (HumanEval, MATH500, HotpotQA, NQ, MMLU), we sample
a fixed subset using `random.Random(42).sample(tasks, sample_size)`.
- For smaller benchmarks (AIME, HLE, GAIA, SEAL, $\tau$-bench), we use the full
curated set or near-full samples.
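The fixed-seed sampling described above can be sketched as follows. This is a minimal illustration, not the authors' actual script: the helper name `sample_tasks`, the placeholder task list, and the fallback that returns the full set when `sample_size` is not smaller than the benchmark are all assumptions. Reproducing the exact released subset would also require loading the tasks in the same order as the authors did.

```python
import random


def sample_tasks(tasks, sample_size, seed=42):
    """Return a reproducible subset of `tasks` (hypothetical helper).

    Assumption: when the benchmark has no more than `sample_size` tasks,
    the full set is returned unchanged.
    """
    if sample_size >= len(tasks):
        return list(tasks)
    # A dedicated Random instance keeps the draw independent of global state.
    return random.Random(seed).sample(tasks, sample_size)


# Placeholder task list standing in for a large benchmark.
tasks = [{"id": f"task-{i}"} for i in range(500)]

subset = sample_tasks(tasks, 100)
repeat = sample_tasks(tasks, 100)

# The same seed and the same input order give the same subset.
print(len(subset), subset == repeat)
```

Using `random.Random(42)` rather than `random.seed(42)` means other code that touches the global random state cannot change which tasks are selected.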
## Format
Each line in `<dataset>/test.jsonl` is a JSON object with:
- `id`: unique task identifier
- `question`: the prompt / question text
- `ground_truth`: the reference answer
- `dataset`: dataset name (for cross-validation)
- `metadata`: dataset-specific extra fields (e.g., `entry_point` for HumanEval,
`test` cases, choices for MMLU, etc.)
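A small reader for this layout might look like the sketch below. It only assumes the five top-level fields listed above; the function name `parse_jsonl`, the error messages, and the example record (including its `id` and `metadata` values) are hypothetical and not taken from the dataset.

```python
import json

# Top-level fields listed in the Format section.
REQUIRED_FIELDS = ("id", "question", "ground_truth", "dataset", "metadata")


def parse_jsonl(lines, expected_dataset=None):
    """Parse JSONL lines and check that each record has the required fields."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        # Use the `dataset` field as a cross-check against the expected name.
        if expected_dataset is not None and record["dataset"] != expected_dataset:
            raise ValueError(
                f"line {line_no}: expected dataset {expected_dataset!r}, "
                f"got {record['dataset']!r}"
            )
        records.append(record)
    return records


# Hypothetical record, for illustration only.
example_line = json.dumps({
    "id": "example-0",
    "question": "Write a function that adds two integers.",
    "ground_truth": "def add(a, b):\n    return a + b",
    "dataset": "humaneval",
    "metadata": {"entry_point": "add"},
})

records = parse_jsonl([example_line], expected_dataset="humaneval")
print(records[0]["id"], records[0]["metadata"]["entry_point"])
```

To read a real split, pass an open file handle instead of a list, for example `parse_jsonl(open("humaneval/test.jsonl", encoding="utf-8"), "humaneval")`.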
## Citation
```bibtex
@inproceedings{paradigm2026,
title={Select-then-Solve: Paradigm Routing as Inference-Time Optimization for Language Agents},
author={Anonymous},
booktitle={Conference on Language Modeling},
year={2026}
}
```
Provider:
henggg



