
henggg/paradigm-bench

Hugging Face · Updated 2026-04-09 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/henggg/paradigm-bench
Resource description:
---
license: mit
task_categories:
- question-answering
- text-generation
language:
- en
pretty_name: PARADIGM Bench
size_categories:
- n<1K
---

# PARADIGM Benchmark Suite

This dataset contains the sampled subset of 10 benchmarks used in the paper **"Select-then-Solve: Paradigm Routing as Inference-Time Optimization for Language Agents"** (under review at CoLM 2026).

## Overview

We evaluate six inference-time reasoning paradigms (Direct, CoT, ReAct, Plan-Execute, Reflection, ReCode) on this fixed sampled set across four frontier LLMs, yielding roughly 18,000 task-paradigm-model combinations.

## Datasets

| Dataset | Domain | # Examples | Notes |
|---------|--------|-----------|-------|
| humaneval | — | 100 | seed=42 |
| math500 | — | 100 | seed=42 |
| aime | — | 60 | seed=42 |
| hotpotqa | — | 100 | seed=42 |
| nq | — | 100 | seed=42 |
| mmlu | — | 100 | seed=42 |
| hle | — | 50 | seed=42 |
| gaia | — | 50 | seed=42 |
| tau_bench | — | 51 | seed=42 |
| seal | — | 50 | seed=42 |

**Total examples per model-paradigm pair: 761**

## Sampling Protocol

- For large legacy benchmarks (HumanEval, MATH500, HotpotQA, NQ, MMLU), we sample a fixed subset using `random.Random(42).sample(tasks, sample_size)`.
- For smaller benchmarks (AIME, HLE, GAIA, SEAL, $\tau$-bench), we use the full curated set or near-full samples.

## Format

Each line in `<dataset>/test.jsonl` is a JSON object with:

- `id`: unique task identifier
- `question`: the prompt / question text
- `ground_truth`: the reference answer
- `dataset`: dataset name (for cross-validation)
- `metadata`: dataset-specific extra fields (e.g., `entry_point` for HumanEval, `test` cases, choices for MMLU, etc.)

## Citation

```bibtex
@inproceedings{paradigm2026,
  title={Select-then-Solve: Paradigm Routing as Inference-Time Optimization for Language Agents},
  author={Anonymous},
  booktitle={Conference on Language Modeling},
  year={2026}
}
```
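The sampling protocol, record format, and example counts described in the card can be sketched in Python as follows. This is a minimal illustration, not the authors' released tooling: the helper names `sample_subset` and `parse_jsonl`, the constants `REQUIRED_FIELDS` and `EXAMPLE_COUNTS`, and the demo record are all assumptions introduced here. Only the `random.Random(42).sample(tasks, sample_size)` call, the five field names, and the per-dataset counts come from the card. Note that reproducing the exact published subset would also require the same ordering of `tasks` that the authors used, which the card does not specify.

```python
import json
import random

# Field names listed in the card's Format section.
REQUIRED_FIELDS = ("id", "question", "ground_truth", "dataset", "metadata")

# Per-dataset example counts copied from the card's Datasets table.
EXAMPLE_COUNTS = {
    "humaneval": 100, "math500": 100, "aime": 60, "hotpotqa": 100,
    "nq": 100, "mmlu": 100, "hle": 50, "gaia": 50, "tau_bench": 51,
    "seal": 50,
}


def sample_subset(tasks, sample_size, seed=42):
    """Draw a fixed subset with the call quoted in the card.

    The result is deterministic for a given seed and a given ordering
    of `tasks`; a different input order yields a different subset.
    """
    return random.Random(seed).sample(tasks, sample_size)


def parse_jsonl(lines):
    """Parse JSONL lines and check each record has the documented fields."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        records.append(record)
    return records


# Hypothetical record, made up for illustration (not taken from the dataset).
demo_line = json.dumps({
    "id": "demo-0",
    "question": "What is 2 + 2?",
    "ground_truth": "4",
    "dataset": "math500",
    "metadata": {},
})
demo_records = parse_jsonl([demo_line])

# Examples per model-paradigm pair, then across 6 paradigms and 4 models.
total_per_pair = sum(EXAMPLE_COUNTS.values())
total_runs = total_per_pair * 6 * 4

print(total_per_pair)  # 761
print(total_runs)      # 18264
```

The arithmetic at the end checks the card's figures: the table sums to 761 examples per model-paradigm pair, and 761 × 6 paradigms × 4 models gives 18,264, consistent with the stated "roughly 18,000" combinations. To read a real split, one could pass an open file handle for `<dataset>/test.jsonl` to `parse_jsonl`, since a text file iterates line by line.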
Provider:
henggg