henggg/paradigm-bench

Name: henggg/paradigm-bench
Creator: henggg
Published: 2026-04-09 06:07:16
License: MIT

Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/henggg/paradigm-bench


Description:

---
license: mit
task_categories:
- question-answering
- text-generation
language:
- en
pretty_name: PARADIGM Bench
size_categories:
- n<1K
---

# PARADIGM Benchmark Suite

This dataset contains the sampled subset of 10 benchmarks used in the paper **"Select-then-Solve: Paradigm Routing as Inference-Time Optimization for Language Agents"** (under review at CoLM 2026).

## Overview

We evaluate six inference-time reasoning paradigms (Direct, CoT, ReAct, Plan-Execute, Reflection, ReCode) on this fixed sampled set across four frontier LLMs, yielding roughly 18,000 task-paradigm-model combinations.

## Datasets

| Dataset | Domain | # Examples | Notes |
|---------|--------|-----------|-------|
| humaneval | — | 100 | seed=42 |
| math500 | — | 100 | seed=42 |
| aime | — | 60 | seed=42 |
| hotpotqa | — | 100 | seed=42 |
| nq | — | 100 | seed=42 |
| mmlu | — | 100 | seed=42 |
| hle | — | 50 | seed=42 |
| gaia | — | 50 | seed=42 |
| tau_bench | — | 51 | seed=42 |
| seal | — | 50 | seed=42 |

**Total examples per model-paradigm pair: 761**

## Sampling Protocol

- For large legacy benchmarks (HumanEval, MATH500, HotpotQA, NQ, MMLU), we sample a fixed subset using `random.Random(42).sample(tasks, sample_size)`.
- For smaller benchmarks (AIME, HLE, GAIA, SEAL, $\tau$-bench), we use the full curated set or near-full samples.

## Format

Each line in `<dataset>/test.jsonl` is a JSON object with:

- `id`: unique task identifier
- `question`: the prompt / question text
- `ground_truth`: the reference answer
- `dataset`: dataset name (for cross-validation)
- `metadata`: dataset-specific extra fields (e.g., `entry_point` for HumanEval, `test` cases, choices for MMLU, etc.)

## Citation

```bibtex
@inproceedings{paradigm2026,
  title={Select-then-Solve: Paradigm Routing as Inference-Time Optimization for Language Agents},
  author={Anonymous},
  booktitle={Conference on Language Modeling},
  year={2026}
}
```
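
## Example: sampling and loading (illustrative sketch)

The sampling protocol and record format described in this card can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the helper names `sample_subset`, `load_jsonl`, and `missing_fields` are hypothetical, and returning the full list when `sample_size` is not smaller than the task count is an assumption about how the "full curated set" case might be handled. Only the call `random.Random(42).sample(tasks, sample_size)` and the five field names come from the card itself.

```python
import json
import random
from pathlib import Path

# Field names listed in the Format section of the card.
REQUIRED_FIELDS = ("id", "question", "ground_truth", "dataset", "metadata")


def sample_subset(tasks, sample_size, seed=42):
    """Draw a reproducible subset with the seeded call quoted in the card.

    Assumption: when sample_size >= len(tasks), the full list is returned.
    """
    tasks = list(tasks)
    if sample_size >= len(tasks):
        return tasks
    return random.Random(seed).sample(tasks, sample_size)


def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def missing_fields(record):
    """Return the documented fields that are absent from a record."""
    return [key for key in REQUIRED_FIELDS if key not in record]
```

After downloading the repository, a call such as `load_jsonl("humaneval/test.jsonl")` should return the records for one benchmark, following the `<dataset>/test.jsonl` layout described above; `missing_fields` can then be used to spot records that do not match the documented schema.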

Provider:

henggg
