ibm-research/STaD

Name: ibm-research/STaD
Creator: ibm-research
Published: 2026-04-22 12:13:08
License: no description provided on the listing (the dataset card states apache-2.0)

Source: Hugging Face | Updated: 2026-04-22 | Indexed: 2026-05-10

Download link:

https://hf-mirror.com/datasets/ibm-research/STaD


Resource description:

---
license: apache-2.0
---

# Dataset Card for STaD Scaffolded Benchmarks

## Dataset Summary

The **STaD Scaffolded Benchmarks** are diagnostic evaluation datasets introduced in the paper *"STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs"* (ACL Findings 2026). These benchmarks go beyond measuring whether a model answers correctly: they reveal **where and why** a model fails by breaking problems into sub-tasks and providing targeted scaffolding at specific reasoning steps.

Each sample includes:

- The original benchmark problem and answer
- A decomposition into **sub-tasks**, each mapped to a specific reasoning skill
- **Scaffolded variations** of the original problem (versions where partial solutions are provided up to a certain step), enabling pinpoint diagnosis of reasoning breakdowns
- **Decomposed sub-questions** that isolate individual reasoning steps

The benchmarks cover three source datasets: GSM8K, MATH-Hard, and Tree-of-Thought (ToT) Arithmetic.

---

## Datasets

| File | Source Benchmark | # Samples | Description |
|------|-----------------|-----------|-------------|
| `gsm8k_scaffolded.jsonl` | GSM8K | 1,176 | Grade school math word problems with scaffolded variations |
| `math_hard_scaffolded.jsonl` | MATH (Hard) | 773 | Competition-level math problems across 6 categories |
| `tot_arithmetic_scaffolded.jsonl` | ToT Arithmetic | 1,448 | Multi-step arithmetic problems from Tree-of-Thought |

**Total: 3,397 samples**

---

## Data Fields

All three files share the following schema:

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique identifier for the sample |
| `question` | string | Original problem statement |
| `answer` | string | Ground truth final answer |
| `solution` | string | Full reference solution (MATH-Hard only) |
| `category` | string | Problem category (MATH-Hard only; e.g., Algebra, Number Theory) |
| `sub-task` | list | List of reasoning steps, each with a `segment` (description) and `skill` (reasoning skill label) |
| `sub-task-answer` | list | Step-by-step answers with `explanation` and `answer` for each sub-task |
| `scaffolding` | list | Scaffolded problem variants; each provides partial solutions up to step *k*, leaving the rest for the model |
| `scaffolding_verification` | list | Verification data for scaffolded variants |
| `decompositions` | list | Sub-questions that decompose the original problem into individual reasoning steps |

---

## Intended Uses

- **Diagnostic evaluation**: Identify *where* in the reasoning chain a model breaks down, not just whether it fails
- **Skill gap analysis**: Pinpoint which reasoning skills (e.g., modular arithmetic, multi-step algebra) a model lacks
- **Compositional reasoning research**: Study how models handle problems requiring multiple skills in combination
- **Benchmarking**: Compare LLMs at a fine-grained, sub-task level rather than by aggregate accuracy

---

## How to Use

To run evaluations using this dataset, refer to the code and instructions in the GitHub repository:

**[https://github.com/ibm-granite/scaffolded-task-design](https://github.com/ibm-granite/scaffolded-task-design)**

```python
from datasets import DatasetDict, load_dataset


def load_STaD():
    # Paths are relative to the working directory. Note that the Datasets
    # table lists the files as `*_scaffolded.jsonl`; adjust these paths to
    # match your local copy.
    data_files = {
        "tot_arithmetic": "tot_arithmetic/tot_arithmetic.jsonl",
        "gsm8k": "gsm8k/gsm8k.jsonl",
        "math_hard": "math_hard/math_hard.jsonl",
    }
    # Load each JSONL file separately and expose it as its own split.
    return DatasetDict({
        split: load_dataset("json", data_files={split: path})[split]
        for split, path in data_files.items()
    })
```

## Limitations

- **Evaluation only**: With ~3K samples, these datasets are designed for evaluation and diagnostic purposes, not large-scale training.
- **Math domain**: The current benchmarks focus on mathematical reasoning. Extension to other domains is left for future work.
- **Scaffolding assumes step ordering**: The scaffolded variations assume a fixed sub-task order; alternative decompositions may yield different results.

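---

## Example: Skill-Level Aggregation

The skill gap analysis listed under Intended Uses can be sketched as follows. This is a minimal illustration, not the authors' evaluation code (see the GitHub repository for that). It assumes each `sub-task` entry is a dict with `segment` and `skill` keys, as the Data Fields table describes. The helper names `skill_counts` and `per_skill_accuracy`, the `step_correct` mapping, and the toy records are hypothetical and not part of the dataset or its tooling.

```python
from collections import Counter


def skill_counts(records):
    """Count how often each skill label appears across all sub-tasks."""
    counts = Counter()
    for record in records:
        for step in record.get("sub-task", []):
            counts[step["skill"]] += 1
    return counts


def per_skill_accuracy(records, step_correct):
    """Aggregate per-step correctness into an accuracy per skill.

    `step_correct` (hypothetical) maps a sample id to a list of booleans,
    one per sub-task, produced by your own evaluation run.
    """
    totals, hits = Counter(), Counter()
    for record in records:
        flags = step_correct.get(record["id"], [])
        for step, ok in zip(record.get("sub-task", []), flags):
            totals[step["skill"]] += 1
            hits[step["skill"]] += int(ok)
    return {skill: hits[skill] / totals[skill] for skill in totals}


# Toy records that mimic the documented schema; not taken from the dataset.
toy_records = [
    {"id": "a", "sub-task": [
        {"segment": "add the apples", "skill": "addition"},
        {"segment": "split them evenly", "skill": "division"},
    ]},
    {"id": "b", "sub-task": [
        {"segment": "sum the costs", "skill": "addition"},
    ]},
]
toy_results = {"a": [True, False], "b": [True]}

print(skill_counts(toy_records))
print(per_skill_accuracy(toy_records, toy_results))
```

With real data, `records` could be any iterable of rows (for example, one split of the `DatasetDict` returned by `load_STaD()`), and `step_correct` would come from scoring a model on the decomposed sub-questions.
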

---

## How to Cite

If you use this dataset in your research, please cite:

```bibtex
@inproceedings{an2026stad,
  title         = {STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs},
  author        = {An, Sungeun and Kadhe, Swanand Ravindra and Thakur, Shailja and DeLuca, Chad and Patel, Hima},
  year          = {2026},
  eprint        = {2604.18177},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2604.18177}
}
```
