dart-math-pool-gsm8k-query-info
Source: ModelScope community (魔搭社区) | Updated: 2025-12-05 | Indexed: 2025-02-22
Download link: https://modelscope.cn/datasets/hkust-nlp/dart-math-pool-gsm8k-query-info
Description:
> [!NOTE]
> This dataset is the **synthesis information of queries** from the **GSM8K** training set,
> such as the numbers of raw/correct samples of each synthesis job.
> Usually used with `dart-math-pool-gsm8k`.
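
As a rough illustration of how per-job raw/correct counts like these might be used, the sketch below merges the counts of several synthesis jobs per query and derives an empirical pass rate. The record layout `(query, n_raw, n_correct)` and the function names are assumptions made for this example, not the dataset's actual schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (query, n_raw, n_correct) for one synthesis job.
JobRecord = Tuple[str, int, int]


def aggregate_counts(job_records: Iterable[JobRecord]) -> Dict[str, Tuple[int, int]]:
    """Sum raw and correct sample counts over all synthesis jobs of each query."""
    totals = defaultdict(lambda: [0, 0])
    for query, n_raw, n_correct in job_records:
        totals[query][0] += n_raw
        totals[query][1] += n_correct
    return {query: (raw, correct) for query, (raw, correct) in totals.items()}


def empirical_pass_rates(totals: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Fraction of sampled responses that were correct; queries with no samples are skipped."""
    return {query: correct / raw for query, (raw, correct) in totals.items() if raw > 0}


if __name__ == "__main__":
    # Toy records, not real data from the dataset.
    records = [("q1", 10, 9), ("q1", 10, 7), ("q2", 40, 1)]
    print(empirical_pass_rates(aggregate_counts(records)))
```

A low empirical pass rate marks a query as hard for the synthesis model, which is the kind of signal a difficulty-aware sampling strategy needs.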
# 🎯 DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
📝 [Paper@arXiv](https://arxiv.org/abs/2407.13690) | 🤗 [Datasets&Models@HF](https://huggingface.co/collections/hkust-nlp/dart-math-665704599b35de59f8fdf6c1) | 🐱 [Code@GitHub](https://github.com/hkust-nlp/dart-math)
🐦 [Thread@X(Twitter)](https://x.com/tongyx361/status/1811413243350454455) | 🐶 [中文博客@知乎](https://zhuanlan.zhihu.com/p/708371895) | 📊 [Leaderboard@PapersWithCode](https://paperswithcode.com/paper/dart-math-difficulty-aware-rejection-tuning#results) | 📑 [BibTeX](https://github.com/hkust-nlp/dart-math?tab=readme-ov-file#citation)
## Datasets: `DART-Math`
`DART-Math` datasets are the **state-of-the-art** and **data-efficient** **open-source** instruction tuning datasets for mathematical reasoning.
<style>
.container {
display: flex;
justify-content: space-around;
}
.container img {
max-width: 45%;
height: auto;
}
.caption {
text-align: center;
font-size: small;
margin-top: 10px;
}
</style>
<div class="container">
<img src="https://tongyx361.github.io/assets/dart-math/main-results.png" alt="Main results averaged on 2 in-domain and 4 challenging out-of-domain mathematical reasoning benchmarks.">
<img src="https://tongyx361.github.io/assets/dart-math/main-nresp-vs-query.png" alt="Number of responses v.s. query descending in difficulty in DART-Math datasets and similar-sized VRT baseline.">
</div>
<div class="caption">
Figure 1: <strong>Left:</strong> Average accuracy on 6 mathematical benchmarks. We compare with models fine-tuned on the best public instruction tuning datasets for mathematical problem-solving:
MetaMath <a href="https://openreview.net/forum?id=N8N0hgNDRt">(Yu et al., 2024)</a> with 395K
examples,
MMIQC <a href="https://arxiv.org/abs/2401.09003">(Liu et al., 2024a)</a> with 2.3 million examples,
as well as vanilla rejection tuning (VRT) with 590K examples.
Both <em>DART-Math (Uniform)</em> and <em>DART-Math (Prop2Diff)</em> use 590K training examples.
<strong>Right:</strong> Number of responses for each query descending by difficulty across 3 synthesis strategies.
Queries are from the MATH training split <a href="https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html">(Hendrycks et al., 2021)</a>.
VRT is the baseline biased towards easy queries, while <em>Uniform</em> and <em>Prop2Diff</em> are proposed in this work to balance and bias towards difficult queries respectively.
Points are slightly shifted and downsampled for clarity.
</div>
`DART-Math-Hard` contains \~585k mathematical QA pairs constructed by applying `DARS-Prop2Diff` to the query set from the MATH and GSM8K training sets, and achieves **SOTA** results on many challenging mathematical reasoning benchmarks. It introduces a **deliberate bias towards hard queries**, the opposite of vanilla rejection sampling.
Models trained on `DART-Math-Hard` usually, but not always, perform **slightly better (\~1% absolute)** than those trained on `DART-Math-Uniform`, which contains \~591k samples constructed by applying `DARS-Uniform`.
### Comparison between Mathematical Instruction Tuning Datasets
Most previous datasets are **constructed with ChatGPT**, and many of them are **not open-source**, especially the best-performing ones.
| Math SFT Dataset | # of Samples | [MATH](https://huggingface.co/datasets/hendrycks/competition_math) | [GSM8K](https://huggingface.co/datasets/gsm8k) | [College](https://github.com/hkust-nlp/dart-math/tree/main/data/eval-dsets/mwpbench/college-math-test.jsonl) | Synthesis Agent(s) | Open-Source |
| :--------------------------------------------------------------------------------- | -----------: | -----------------------------------------------------------------: | ---------------------------------------------: | -----------------------------------------------------------------------------------------------------------: | :---------------------- | :-------------------------------------------------------------------------: |
| [WizardMath](https://arxiv.org/abs/2308.09583) | 96k | 32.3 | 80.4 | 23.1 | GPT-4 | ✗ |
| [MetaMathQA](https://arxiv.org/abs/2309.12284) | 395k | 29.8 | 76.5 | 19.3 | GPT-3.5 | [✓](https://huggingface.co/datasets/meta-math/MetaMathQA) |
| [MMIQC](https://arxiv.org/abs/2401.09003) | **2294k** | 37.4 | 75.4 | _28.5_ | **GPT-4+GPT-3.5+Human** | [**✓**](https://huggingface.co/datasets/Vivacem/MMIQC) |
| [Orca-Math](https://arxiv.org/abs/2402.14830) | 200k | -- | -- | -- | GPT-4 | [✓](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k) |
| [Xwin-Math-V1.1](https://arxiv.org/abs/2403.04706) | **1440k** | _45.5_ | **84.9** | 27.6 | **GPT-4** | **✗** |
| [KPMath-Plus](https://arxiv.org/abs/2403.02333) | **1576k** | **46.8** | 82.1 | -- | **GPT-4** | **✗** |
| [MathScaleQA](https://arxiv.org/abs/2403.02884) | 2021k | 35.2 | 74.8 | 21.8 | GPT-3.5+Human | ✗ |
| [`DART-Math-Uniform`](https://huggingface.co/datasets/hkust-nlp/dart-math-uniform) | **591k** | 43.5 | _82.6_ | 26.9 | **DeepSeekMath-7B-RL** | [**✓**](https://huggingface.co/datasets/hkust-nlp/dart-math-uniform) |
| [`DART-Math-Hard`](https://huggingface.co/datasets/hkust-nlp/dart-math-hard) | **585k** | _45.5_ | 81.1 | **29.4** | **DeepSeekMath-7B-RL** | [**✓**](https://huggingface.co/datasets/hkust-nlp/dart-math-hard) |
<sup>MATH and GSM8K are **in-domain**, while College(Math) is **out-of-domain**. Results here are for models fine-tuned from [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1), except for Xwin-Math-V1.1, which is based on [Llama2-7B](https://huggingface.co/meta-llama/Llama-2-7b-hf). **Bold**/_Italic_ marks the best/second-best score here.</sup>
## Dataset Construction: `DARS` - Difficulty-Aware Rejection Sampling
Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results.
However, our analysis of these datasets reveals **severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries**.
Motivated by the observation above, we propose *Difficulty-Aware Rejection Sampling* (`DARS`), which collects more responses for more difficult queries.
Specifically, we introduce two strategies to increase the number of correct responses for difficult queries:
1) **Uniform**, which involves sampling responses for each query until **each query accumulates $k_u$ correct
responses**, where $k_u$ is a preset hyperparameter determined by the desired size of the synthetic dataset;
2) **Prop2Diff**, where we continue sampling responses until the number of correct responses for each
query is **proportional to its difficulty score**. The most challenging queries will receive $k_p$ responses,
where $k_p$ is a hyperparameter. This method introduces a deliberate bias in the opposite direction to
vanilla rejection sampling, towards more difficult queries, inspired by previous works
that demonstrate **difficult samples can be more effective to enhance model capabilities** ([Sorscher et al.,
2022](https://proceedings.neurips.cc/paper_files/paper/2022/hash/7b75da9b61eda40fa35453ee5d077df6-Abstract-Conference.html); [Liu et al., 2024b](https://openreview.net/forum?id=BTKAeLqLMw)).
See [Figure 1 (Right)](https://tongyx361.github.io/assets/dart-math/main-nresp-vs-query.png) for examples of `DART-Math-Uniform` by `DARS-Uniform` and `DART-Math-Hard` by `DARS-Prop2Diff`.
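
The two strategies above can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: the difficulty scores and pass rates are made-up inputs, a random draw stands in for generating and checking a response, the `max_raw` cap and the rounding of Prop2Diff targets are choices made for this example, and all function names are hypothetical.

```python
import random
from typing import Dict, Tuple


def uniform_targets(difficulty: Dict[str, float], k_u: int) -> Dict[str, int]:
    """Uniform: every query aims for the same number k_u of correct responses."""
    return {query: k_u for query in difficulty}


def prop2diff_targets(difficulty: Dict[str, float], k_p: int) -> Dict[str, int]:
    """Prop2Diff: targets proportional to difficulty; the hardest query gets k_p."""
    d_max = max(difficulty.values())
    if d_max <= 0:
        raise ValueError("at least one query needs a positive difficulty score")
    # Rounding to an integer count is an assumption of this sketch.
    return {query: round(k_p * (d / d_max)) for query, d in difficulty.items()}


def sample_until(target: int, pass_rate: float, rng: random.Random,
                 max_raw: int = 10_000) -> Tuple[int, int]:
    """Draw simulated responses until `target` are correct or `max_raw` is hit.

    Returns (n_raw, n_correct).
    """
    n_raw = n_correct = 0
    while n_correct < target and n_raw < max_raw:
        n_raw += 1
        if rng.random() < pass_rate:  # stand-in for "response is correct"
            n_correct += 1
    return n_raw, n_correct


def run_dars(targets: Dict[str, int], pass_rates: Dict[str, float],
             rng: random.Random, max_raw: int = 10_000) -> Dict[str, Tuple[int, int]]:
    """Apply per-query targets; returns (n_raw, n_correct) for each query."""
    return {q: sample_until(t, pass_rates[q], rng, max_raw) for q, t in targets.items()}


def vanilla_counts(pass_rates: Dict[str, float], n_raw: int,
                   rng: random.Random) -> Dict[str, int]:
    """Vanilla rejection sampling: a fixed number of raw samples per query."""
    return {q: sum(rng.random() < p for _ in range(n_raw)) for q, p in pass_rates.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    difficulty = {"easy": 0.25, "medium": 0.5, "hard": 1.0}  # toy scores
    pass_rates = {"easy": 0.9, "medium": 0.5, "hard": 0.05}  # toy pass rates
    print("vanilla  :", vanilla_counts(pass_rates, 10, rng))
    print("uniform  :", run_dars(uniform_targets(difficulty, 4), pass_rates, rng))
    print("prop2diff:", run_dars(prop2diff_targets(difficulty, 8), pass_rates, rng))
```

In this toy setup the vanilla baseline spends the same raw budget on every query, so easy queries contribute most of the correct responses, whereas the two target-based variants keep sampling hard queries until their quota is met or the cap is reached.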
## Citation
If you find our data, model or code useful for your work, please kindly cite [our paper](https://arxiv.org/abs/2407.13690):
```latex
@article{tong2024dartmath,
title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving},
author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He},
year={2024},
eprint={2407.13690},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.13690},
}
```
Provider: maas
Created: 2025-02-17



