dart-math-hard
Source: ModelScope community (updated 2025-12-05, indexed 2025-02-22)
Download link:
https://modelscope.cn/datasets/hkust-nlp/dart-math-hard
Description:
# 🎯 DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
📝 [Paper@arXiv](https://arxiv.org/abs/2407.13690) | 🤗 [Datasets&Models@HF](https://huggingface.co/collections/hkust-nlp/dart-math-665704599b35de59f8fdf6c1) | 🐱 [Code@GitHub](https://github.com/hkust-nlp/dart-math)
🐦 [Thread@X(Twitter)](https://x.com/tongyx361/status/1811413243350454455) | 🐶 [中文博客@知乎](https://zhuanlan.zhihu.com/p/708371895) | 📊 [Leaderboard@PapersWithCode](https://paperswithcode.com/paper/dart-math-difficulty-aware-rejection-tuning#results) | 📑 [BibTeX](https://github.com/hkust-nlp/dart-math?tab=readme-ov-file#citation)
> [!IMPORTANT]
> 🔥 Excited to find that **[our `DART-Math-DSMath-7B` (Prop2Diff)](https://huggingface.co/hkust-nlp/dart-math-dsmath-7b-prop2diff) trained on [`DART-Math-Hard`](https://huggingface.co/datasets/hkust-nlp/dart-math-hard) is [comparable](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf) to the AIMO winner [NuminaMath-7B](https://huggingface.co/AI-MO/NuminaMath-7B-CoT)** on CoT,
> even though it is based solely on the [MATH](https://huggingface.co/datasets/hkust-nlp/dart-math-pool-math-query-info) & [GSM8K](https://huggingface.co/datasets/hkust-nlp/dart-math-pool-gsm8k-query-info) prompt sets, leaving much room to improve!
> Besides, our [`DART` method](https://github.com/hkust-nlp/dart-math?tab=readme-ov-file#dars--difficulty-aware-rejection-sampling) is also fully compatible with [tool-integrated reasoning](https://github.com/hkust-nlp/dart-math?tab=readme-ov-file#tool-integrated-reasoning-reasoning-in-natural-language-interleaved-with-python-code).
> Find more details and join the discussion under this [X thread](https://x.com/tongyx361/status/1815112376649134172)!
## Datasets: `DART-Math`
`DART-Math` datasets are the **state-of-the-art** and **data-efficient** **open-source** instruction tuning datasets for mathematical reasoning.
<style>
.container {
display: flex;
justify-content: space-around;
}
.container img {
max-width: 45%;
height: auto;
}
.caption {
text-align: center;
font-size: small;
margin-top: 10px;
}
</style>
<div class="container">
<img src="https://tongyx361.github.io/assets/dart-math/main-results.png" alt="Main results averaged on 2 in-domain and 4 challenging out-of-domain mathematical reasoning benchmarks.">
<img src="https://tongyx361.github.io/assets/dart-math/main-nresp-vs-query.png" alt="Number of responses vs. query descending in difficulty in DART-Math datasets and similar-sized VRT baseline.">
</div>
<div class="caption">
Figure 1: <strong>Left:</strong> Average accuracy on 6 mathematical benchmarks. We compare with models fine-tuned on the best, public instruction tuning datasets for mathematical problem-solving:
MetaMath <a href="https://openreview.net/forum?id=N8N0hgNDRt">(Yu et al., 2024)</a> with 395K
examples,
MMIQC <a href="https://arxiv.org/abs/2401.09003">(Liu et al., 2024a)</a> with 2.3 million examples,
as well as vanilla rejection tuning (VRT) with 590K examples.
Both <em>DART-Math (Uniform)</em> and <em>DART-Math (Prop2Diff)</em> use 590K training examples.
<strong>Right:</strong> Number of responses for each query descending by difficulty across 3 synthesis strategies.
Queries are from the MATH training split <a href="https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html">(Hendrycks et al., 2021)</a>.
VRT is the baseline biased towards easy queries, while <em>Uniform</em> and <em>Prop2Diff</em> are proposed in this work to balance and bias towards difficult queries respectively.
Points are slightly shifted and downsampled for clarity.
</div>
`DART-Math-Hard` contains \~585k mathematical QA pairs constructed by applying `DARS-Prop2Diff` to the query set from the MATH and GSM8K training sets, and achieves **SOTA** on many challenging mathematical reasoning benchmarks. It introduces a **deliberate bias towards hard queries**, the opposite of vanilla rejection sampling.
Models trained on `DART-Math-Hard` usually, but not always, perform **slightly better (\~1% absolute)** than those trained on `DART-Math-Uniform`, which contains \~591k samples constructed by applying `DARS-Uniform`.
### Comparison between Mathematical Instruction Tuning Datasets
Most previous datasets are **constructed with ChatGPT**, and many of them are **not open-source**, especially the best-performing ones.
| Math SFT Dataset | # of Samples | [MATH](https://huggingface.co/datasets/hendrycks/competition_math) | [GSM8K](https://huggingface.co/datasets/gsm8k) | [College](https://github.com/hkust-nlp/dart-math/tree/main/data/eval-dsets/mwpbench/college-math-test.jsonl) | Synthesis Agent(s) | Open-Source |
| :--------------------------------------------------------------------------------- | -----------: | -----------------------------------------------------------------: | ---------------------------------------------: | -----------------------------------------------------------------------------------------------------------: | :---------------------- | :-------------------------------------------------------------------------: |
| [WizardMath](https://arxiv.org/abs/2308.09583) | 96k | 32.3 | 80.4 | 23.1 | GPT-4 | ✗ |
| [MetaMathQA](https://arxiv.org/abs/2309.12284) | 395k | 29.8 | 76.5 | 19.3 | GPT-3.5 | [✓](https://huggingface.co/datasets/meta-math/MetaMathQA) |
| [MMIQC](https://arxiv.org/abs/2401.09003) | **2294k** | 37.4 | 75.4 | _28.5_ | **GPT-4+GPT-3.5+Human** | [**✓**](https://huggingface.co/datasets/Vivacem/MMIQC) |
| [Orca-Math](https://arxiv.org/abs/2402.14830) | 200k | -- | -- | -- | GPT-4 | [✓](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k) |
| [Xwin-Math-V1.1](https://arxiv.org/abs/2403.04706) | **1440k** | _45.5_ | **84.9** | 27.6 | **GPT-4** | **✗** |
| [KPMath-Plus](https://arxiv.org/abs/2403.02333) | **1576k** | **46.8** | 82.1 | -- | **GPT-4** | **✗** |
| [MathScaleQA](https://arxiv.org/abs/2403.02884) | 2021k | 35.2 | 74.8 | 21.8 | GPT-3.5+Human | ✗ |
| [`DART-Math-Uniform`](https://huggingface.co/datasets/hkust-nlp/dart-math-uniform) | **591k** | 43.5 | _82.6_ | 26.9 | **DeepSeekMath-7B-RL** | [**✓**](https://huggingface.co/datasets/hkust-nlp/dart-math-uniform) |
| [`DART-Math-Hard`](https://huggingface.co/datasets/hkust-nlp/dart-math-hard) | **585k** | _45.5_ | 81.1 | **29.4** | **DeepSeekMath-7B-RL** | [**✓**](https://huggingface.co/datasets/hkust-nlp/dart-math-hard) |
<sup>MATH and GSM8K are **in-domain**, while College(Math) is **out-of-domain**. Scores here are for models fine-tuned from [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1), except for Xwin-Math-V1.1, which is based on [Llama2-7B](https://huggingface.co/meta-llama/Llama-2-7b-hf). **Bold**/_Italic_ marks the best/second-best score here.</sup>
## Dataset Construction: `DARS` - Difficulty-Aware Rejection Sampling
Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results.
However, our analysis of these datasets reveals **severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries**.
Motivated by this observation, we propose *Difficulty-Aware Rejection Sampling* (`DARS`), which collects more responses for more difficult queries.
Specifically, we introduce two strategies to increase the number of correct responses for difficult queries:
1) **Uniform**, which involves sampling responses for each query until **each query accumulates $k_u$ correct
responses**, where $k_u$ is a preset hyperparameter determined by the desired size of the synthetic dataset;
2) **Prop2Diff**, where we continue sampling responses until the number of correct responses for each
query is **proportional to its difficulty score**. The most challenging queries receive $k_p$ responses,
where $k_p$ is a hyperparameter. This method introduces a deliberate bias in the opposite direction to
vanilla rejection sampling, towards more difficult queries, inspired by previous works
showing that **difficult samples can be more effective at enhancing model capabilities** ([Sorscher et al.,
2022](https://proceedings.neurips.cc/paper_files/paper/2022/hash/7b75da9b61eda40fa35453ee5d077df6-Abstract-Conference.html); [Liu et al., 2024b](https://openreview.net/forum?id=BTKAeLqLMw)).
See [Figure 1 (Right)](https://tongyx361.github.io/assets/dart-math/main-nresp-vs-query.png) for examples of `DART-Math-Uniform` by `DARS-Uniform` and `DART-Math-Hard` by `DARS-Prop2Diff`.
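The two strategies above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released `dart-math` implementation: the function names `dars_targets` and `dars_sample`, the `max_trials` cap, the floor of one response per query, and the rounding rule are all hypothetical choices made for this sketch. The difficulty score is taken as a given input in `[0, 1]`; how it is computed is not specified in this section.

```python
import random
from typing import Callable, Dict, List


def dars_targets(difficulty: Dict[str, float], strategy: str, k: int) -> Dict[str, int]:
    """Return the target number of correct responses for each query."""
    if strategy == "uniform":
        # Uniform: every query gets the same target k_u.
        return {query: k for query in difficulty}
    if strategy == "prop2diff":
        # Prop2Diff: targets scale linearly with the difficulty score, and
        # the hardest query gets k_p. The floor of 1 and the rounding are
        # assumptions of this sketch.
        d_max = max(difficulty.values())
        if d_max <= 0:
            return {query: 1 for query in difficulty}
        return {q: max(1, round(k * d / d_max)) for q, d in difficulty.items()}
    raise ValueError(f"unknown strategy: {strategy!r}")


def dars_sample(
    targets: Dict[str, int],
    sample_fn: Callable[[str], object],
    is_correct: Callable[[str, object], bool],
    max_trials: int = 1000,
) -> Dict[str, List[object]]:
    """Sample per query until its target count of correct responses is met.

    max_trials is a hypothetical safety cap so that a query the model can
    never solve does not loop forever.
    """
    synthesized: Dict[str, List[object]] = {}
    for query, target in targets.items():
        kept: List[object] = []
        trials = 0
        while len(kept) < target and trials < max_trials:
            response = sample_fn(query)
            trials += 1
            if is_correct(query, response):  # rejection step: keep correct only
                kept.append(response)
        synthesized[query] = kept
    return synthesized


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy difficulty scores; a real pipeline would estimate these per query.
    difficulty = {"easy": 0.2, "medium": 0.4, "hard": 0.8}

    def toy_sample(query: str) -> str:
        # Stand-in for a model call: harder queries fail more often.
        return "correct" if rng.random() > difficulty[query] else "wrong"

    def toy_check(query: str, response: object) -> bool:
        return response == "correct"

    for strategy in ("uniform", "prop2diff"):
        targets = dars_targets(difficulty, strategy, k=8)
        data = dars_sample(targets, toy_sample, toy_check)
        print(strategy, {q: len(r) for q, r in data.items()})
```

In contrast, vanilla rejection sampling draws a fixed number of responses per query and keeps whichever are correct, so easy queries end up with more samples. Here the stopping condition depends on the number of correct responses collected, which is what shifts the sampling budget towards hard queries.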
## Citation
If you find our data, model or code useful for your work, please kindly cite [our paper](https://arxiv.org/abs/2407.13690):
```latex
@article{tong2024dartmath,
title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving},
author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He},
year={2024},
eprint={2407.13690},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.13690},
}
```
Provider:
maas
Created:
2025-02-17



