dart-math-pool-math
Source: ModelScope community (魔搭社区) · Updated 2026-01-06 · Indexed 2025-02-22
Download link:
https://modelscope.cn/datasets/hkust-nlp/dart-math-pool-math
Description:
> [!NOTE]
> This dataset is the data pool synthesized from the query set of the **MATH** training set,
> containing **all answer-correct samples** and other metadata produced during the work.
> `DART-Math-*` datasets are extracted from `dart-math-pool-*` data pools.
# 🎯 DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
📝 [Paper@arXiv](https://arxiv.org/abs/2407.13690) | 🤗 [Datasets&Models@HF](https://huggingface.co/collections/hkust-nlp/dart-math-665704599b35de59f8fdf6c1) | 🐱 [Code@GitHub](https://github.com/hkust-nlp/dart-math)
🐦 [Thread@X(Twitter)](https://x.com/tongyx361/status/1811413243350454455) | 🐶 [中文博客@知乎](https://zhuanlan.zhihu.com/p/708371895) | 📊 [Leaderboard@PapersWithCode](https://paperswithcode.com/paper/dart-math-difficulty-aware-rejection-tuning#results) | 📑 [BibTeX](https://github.com/hkust-nlp/dart-math?tab=readme-ov-file#citation)
## Datasets: `DART-Math`
`DART-Math` datasets are the **state-of-the-art** and **data-efficient** **open-source** instruction tuning datasets for mathematical reasoning.
<style>
.container {
display: flex;
justify-content: space-around;
}
.container img {
max-width: 45%;
height: auto;
}
.caption {
text-align: center;
font-size: small;
margin-top: 10px;
}
</style>
<div class="container">
<img src="https://tongyx361.github.io/assets/dart-math/main-results.png" alt="Main results averaged on 2 in-domain and 4 challenging out-of-domain mathematical reasoning benchmarks.">
<img src="https://tongyx361.github.io/assets/dart-math/main-nresp-vs-query.png" alt="Number of responses v.s. query descending in difficulty in DART-Math datasets and similar-sized VRT baseline.">
</div>
<div class="caption">
Figure 1: <strong>Left:</strong> Average accuracy on 6 mathematical benchmarks. We compare with models fine-tuned on the best, public instruction tuning datasets for mathematical problem-solving:
MetaMath <a href="https://openreview.net/forum?id=N8N0hgNDRt">(Yu et al., 2024)</a> with 395K
examples,
MMIQC <a href="https://arxiv.org/abs/2401.09003">(Liu et al., 2024a)</a> with 2.3 million examples,
as well as vanilla rejection tuning (VRT) with 590K examples.
Both <em>DART-Math (Uniform)</em> and <em>DART-Math (Prop2Diff)</em> use 590K training examples.
<strong>Right:</strong> Number of responses for each query descending by difficulty across 3 synthesis strategies.
Queries are from the MATH training split <a href="https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html">(Hendrycks et al., 2021)</a>.
VRT is the baseline biased towards easy queries, while <em>Uniform</em> and <em>Prop2Diff</em> are proposed in this work to balance and bias towards difficult queries respectively.
Points are slightly shifted and downsampled for clarity.
</div>
`DART-Math-Hard` contains \~585k mathematical QA pair samples constructed by applying `DARS-Prop2Diff` to the query set from the MATH and GSM8K training sets, and achieves **SOTA** on many challenging mathematical reasoning benchmarks. It introduces a **deliberate bias towards hard queries**, the opposite of vanilla rejection sampling.
Models trained on `DART-Math-Hard` usually, but not always, perform **slightly better (\~1% absolute)** than those trained on `DART-Math-Uniform`, which contains \~591k samples constructed by applying `DARS-Uniform`.
### Comparison between Mathematical Instruction Tuning Datasets
Most previous datasets are **constructed with ChatGPT**, and many of them are **not open-source**, especially the best-performing ones.
| Math SFT Dataset | # of Samples | [MATH](https://huggingface.co/datasets/hendrycks/competition_math) | [GSM8K](https://huggingface.co/datasets/gsm8k) | [College](https://github.com/hkust-nlp/dart-math/tree/main/data/eval-dsets/mwpbench/college-math-test.jsonl) | Synthesis Agent(s) | Open-Source |
| :--------------------------------------------------------------------------------- | -----------: | -----------------------------------------------------------------: | ---------------------------------------------: | -----------------------------------------------------------------------------------------------------------: | :---------------------- | :-------------------------------------------------------------------------: |
| [WizardMath](https://arxiv.org/abs/2308.09583) | 96k | 32.3 | 80.4 | 23.1 | GPT-4 | ✗ |
| [MetaMathQA](https://arxiv.org/abs/2309.12284) | 395k | 29.8 | 76.5 | 19.3 | GPT-3.5 | [✓](https://huggingface.co/datasets/meta-math/MetaMathQA) |
| [MMIQC](https://arxiv.org/abs/2401.09003) | **2294k** | 37.4 | 75.4 | _28.5_ | **GPT-4+GPT-3.5+Human** | [**✓**](https://huggingface.co/datasets/Vivacem/MMIQC) |
| [Orca-Math](https://arxiv.org/abs/2402.14830) | 200k | -- | -- | -- | GPT-4 | [✓](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k) |
| [Xwin-Math-V1.1](https://arxiv.org/abs/2403.04706) | **1440k** | _45.5_ | **84.9** | 27.6 | **GPT-4** | **✗** |
| [KPMath-Plus](https://arxiv.org/abs/2403.02333) | **1576k** | **46.8** | 82.1 | -- | **GPT-4** | **✗** |
| [MathScaleQA](https://arxiv.org/abs/2403.02884) | 2021k | 35.2 | 74.8 | 21.8 | GPT-3.5+Human | ✗ |
| [`DART-Math-Uniform`](https://huggingface.co/datasets/hkust-nlp/dart-math-uniform) | **591k** | 43.5 | _82.6_ | 26.9 | **DeepSeekMath-7B-RL** | [**✓**](https://huggingface.co/datasets/hkust-nlp/dart-math-uniform) |
| [`DART-Math-Hard`](https://huggingface.co/datasets/hkust-nlp/dart-math-hard) | **585k** | _45.5_ | 81.1 | **29.4** | **DeepSeekMath-7B-RL** | [**✓**](https://huggingface.co/datasets/hkust-nlp/dart-math-hard) |
<sup>MATH and GSM8K are **in-domain**, while College(Math) is **out-of-domain**. Scores are for models fine-tuned from [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1), except Xwin-Math-V1.1, which is based on [Llama2-7B](https://huggingface.co/meta-llama/Llama-2-7b-hf). **Bold**/_Italic_ marks the best/second-best score in each column.</sup>
## Dataset Construction: `DARS` - Difficulty-Aware Rejection Sampling
Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results.
However, our analysis of these datasets reveals **severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries**.
Motivated by the observation above, we propose *Difficulty-Aware Rejection Sampling* (`DARS`) to collect more responses for more difficult queries.
Specifically, we introduce two strategies to increase the number of correct responses for difficult queries:
1) **Uniform**, which involves sampling responses for each query until **each query accumulates $k_u$ correct
responses**, where $k_u$ is a preset hyperparameter determined by the desired size of the synthetic dataset;
2) **Prop2Diff**, where we continue sampling responses until the number of correct responses for each
query is **proportional to its difficulty score**. The most challenging queries will receive $k_p$ responses,
where $k_p$ is a hyperparameter. This method introduces a deliberate bias in the opposite direction to
vanilla rejection sampling, towards more difficult queries, inspired by previous works
that demonstrate **difficult samples can be more effective to enhance model capabilities** ([Sorscher et al.,
2022](https://proceedings.neurips.cc/paper_files/paper/2022/hash/7b75da9b61eda40fa35453ee5d077df6-Abstract-Conference.html); [Liu et al., 2024b](https://openreview.net/forum?id=BTKAeLqLMw)).
See [Figure 1 (Right)](https://tongyx361.github.io/assets/dart-math/main-nresp-vs-query.png) for examples of `DART-Math-Uniform` by `DARS-Uniform` and `DART-Math-Hard` by `DARS-Prop2Diff`.
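The two strategies above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the functions `target_counts_uniform`, `target_counts_prop2diff` and `rejection_sample` are hypothetical names, the difficulty score is assumed to be a non-negative number per query, and the rounding rule (round up, at least one response per query) and the `max_trials` cap are assumptions made so the sketch terminates. `sample_fn` and `is_correct` stand in for a response generator and an answer checker.

```python
import math
from typing import Any, Callable, Dict, List


def target_counts_uniform(queries: List[str], k_u: int) -> Dict[str, int]:
    """Uniform: every query aims for the same number k_u of correct responses."""
    return {q: k_u for q in queries}


def target_counts_prop2diff(difficulty: Dict[str, float], k_p: int) -> Dict[str, int]:
    """Prop2Diff: targets proportional to difficulty; the hardest query gets k_p."""
    d_max = max(difficulty.values())
    if d_max <= 0:
        # Assumption: with no difficulty signal, fall back to one response each.
        return {q: 1 for q in difficulty}
    # Assumption: round up, and keep at least one response per query.
    return {
        q: min(k_p, max(1, math.ceil(k_p * (d / d_max))))
        for q, d in difficulty.items()
    }


def rejection_sample(
    targets: Dict[str, int],
    sample_fn: Callable[[str], Any],
    is_correct: Callable[[str, Any], bool],
    max_trials: int = 1000,
) -> Dict[str, List[Any]]:
    """Sample each query until its target number of correct responses is reached.

    Incorrect responses are rejected. max_trials (an assumption of this sketch)
    bounds the effort spent on queries the sampler rarely or never solves.
    """
    pool: Dict[str, List[Any]] = {q: [] for q in targets}
    for query, k in targets.items():
        trials = 0
        while len(pool[query]) < k and trials < max_trials:
            response = sample_fn(query)
            trials += 1
            if is_correct(query, response):
                pool[query].append(response)
    return pool
```

With `target_counts_uniform`, every query ends up with the same number of kept responses. With `target_counts_prop2diff`, harder queries keep more, which is the reverse of the easy-query bias of vanilla rejection sampling with a fixed number of trials per query.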
## Citation
If you find our data, model or code useful for your work, please kindly cite [our paper](https://arxiv.org/abs/2407.13690):
```latex
@article{tong2024dartmath,
title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving},
author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He},
year={2024},
eprint={2407.13690},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.13690},
}
```
Provider:
maas
Created:
2025-02-17



