vrt-baseline
Source: ModelScope (updated 2025-10-09, indexed 2025-02-22)

Download: https://modelscope.cn/datasets/hkust-nlp/vrt-baseline
> [!NOTE]
> This dataset is the **VRT baseline** dataset used to train baseline models `*-VRT` in Table 2 of the paper.
> > Another ablation baseline to DART is vanilla rejection tuning (VRT), where we synthesize a dataset of the same size of 0.59M examples with DeepSeekMath-7B-RL, using vanilla rejection sampling as described in §2.1.
# 🎯 DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
📝 [Paper@arXiv](https://arxiv.org/abs/2407.13690) | 🤗 [Datasets&Models@HF](https://huggingface.co/collections/hkust-nlp/dart-math-665704599b35de59f8fdf6c1) | 🐱 [Code@GitHub](https://github.com/hkust-nlp/dart-math)
🐦 [Thread@X(Twitter)](https://x.com/tongyx361/status/1811413243350454455) | 🐶 [中文博客@知乎](https://zhuanlan.zhihu.com/p/708371895) | 📊 [Leaderboard@PapersWithCode](https://paperswithcode.com/paper/dart-math-difficulty-aware-rejection-tuning#results) | 📑 [BibTeX](https://github.com/hkust-nlp/dart-math?tab=readme-ov-file#citation)
> [!IMPORTANT]
> 🔥 Excited to find **[our `DART-Math-DSMath-7B` (Prop2Diff)](https://huggingface.co/hkust-nlp/dart-math-dsmath-7b-prop2diff) trained on [`DART-Math-Hard`](https://huggingface.co/datasets/hkust-nlp/dart-math-hard) [comparable](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf) to the AIMO winner [NuminaMath-7B](https://huggingface.co/AI-MO/NuminaMath-7B-CoT)** on CoT,
> but based solely on [MATH](https://huggingface.co/datasets/hkust-nlp/dart-math-pool-math-query-info) & [GSM8K](https://huggingface.co/datasets/hkust-nlp/dart-math-pool-gsm8k-query-info) prompt set, leaving much room to improve!
> Besides, our [`DART` method](https://github.com/hkust-nlp/dart-math?tab=readme-ov-file#dars--difficulty-aware-rejection-sampling) is also fully compatible with [tool-integrated reasoning](https://github.com/hkust-nlp/dart-math?tab=readme-ov-file#tool-integrated-reasoning-reasoning-in-natural-language-interleaved-with-python-code).
> Find more details and join the discussion under this [X thread](https://x.com/tongyx361/status/1815112376649134172)!
## Datasets: `DART-Math`
`DART-Math` datasets are the **state-of-the-art** and **data-efficient** **open-source** instruction tuning datasets for mathematical reasoning.
Figure 1: Left: Average accuracy on 6 mathematical benchmarks. We compare with models fine-tuned on the best public instruction tuning datasets for mathematical problem-solving: MetaMath (Yu et al., 2024) with 395K examples, MMIQC (Liu et al., 2024a) with 2.3 million examples, as well as vanilla rejection tuning (VRT) with 590K examples. Both DART-Math (Uniform) and DART-Math (Prop2Diff) use 590K training examples. Right: Number of responses for each query, in descending order of difficulty, across 3 synthesis strategies. Queries are from the MATH training split (Hendrycks et al., 2021). VRT is the baseline biased towards easy queries, while Uniform and Prop2Diff are proposed in this work to balance and bias towards difficult queries respectively. Points are slightly shifted and downsampled for clarity.
`DART-Math-Hard` contains \~585k mathematical QA pairs constructed by applying `DARS-Prop2Diff` to the query set from the MATH and GSM8K training sets, and achieves **SOTA** results on many challenging mathematical reasoning benchmarks. It introduces a **deliberate bias towards hard queries**, the opposite of vanilla rejection sampling.
Models trained on `DART-Math-Hard` usually, but not always, perform **slightly better (\~1% absolute)** than those trained on `DART-Math-Uniform`, which contains \~591k samples constructed by applying `DARS-Uniform`.
### Comparison between Mathematical Instruction Tuning Datasets
Most previous datasets were **constructed with ChatGPT**, and many of them are **not open-source**, especially the best-performing ones.
| Math SFT Dataset | # of Samples | [MATH](https://huggingface.co/datasets/hendrycks/competition_math) | [GSM8K](https://huggingface.co/datasets/gsm8k) | [College](https://github.com/hkust-nlp/dart-math/tree/main/data/eval-dsets/mwpbench/college-math-test.jsonl) | Synthesis Agent(s) | Open-Source |
| :--------------------------------------------------------------------------------- | -----------: | -----------------------------------------------------------------: | ---------------------------------------------: | -----------------------------------------------------------------------------------------------------------: | :---------------------- | :-------------------------------------------------------------------------: |
| [WizardMath](https://arxiv.org/abs/2308.09583) | 96k | 32.3 | 80.4 | 23.1 | GPT-4 | ✗ |
| [MetaMathQA](https://arxiv.org/abs/2309.12284) | 395k | 29.8 | 76.5 | 19.3 | GPT-3.5 | [✓](https://huggingface.co/datasets/meta-math/MetaMathQA) |
| [MMIQC](https://arxiv.org/abs/2401.09003) | **2294k** | 37.4 | 75.4 | _28.5_ | **GPT-4+GPT-3.5+Human** | [**✓**](https://huggingface.co/datasets/Vivacem/MMIQC) |
| [Orca-Math](https://arxiv.org/abs/2402.14830) | 200k | -- | -- | -- | GPT-4 | [✓](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k) |
| [Xwin-Math-V1.1](https://arxiv.org/abs/2403.04706) | **1440k** | _45.5_ | **84.9** | 27.6 | **GPT-4** | **✗** |
| [KPMath-Plus](https://arxiv.org/abs/2403.02333) | **1576k** | **46.8** | 82.1 | -- | **GPT-4** | **✗** |
| [MathScaleQA](https://arxiv.org/abs/2403.02884) | 2021k | 35.2 | 74.8 | 21.8 | GPT-3.5+Human | ✗ |
| [`DART-Math-Uniform`](https://huggingface.co/datasets/hkust-nlp/dart-math-uniform) | **591k** | 43.5 | _82.6_ | 26.9 | **DeepSeekMath-7B-RL** | [**✓**](https://huggingface.co/datasets/hkust-nlp/dart-math-uniform) |
| [`DART-Math-Hard`](https://huggingface.co/datasets/hkust-nlp/dart-math-hard) | **585k** | _45.5_ | 81.1 | **29.4** | **DeepSeekMath-7B-RL** | [**✓**](https://huggingface.co/datasets/hkust-nlp/dart-math-hard) |
MATH and GSM8K are **in-domain**, while College(Math) is **out-of-domain**. Scores are for models fine-tuned from [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1), except for Xwin-Math-V1.1, which is based on [Llama2-7B](https://huggingface.co/meta-llama/Llama-2-7b-hf). **Bold**/_Italic_ marks the best/second-best score in each column.
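As a quick sanity check on the "slightly better" comparison, the snippet below recomputes the gap between the two `DART-Math` rows using only the three benchmark columns in this table. The unweighted three-benchmark mean is an illustrative aggregate chosen here, not the 6-benchmark average reported in Figure 1, and the helper names are hypothetical.

```python
# Scores copied from the comparison table (models fine-tuned from Mistral-7B).
SCORES = {
    "DART-Math-Uniform": {"MATH": 43.5, "GSM8K": 82.6, "College": 26.9},
    "DART-Math-Hard": {"MATH": 45.5, "GSM8K": 81.1, "College": 29.4},
}


def per_benchmark_gap(a, b):
    """Absolute difference a - b for each benchmark, rounded to one decimal."""
    return {name: round(a[name] - b[name], 1) for name in a}


def mean_gap(gaps):
    """Unweighted mean of the per-benchmark gaps (an illustrative aggregate)."""
    return round(sum(gaps.values()) / len(gaps), 2)


gaps = per_benchmark_gap(SCORES["DART-Math-Hard"], SCORES["DART-Math-Uniform"])
print(gaps)            # {'MATH': 2.0, 'GSM8K': -1.5, 'College': 2.5}
print(mean_gap(gaps))  # 1.0
```

On these three columns the advantage is not uniform: `DART-Math-Hard` leads on MATH and College but trails on GSM8K, which fits the "usually but not necessarily" wording.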
## Dataset Construction: `DARS` - Difficulty-Aware Rejection Sampling
Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results.
However, our analysis of these datasets reveals **severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries**.
Motivated by the observation above, we propose *Difficulty-Aware Rejection Sampling* (`DARS`), which collects more responses for more difficult queries.
Specifically, we introduce two strategies to increase the number of correct responses for difficult queries:
1) **Uniform**, which samples responses for each query until **each query accumulates $k_u$ correct responses**, where $k_u$ is a preset hyperparameter determined by the desired size of the synthetic dataset;
2) **Prop2Diff**, which continues sampling responses until the number of correct responses for each query is **proportional to its difficulty score**. The most challenging queries receive $k_p$ responses, where $k_p$ is a hyperparameter. This method introduces a deliberate bias in the opposite direction to vanilla rejection sampling, towards more difficult queries, inspired by previous works demonstrating that **difficult samples can be more effective at enhancing model capabilities** ([Sorscher et al., 2022](https://proceedings.neurips.cc/paper_files/paper/2022/hash/7b75da9b61eda40fa35453ee5d077df6-Abstract-Conference.html); [Liu et al., 2024b](https://openreview.net/forum?id=BTKAeLqLMw)).
See [Figure 1 (Right)](https://tongyx361.github.io/assets/dart-math/main-nresp-vs-query.png) for examples of `DART-Math-Uniform` by `DARS-Uniform` and `DART-Math-Hard` by `DARS-Prop2Diff`.
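The difference between vanilla rejection sampling and the two `DARS` strategies can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: the model is replaced by a coin flip with a per-query pass rate, difficulty is assumed to be `1 - pass_rate`, Prop2Diff targets are assumed to be rounded up with a floor of one response, and all function names and the `max_trials` cap are hypothetical.

```python
import math
import random


def sample_until(target, pass_rate, rng, max_trials=10_000):
    """Simulate sampling until `target` correct responses are collected.

    Each trial is correct with probability `pass_rate`; a real pipeline would
    call a model and check the answer instead. Returns (n_correct, n_trials).
    """
    correct = trials = 0
    while correct < target and trials < max_trials:
        trials += 1
        if rng.random() < pass_rate:
            correct += 1
    return correct, trials


def vanilla_counts(pass_rates, n_samples, rng):
    """Vanilla rejection sampling: a fixed budget per query, keep correct ones."""
    return [sum(rng.random() < p for _ in range(n_samples)) for p in pass_rates]


def uniform_targets(pass_rates, k_u):
    """DARS-Uniform: every query aims for the same k_u correct responses."""
    return [k_u] * len(pass_rates)


def prop2diff_targets(pass_rates, k_p):
    """DARS-Prop2Diff: targets proportional to difficulty, hardest gets k_p.

    Assumption: difficulty = 1 - pass_rate, and every query keeps at least one
    response. Neither detail is specified in the text above.
    """
    difficulty = [1.0 - p for p in pass_rates]
    hardest = max(difficulty)
    if hardest == 0.0:  # every query is trivially easy
        return [1] * len(pass_rates)
    return [max(1, math.ceil(k_p * (d / hardest))) for d in difficulty]


def collect(pass_rates, targets, rng):
    """Sample each query until its target is met (or the trial cap is hit)."""
    return [sample_until(t, p, rng)[0] for t, p in zip(targets, pass_rates)]


if __name__ == "__main__":
    rng = random.Random(0)
    pass_rates = [0.9, 0.5, 0.1]  # easy, medium, hard (made-up values)
    print("vanilla  :", vanilla_counts(pass_rates, 10, rng))
    print("uniform  :", collect(pass_rates, uniform_targets(pass_rates, 4), rng))
    print("prop2diff:", collect(pass_rates, prop2diff_targets(pass_rates, 8), rng))
```

With a fixed budget per query, the vanilla counts track the pass rate, so easy queries dominate. The two `DARS` variants instead fix the number of correct responses per query and let the number of trials vary, which is why hard queries cost more samples to cover.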
## Citation
If you find our data, model or code useful for your work, please kindly cite [our paper](https://arxiv.org/abs/2407.13690):
```latex
@article{tong2024dartmath,
title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving},
author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He},
year={2024},
eprint={2407.13690},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.13690},
}
```
Provider: maas

Created: 2025-02-17



