Download link:

https://modelscope.cn/datasets/hkust-nlp/dart-math-pool-math-query-info

Resource description:

> [!NOTE]
> This dataset contains the **synthesis information of queries** from the **MATH** training set,
> such as the number of raw and correct samples in each synthesis job.
> It is usually used together with `dart-math-pool-math`.

# 🎯 DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

📝 [Paper@arXiv](https://arxiv.org/abs/2407.13690) | 🤗 [Datasets&Models@HF](https://huggingface.co/collections/hkust-nlp/dart-math-665704599b35de59f8fdf6c1) | 🐱 [Code@GitHub](https://github.com/hkust-nlp/dart-math)

🐦 [Thread@X(Twitter)](https://x.com/tongyx361/status/1811413243350454455) | 🐶 [中文博客@知乎](https://zhuanlan.zhihu.com/p/708371895) | 📊 [Leaderboard@PapersWithCode](https://paperswithcode.com/paper/dart-math-difficulty-aware-rejection-tuning#results) | 📑 [BibTeX](https://github.com/hkust-nlp/dart-math?tab=readme-ov-file#citation)

## Datasets: `DART-Math`

`DART-Math` datasets are the **state-of-the-art** and **data-efficient** **open-source** instruction tuning datasets for mathematical reasoning.

**Figure 1.** *Left:* Average accuracy on 6 mathematical benchmarks. We compare with models fine-tuned on the best public instruction tuning datasets for mathematical problem-solving: MetaMath (Yu et al., 2024) with 395K examples, MMIQC (Liu et al., 2024a) with 2.3 million examples, and vanilla rejection tuning (VRT) with 590K examples. Both DART-Math (Uniform) and DART-Math (Prop2Diff) use 590K training examples. *Right:* Number of responses for each query, in descending order of difficulty, across 3 synthesis strategies. Queries are from the MATH training split (Hendrycks et al., 2021). VRT is the baseline, which is biased towards easy queries, while Uniform and Prop2Diff are proposed in this work to balance the distribution and to bias it towards difficult queries, respectively. Points are slightly shifted and downsampled for clarity.

`DART-Math-Hard` contains ~585k mathematical QA pair samples constructed by applying `DARS-Prop2Diff` to the query set from the MATH and GSM8K training sets, and achieves **SOTA** on many challenging mathematical reasoning benchmarks. It introduces a **deliberate bias towards hard queries**, the opposite of vanilla rejection sampling.

Models trained on `DART-Math-Hard` usually, but not always, perform **slightly better (~1% absolute)** than those trained on `DART-Math-Uniform`, which contains ~591k samples constructed by applying `DARS-Uniform`.

### Comparison between Mathematical Instruction Tuning Datasets

Most previous datasets are **constructed with ChatGPT**, and many of them are **not open-source**, especially the best-performing ones.

| Math SFT Dataset | # of Samples | [MATH](https://huggingface.co/datasets/hendrycks/competition_math) | [GSM8K](https://huggingface.co/datasets/gsm8k) | [College](https://github.com/hkust-nlp/dart-math/tree/main/data/eval-dsets/mwpbench/college-math-test.jsonl) | Synthesis Agent(s) | Open-Source |
| :--- | ---: | ---: | ---: | ---: | :--- | :---: |
| [WizardMath](https://arxiv.org/abs/2308.09583) | 96k | 32.3 | 80.4 | 23.1 | GPT-4 | ✗ |
| [MetaMathQA](https://arxiv.org/abs/2309.12284) | 395k | 29.8 | 76.5 | 19.3 | GPT-3.5 | [✓](https://huggingface.co/datasets/meta-math/MetaMathQA) |
| [MMIQC](https://arxiv.org/abs/2401.09003) | **2294k** | 37.4 | 75.4 | _28.5_ | **GPT-4+GPT-3.5+Human** | [**✓**](https://huggingface.co/datasets/Vivacem/MMIQC) |
| [Orca-Math](https://arxiv.org/abs/2402.14830) | 200k | -- | -- | -- | GPT-4 | [✓](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k) |
| [Xwin-Math-V1.1](https://arxiv.org/abs/2403.04706) | **1440k** | _45.5_ | **84.9** | 27.6 | **GPT-4** | **✗** |
| [KPMath-Plus](https://arxiv.org/abs/2403.02333) | **1576k** | **46.8** | 82.1 | -- | **GPT-4** | **✗** |
| [MathScaleQA](https://arxiv.org/abs/2403.02884) | 2021k | 35.2 | 74.8 | 21.8 | GPT-3.5+Human | ✗ |
| [`DART-Math-Uniform`](https://huggingface.co/datasets/hkust-nlp/dart-math-uniform) | **591k** | 43.5 | _82.6_ | 26.9 | **DeepSeekMath-7B-RL** | [**✓**](https://huggingface.co/datasets/hkust-nlp/dart-math-uniform) |
| [`DART-Math-Hard`](https://huggingface.co/datasets/hkust-nlp/dart-math-hard) | **585k** | _45.5_ | 81.1 | **29.4** | **DeepSeekMath-7B-RL** | [**✓**](https://huggingface.co/datasets/hkust-nlp/dart-math-hard) |

MATH and GSM8K are **in-domain**, while College(Math) is **out-of-domain**. The scores here are for models fine-tuned from [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1), except for Xwin-Math-V1.1, which is based on [Llama2-7B](https://huggingface.co/meta-llama/Llama-2-7b-hf). **Bold**/_Italic_ marks the best/second-best score.

## Dataset Construction: `DARS` - Difficulty-Aware Rejection Sampling

Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals **severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries**.

Motivated by this observation, we propose *Difficulty-Aware Rejection Sampling* (`DARS`), which collects more responses for more difficult queries. Specifically, we introduce two strategies to increase the number of correct responses for difficult queries:

1. **Uniform**, which samples responses for each query until **each query accumulates $k_u$ correct responses**, where $k_u$ is a preset hyperparameter determined by the desired size of the synthetic dataset;
2. **Prop2Diff**, which continues sampling responses until the number of correct responses for each query is **proportional to its difficulty score**. The most challenging queries receive $k_p$ responses, where $k_p$ is a hyperparameter. This method introduces a deliberate bias in the opposite direction to vanilla rejection sampling, towards more difficult queries, inspired by previous works demonstrating that **difficult samples can be more effective at enhancing model capabilities** ([Sorscher et al., 2022](https://proceedings.neurips.cc/paper_files/paper/2022/hash/7b75da9b61eda40fa35453ee5d077df6-Abstract-Conference.html); [Liu et al., 2024b](https://openreview.net/forum?id=BTKAeLqLMw)).

See [Figure 1 (Right)](https://tongyx361.github.io/assets/dart-math/main-nresp-vs-query.png) for examples of `DART-Math-Uniform` by `DARS-Uniform` and `DART-Math-Hard` by `DARS-Prop2Diff`.

## Citation

If you find our data, model or code useful for your work, please kindly cite [our paper](https://arxiv.org/abs/2407.13690):

```latex
@article{tong2024dartmath,
  title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving},
  author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He},
  year={2024},
  eprint={2407.13690},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.13690},
}
```
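As a rough illustration of the two `DARS` strategies described under Dataset Construction, the sketch below computes per-query targets for correct responses and runs a generic rejection-sampling loop. It is not the authors' implementation. All function names are hypothetical, and several details are assumptions this page does not specify: that the difficulty score is the fail rate derived from raw and correct sample counts, that Prop2Diff targets are rounded with a floor of one response, and that sampling stops after a `max_tries` cap.

```python
import random


def fail_rate(n_raw: int, n_correct: int) -> float:
    """Assumed difficulty proxy: share of sampled responses that were wrong."""
    if n_raw <= 0:
        raise ValueError("n_raw must be positive")
    return 1.0 - n_correct / n_raw


def uniform_targets(difficulty: dict[str, float], k_u: int) -> dict[str, int]:
    """Uniform: every query aims for the same k_u correct responses."""
    return {query: k_u for query in difficulty}


def prop2diff_targets(difficulty: dict[str, float], k_p: int) -> dict[str, int]:
    """Prop2Diff: targets proportional to difficulty; the hardest query gets k_p."""
    d_max = max(difficulty.values())
    if d_max <= 0:
        # Degenerate case (assumption): no query ever failed, keep one each.
        return {query: 1 for query in difficulty}
    # Rounding and the floor of 1 are assumptions made for this sketch.
    return {q: max(1, round(k_p * (d / d_max))) for q, d in difficulty.items()}


def rejection_sample(query, target, sample_fn, is_correct, max_tries=1000):
    """Sample until `target` correct responses are kept or `max_tries` is hit.

    Returns the kept responses and the number of raw samples drawn.
    """
    kept, n_raw = [], 0
    while len(kept) < target and n_raw < max_tries:
        response = sample_fn(query)
        n_raw += 1
        if is_correct(query, response):
            kept.append(response)
    return kept, n_raw


if __name__ == "__main__":
    # Toy (n_raw, n_correct) counts per query; purely illustrative numbers.
    counts = {"easy": (10, 9), "medium": (10, 5), "hard": (10, 1)}
    difficulty = {q: fail_rate(*c) for q, c in counts.items()}
    rng = random.Random(0)

    def simulated_model(query):
        # Stand-in for an LLM: answers correctly with probability 1 - difficulty.
        return "right" if rng.random() > difficulty[query] else "wrong"

    targets = prop2diff_targets(difficulty, k_p=8)
    for query, target in targets.items():
        kept, n_raw = rejection_sample(
            query, target, simulated_model, lambda q, r: r == "right"
        )
        print(query, "target:", target, "kept:", len(kept), "raw:", n_raw)
```

Swapping `prop2diff_targets` for `uniform_targets` in the demo gives the Uniform strategy. In both cases hard queries consume more raw samples per kept response, which is the cost that vanilla rejection sampling avoids by under-representing them.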


Use cases: