Source: ModelScope community (updated 2026-05-16, indexed 2024-10-12)
Download link: https://modelscope.cn/datasets/AI-ModelScope/OpenMathInstruct-2
# OpenMathInstruct-2
OpenMathInstruct-2 is a math instruction tuning dataset with 14M problem-solution pairs
generated using the [Llama3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) model.
The training set problems of [GSM8K](https://github.com/openai/grade-school-math)
and [MATH](https://github.com/hendrycks/math) are used for constructing the dataset in the following ways:
- *Solution augmentation*: Generating chain-of-thought solutions for training set problems in GSM8K and MATH.
- *Problem-Solution augmentation*: Generating new problems, followed by solutions for these new problems.
<p>
<img src="SFT Data Diagram 1.jpg" width="75%" title="Composition of OpenMathInstruct-2">
</p>
The OpenMathInstruct-2 dataset contains the following fields:
- **problem**: An original problem from the GSM8K or MATH training set, or an augmented problem derived from one of these training sets.
- **generated_solution**: Synthetically generated solution.
- **expected_answer**: For problems in the training set, it is the ground-truth answer provided in the source dataset. **For augmented problems, it is the majority-voting answer.**
- **problem_source**: Whether the problem is taken directly from GSM8K or MATH or is an augmented version derived from either dataset.
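The majority-voting rule mentioned for `expected_answer` can be illustrated with a minimal sketch. The helper name `majority_answer` and the sample answers are hypothetical, and the tie-breaking behaviour (the answer seen first wins) is an assumption of this sketch, not something the dataset card specifies.

```python
from collections import Counter


def majority_answer(candidate_answers):
    """Return the most frequent answer among candidate final answers.

    Hypothetical helper for illustration only. When several answers share
    the top count, the one that appeared first in the input is returned.
    """
    if not candidate_answers:
        raise ValueError("need at least one candidate answer")
    counts = Counter(candidate_answers)
    # most_common(1) returns a list with one (answer, count) pair
    answer, _ = counts.most_common(1)[0]
    return answer


# Made-up final answers, as if extracted from several generated solutions
samples = ["42", "42", "41", "42", "7"]
voted = majority_answer(samples)
print(voted)
```

The same idea underlies majority-style evaluation: sample several solutions, extract each final answer, and keep the most common one.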
<p>
<img src="scaling_plot.jpg" width="40%" title="Scaling Curve">
</p>
We also release 1M, 2M, and 5M *fair-downsampled* versions of the entire training set, corresponding to points in the above scaling plot.
These splits are referred to as **train_1M**, **train_2M**, and **train_5M**.
To use one of these subsets, pass its name as the `split` argument when loading the data:
```python
from datasets import load_dataset
# Load only the 1M training split, streamed lazily rather than downloaded up front
dataset = load_dataset('nvidia/OpenMathInstruct-2', split='train_1M', streaming=True)
```
To download the entire training set and convert it to JSONL format, use the following code snippet.
This might take 20-30 minutes (or more, depending on your network connection) and will use ~20 GB of RAM.
```python
import json

from datasets import load_dataset
from tqdm import tqdm

# Load the full training split (the slow, memory-heavy step)
dataset = load_dataset('nvidia/OpenMathInstruct-2', split='train')

print("Converting dataset to jsonl format")
output_file = "openmathinstruct2.jsonl"
# Write one JSON object per line, keeping non-ASCII characters unescaped
with open(output_file, 'w', encoding='utf-8') as f:
    for item in tqdm(dataset):
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

print(f"Conversion complete. Output saved as {output_file}")
```
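The conversion loop does not depend on the dataset object itself, so it can be written against any iterable of records. The following is a minimal sketch, assuming each record is a plain dict; `write_jsonl` and `read_jsonl` are hypothetical helper names, and the two sample records (including their `problem_source` values) are made up for illustration. Passing a split loaded with `streaming=True` in place of the list may reduce memory use, but that is an untested assumption here.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write an iterable of dict records to `path`, one JSON object per line.

    Returns the number of records written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for item in records:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path):
    """Lazily yield one record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Made-up records that mimic the four fields described in this card
sample_records = [
    {
        "problem": "What is 2 + 2?",
        "generated_solution": "2 + 2 = 4",
        "expected_answer": "4",
        "problem_source": "augmented_gsm8k",
    },
    {
        "problem": "Solve x + 1 = 3.",
        "generated_solution": "x = 3 - 1 = 2",
        "expected_answer": "2",
        "problem_source": "augmented_math",
    },
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "sample.jsonl")
    n_written = write_jsonl(sample_records, out_path)
    # Read the file back while the temporary directory still exists
    round_trip = list(read_jsonl(out_path))

print(n_written, len(round_trip))
```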
Apart from the dataset, we also release the [contamination explorer](https://huggingface.co/spaces/nvidia/OpenMathInstruct-2-explorer) for looking at problems
in the OpenMathInstruct-2 dataset that are similar to the [GSM8K](https://huggingface.co/datasets/openai/gsm8k), [MATH](https://github.com/hendrycks/math),
[AMC 2023](https://github.com/QwenLM/Qwen2.5-Math/tree/main/evaluation/data/amc23), [AIME 2024](https://artofproblemsolving.com/wiki/index.php/2024_AIME_I),
and [Omni-MATH](https://huggingface.co/datasets/KbsdJames/Omni-MATH) test set problems.
See our [paper](https://arxiv.org/abs/2410.01560) to learn more details!
### Note
The released dataset doesn't filter out extremely long questions. After the dataset release, we found that 564 questions (roughly 0.1%) were longer than 1024 Llama tokens.
We experimented with removing these questions and didn't see a performance drop (in fact, we observed a minor bump). Dropping these questions helps with memory as well.
We therefore recommend filtering out extremely long questions. We have updated the data preparation commands in our [GitHub documentation](https://nvidia.github.io/NeMo-Skills/openmathinstruct2/dataset/#converting-to-sft-format).
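The filtering recommended above can be sketched as follows. This is not the command from the linked documentation: `filter_long_problems` and `whitespace_token_count` are hypothetical names, and the whitespace-based counter is only a stand-in for a real Llama tokenizer, so its counts will differ from true Llama token counts.

```python
def whitespace_token_count(text):
    """Stand-in token counter (ASSUMPTION): counts whitespace-separated words.

    A real pipeline would count tokens with a Llama tokenizer instead.
    """
    return len(text.split())


def filter_long_problems(records, count_tokens, max_tokens=1024):
    """Yield only records whose `problem` has at most `max_tokens` tokens."""
    for record in records:
        if count_tokens(record["problem"]) <= max_tokens:
            yield record


# Made-up records: one short problem and one artificially long one
sample = [
    {"problem": "What is 2 + 2?"},
    {"problem": "word " * 2000},
]
kept = list(filter_long_problems(sample, whitespace_token_count))
print(len(kept))
```

Taking the counter as a parameter keeps the filter independent of any particular tokenizer library; any callable that maps a string to a token count can be passed in.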
## OpenMath2 models
To demonstrate the quality of this dataset, we release a series of OpenMath2 models trained on this data.
| Model | GSM8K | MATH | AMC 2023 | AIME 2024 | Omni-MATH |
|:---|:---:|:---:|:---:|:---:|:---:|
| Llama3.1-8B-Instruct | 84.5 | 51.9 | 9/40 | 2/30 | 12.7 |
| OpenMath2-Llama3.1-8B ([nemo](https://huggingface.co/nvidia/OpenMath2-Llama3.1-8B-nemo) \| [HF](https://huggingface.co/nvidia/OpenMath2-Llama3.1-8B)) | 91.7 | 67.8 | 16/40 | 3/30 | 22.0 |
| + majority@256 | 94.1 | 76.1 | 23/40 | 3/30 | 24.6 |
| Llama3.1-70B-Instruct | 95.8 | 67.9 | 19/40 | 6/30 | 19.0 |
| OpenMath2-Llama3.1-70B ([nemo](https://huggingface.co/nvidia/OpenMath2-Llama3.1-70B-nemo) \| [HF](https://huggingface.co/nvidia/OpenMath2-Llama3.1-70B)) | 94.9 | 71.9 | 20/40 | 4/30 | 23.1 |
| + majority@256 | 96.0 | 79.6 | 24/40 | 6/30 | 27.6 |
The pipeline we used to produce the data and models is fully open-sourced!
- [Code](https://github.com/NVIDIA/NeMo-Skills)
- [Models](https://huggingface.co/collections/nvidia/openmath-2-66fb142317d86400783d2c7b)
- [Dataset](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)
## Reproducing our results
We provide [all instructions](https://nvidia.github.io/NeMo-Skills/openmathinstruct2/)
to fully reproduce our results, including data generation.
## Citation
If you find our work useful, please consider citing us!
```bibtex
@article{toshniwal2024openmath2,
  title   = {OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data},
  author  = {Shubham Toshniwal and Wei Du and Ivan Moshkov and Branislav Kisacanin and Alexan Ayrapetyan and Igor Gitman},
  year    = {2024},
  journal = {arXiv preprint arXiv:2410.01560}
}
```
Provider: maas
Created: 2024-10-09



