OpenMathInstruct-2

Name: OpenMathInstruct-2
Creator: maas
Published: 2026-05-16 03:21:51
License: No description provided

ModelScope community · Updated 2026-05-16 · Indexed 2024-10-12

Download link:

https://modelscope.cn/datasets/AI-ModelScope/OpenMathInstruct-2

Description:

# OpenMathInstruct-2

OpenMathInstruct-2 is a math instruction tuning dataset with 14M problem-solution pairs generated using the [Llama3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) model.

The training set problems of [GSM8K](https://github.com/openai/grade-school-math) and [MATH](https://github.com/hendrycks/math) are used to construct the dataset in the following ways:

- *Solution augmentation*: Generating chain-of-thought solutions for training set problems in GSM8K and MATH.
- *Problem-Solution augmentation*: Generating new problems, followed by solutions for these new problems.

<img src="SFT Data Diagram 1.jpg" width="75%" title="Composition of OpenMathInstruct-2">

The OpenMathInstruct-2 dataset contains the following fields:

- **problem**: Original problem from either the GSM8K or MATH training set, or an augmented problem derived from these training sets.
- **generated_solution**: Synthetically generated solution.
- **expected_answer**: For problems in the training set, the ground-truth answer provided in the datasets. **For augmented problems, the majority-voting answer.**
- **problem_source**: Whether the problem is taken directly from GSM8K or MATH, or is an augmented version derived from either dataset.

<img src="scaling_plot.jpg" width="40%" title="Scaling Curve">

We also release 1M, 2M, and 5M *fair-downsampled* versions of the entire training set, corresponding to points in the above scaling plot. These splits are referred to as **train_1M**, **train_2M**, and **train_5M**. To use one of these subsets, specify it as the split when downloading the data:

```python
from datasets import load_dataset

# Download only the 1M training split
dataset = load_dataset('nvidia/OpenMathInstruct-2', split='train_1M', streaming=True)
```

To download the entire training set and convert it into the jsonl format, use the following code snippet. This might take 20-30 minutes (or more, depending on your network connection) and will use ~20GB of RAM.

```python
import json

from datasets import load_dataset
from tqdm import tqdm

# Download the full training split
dataset = load_dataset('nvidia/OpenMathInstruct-2', split='train')

print("Converting dataset to jsonl format")
output_file = "openmathinstruct2.jsonl"
with open(output_file, 'w', encoding='utf-8') as f:
    # Write one JSON object per line
    for item in tqdm(dataset):
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

print(f"Conversion complete. Output saved as {output_file}")
```

Apart from the dataset, we also release the [contamination explorer](https://huggingface.co/spaces/nvidia/OpenMathInstruct-2-explorer) for looking at problems in the OpenMathInstruct-2 dataset that are similar to the [GSM8K](https://huggingface.co/datasets/openai/gsm8k), [MATH](https://github.com/hendrycks/math), [AMC 2023](https://github.com/QwenLM/Qwen2.5-Math/tree/main/evaluation/data/amc23), [AIME 2024](https://artofproblemsolving.com/wiki/index.php/2024_AIME_I), and [Omni-MATH](https://huggingface.co/datasets/KbsdJames/Omni-MATH) test set problems.

See our [paper](https://arxiv.org/abs/2410.01560) to learn more details!

### Note

The released dataset doesn't filter out extremely long questions. After the dataset release, we found that 564 questions (roughly 0.1%) were longer than 1024 Llama tokens. We experimented with removing these questions and didn't see a performance drop (in fact, we observed a minor bump). Dropping these questions helps with memory as well, so we recommend filtering out extremely long questions. We have updated the data preparation commands in our [GitHub documentation](https://nvidia.github.io/NeMo-Skills/openmathinstruct2/dataset/#converting-to-sft-format).

## OpenMath2 models

To demonstrate the quality of this dataset, we release a series of OpenMath2 models trained on this data.

| Model | GSM8K | MATH | AMC 2023 | AIME 2024 | Omni-MATH |
|:---|:---:|:---:|:---:|:---:|:---:|
| Llama3.1-8B-Instruct | 84.5 | 51.9 | 9/40 | 2/30 | 12.7 |
| OpenMath2-Llama3.1-8B ([nemo](https://huggingface.co/nvidia/OpenMath2-Llama3.1-8B-nemo) \| [HF](https://huggingface.co/nvidia/OpenMath2-Llama3.1-8B)) | 91.7 | 67.8 | 16/40 | 3/30 | 22.0 |
| + majority@256 | 94.1 | 76.1 | 23/40 | 3/30 | 24.6 |
| Llama3.1-70B-Instruct | 95.8 | 67.9 | 19/40 | 6/30 | 19.0 |
| OpenMath2-Llama3.1-70B ([nemo](https://huggingface.co/nvidia/OpenMath2-Llama3.1-70B-nemo) \| [HF](https://huggingface.co/nvidia/OpenMath2-Llama3.1-70B)) | 94.9 | 71.9 | 20/40 | 4/30 | 23.1 |
| + majority@256 | 96.0 | 79.6 | 24/40 | 6/30 | 27.6 |

The pipeline we used to produce the data and models is fully open-sourced!

- [Code](https://github.com/NVIDIA/NeMo-Skills)
- [Models](https://huggingface.co/collections/nvidia/openmath-2-66fb142317d86400783d2c7b)
- [Dataset](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)

## Reproducing our results

We provide [all instructions](https://nvidia.github.io/NeMo-Skills/openmathinstruct2/) to fully reproduce our results, including data generation.

## Citation

If you find our work useful, please consider citing us!

```bibtex
@article{toshniwal2024openmath2,
  title   = {OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data},
  author  = {Shubham Toshniwal and Wei Du and Ivan Moshkov and Branislav Kisacanin and Alexan Ayrapetyan and Igor Gitman},
  year    = {2024},
  journal = {arXiv preprint arXiv:2410.01560}
}
```
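
The recommendation in the Note section of the dataset card, to filter out extremely long questions, could be sketched roughly as follows. This is a minimal illustration and not the authors' data preparation command (the linked NeMo-Skills documentation covers that). The names `filter_long_problems` and `whitespace_token_count` are hypothetical, and the whitespace counter is only a stand-in for a real Llama tokenizer so that the example runs offline; the only details taken from the card are the `problem` field and the 1024-token threshold.

```python
from typing import Callable, Dict, Iterable, Iterator

# Threshold quoted in the Note: questions longer than 1024 Llama tokens.
MAX_QUESTION_TOKENS = 1024


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words.

    A real pipeline would count tokens with a Llama tokenizer instead.
    """
    return len(text.split())


def filter_long_problems(
    rows: Iterable[Dict[str, str]],
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_QUESTION_TOKENS,
) -> Iterator[Dict[str, str]]:
    """Yield only the rows whose `problem` field has at most `max_tokens` tokens."""
    for row in rows:
        if count_tokens(row["problem"]) <= max_tokens:
            yield row


# Tiny in-memory sample that mimics the dataset's `problem` field.
sample_rows = [
    {"problem": "What is 2 + 2?", "expected_answer": "4"},
    {"problem": "word " * 2000, "expected_answer": "0"},  # artificially long
]

kept = list(filter_long_problems(sample_rows, whitespace_token_count))
print(f"kept {len(kept)} of {len(sample_rows)} rows")
```

With the real data, `count_tokens` would wrap whichever Llama tokenizer is available to you. Because the function is a generator that reads each row once, it should also work on a split loaded with `streaming=True`, without holding the whole dataset in memory.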

Provider:

maas

Created:

2024-10-09
