
OpenMathInstruct-2

ModelScope community · updated 2026-05-16 · indexed 2024-10-12
Download link:
https://modelscope.cn/datasets/AI-ModelScope/OpenMathInstruct-2
# OpenMathInstruct-2

OpenMathInstruct-2 is a math instruction tuning dataset with 14M problem-solution pairs generated using the [Llama3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) model.

The training set problems of [GSM8K](https://github.com/openai/grade-school-math) and [MATH](https://github.com/hendrycks/math) are used for constructing the dataset in the following ways:

- *Solution augmentation*: Generating chain-of-thought solutions for training set problems in GSM8K and MATH.
- *Problem-Solution augmentation*: Generating new problems, followed by solutions for these new problems.

<p>
  <img src="SFT Data Diagram 1.jpg" width="75%" title="Composition of OpenMathInstruct-2">
</p>

The OpenMathInstruct-2 dataset contains the following fields:

- **problem**: Original problem from either the GSM8K or MATH training set, or an augmented problem derived from these training sets.
- **generated_solution**: Synthetically generated solution.
- **expected_answer**: For problems in the training set, it is the ground-truth answer provided in the datasets. **For augmented problems, it is the majority-voting answer.**
- **problem_source**: Whether the problem is taken directly from GSM8K or MATH or is an augmented version derived from either dataset.

<p>
  <img src="scaling_plot.jpg" width="40%" title="Scaling Curve">
</p>

We also release 1M, 2M, and 5M *fair-downsampled* versions of the entire training set, corresponding to points in the scaling plot above. These splits are referred to as **train_1M**, **train_2M**, and **train_5M**. To use one of these subsets, specify it as the split when downloading the data:

```python
from datasets import load_dataset

# Download only the 1M training split
dataset = load_dataset('nvidia/OpenMathInstruct-2', split='train_1M', streaming=True)
```

To download the entire training set and convert it to the jsonl format, use the following code snippet. This might take 20-30 minutes (or more, depending on your network connection) and will use ~20 GB of RAM.

```python
import json

from datasets import load_dataset
from tqdm import tqdm

dataset = load_dataset('nvidia/OpenMathInstruct-2', split='train')

print("Converting dataset to jsonl format")
output_file = "openmathinstruct2.jsonl"
with open(output_file, 'w', encoding='utf-8') as f:
    for item in tqdm(dataset):
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

print(f"Conversion complete. Output saved as {output_file}")
```

Apart from the dataset, we also release the [contamination explorer](https://huggingface.co/spaces/nvidia/OpenMathInstruct-2-explorer) for looking at problems in the OpenMathInstruct-2 dataset that are similar to the [GSM8K](https://huggingface.co/datasets/openai/gsm8k), [MATH](https://github.com/hendrycks/math), [AMC 2023](https://github.com/QwenLM/Qwen2.5-Math/tree/main/evaluation/data/amc23), [AIME 2024](https://artofproblemsolving.com/wiki/index.php/2024_AIME_I), and [Omni-MATH](https://huggingface.co/datasets/KbsdJames/Omni-MATH) test set problems.

See our [paper](https://arxiv.org/abs/2410.01560) to learn more details!

### Note

The released dataset doesn't filter out extremely long questions. After the dataset release, we found that 564 questions (roughly 0.1%) were longer than 1024 Llama tokens. We experimented with removing these questions and didn't see a performance drop (in fact, we observed a minor bump). Dropping these questions helps with memory as well, so we recommend filtering out extremely long questions. We have updated the data preparation commands in our [GitHub documentation](https://nvidia.github.io/NeMo-Skills/openmathinstruct2/dataset/#converting-to-sft-format).

## OpenMath2 models

To demonstrate the quality of this dataset, we release a series of OpenMath2 models trained on this data.

| Model | GSM8K | MATH | AMC 2023 | AIME 2024 | Omni-MATH |
|:---|:---:|:---:|:---:|:---:|:---:|
| Llama3.1-8B-Instruct | 84.5 | 51.9 | 9/40 | 2/30 | 12.7 |
| OpenMath2-Llama3.1-8B ([nemo](https://huggingface.co/nvidia/OpenMath2-Llama3.1-8B-nemo) \| [HF](https://huggingface.co/nvidia/OpenMath2-Llama3.1-8B)) | 91.7 | 67.8 | 16/40 | 3/30 | 22.0 |
| + majority@256 | 94.1 | 76.1 | 23/40 | 3/30 | 24.6 |
| Llama3.1-70B-Instruct | 95.8 | 67.9 | 19/40 | 6/30 | 19.0 |
| OpenMath2-Llama3.1-70B ([nemo](https://huggingface.co/nvidia/OpenMath2-Llama3.1-70B-nemo) \| [HF](https://huggingface.co/nvidia/OpenMath2-Llama3.1-70B)) | 94.9 | 71.9 | 20/40 | 4/30 | 23.1 |
| + majority@256 | 96.0 | 79.6 | 24/40 | 6/30 | 27.6 |

The pipeline we used to produce the data and models is fully open-sourced!

- [Code](https://github.com/NVIDIA/NeMo-Skills)
- [Models](https://huggingface.co/collections/nvidia/openmath-2-66fb142317d86400783d2c7b)
- [Dataset](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)

## Reproducing our results

We provide [all instructions](https://nvidia.github.io/NeMo-Skills/openmathinstruct2/) to fully reproduce our results, including data generation.

## Citation

If you find our work useful, please consider citing us!

```bibtex
@article{toshniwal2024openmath2,
  title   = {OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data},
  author  = {Shubham Toshniwal and Wei Du and Ivan Moshkov and Branislav Kisacanin and Alexan Ayrapetyan and Igor Gitman},
  year    = {2024},
  journal = {arXiv preprint arXiv:2410.01560}
}
```
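The README describes `expected_answer` for augmented problems as a majority-voting answer, and the results table reports majority@256 scores. A minimal sketch of that idea is shown here. It assumes answers are compared as exact strings after stripping whitespace and that ties go to the answer seen first; the answer normalization and tie-breaking used in the actual pipeline are not described in this README, and `majority_answer` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_answer(answers: Iterable[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none.

    Assumption: answers are compared as exact strings after stripping
    whitespace; ties go to the answer that appeared first.
    """
    cleaned = [a.strip() for a in answers if a is not None and a.strip()]
    if not cleaned:
        return None
    # Counter keeps insertion order, so equal counts resolve to the first seen.
    return Counter(cleaned).most_common(1)[0][0]


# Six sampled answers for one problem; one sample produced no answer.
sampled = ["42", " 42", "41", None, "42", "7"]
print(majority_answer(sampled))
```

In practice, mathematically equivalent answers (for example `0.5` and `1/2`) would need to be normalized before counting, which this sketch does not attempt.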

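The note in the README recommends filtering out questions longer than 1024 Llama tokens. The following is a minimal sketch of such a filter, not the NeMo-Skills data preparation command. It assumes the caller supplies a `count_tokens` function; the whitespace counter used here is only a stand-in for a Llama tokenizer, and `filter_long_problems` is a hypothetical name.

```python
from typing import Callable, Dict, Iterable, Iterator


def filter_long_problems(
    records: Iterable[Dict[str, str]],
    count_tokens: Callable[[str], int],
    max_tokens: int = 1024,
) -> Iterator[Dict[str, str]]:
    """Yield only records whose `problem` field has at most `max_tokens` tokens."""
    for record in records:
        if count_tokens(record["problem"]) <= max_tokens:
            yield record


def whitespace_token_count(text: str) -> int:
    # Stand-in only: a real run would count tokens with a Llama tokenizer.
    return len(text.split())


records = [
    {"problem": "What is 2 + 2?", "expected_answer": "4"},
    {"problem": "word " * 2000, "expected_answer": "0"},
]
kept = list(filter_long_problems(records, whitespace_token_count))
print(len(kept))  # prints 1
```

With a Hugging Face tokenizer already loaded as `tokenizer`, `count_tokens` could be something like `lambda text: len(tokenizer(text)["input_ids"])`. Whether special tokens should count toward the 1024 limit is not specified in the note. Because the function is a generator, it can also be applied to a streamed split without loading the full dataset into memory.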
Provider: maas
Created: 2024-10-09