OpenMathInstruct-1
ModelScope community · updated 2025-10-09 · indexed 2025-01-25
Download link:
https://modelscope.cn/datasets/nv-community/OpenMathInstruct-1
# OpenMathInstruct-1
OpenMathInstruct-1 is a math instruction tuning dataset with 1.8M problem-solution pairs
generated using the permissively licensed [Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) model.
The problems are from [GSM8K](https://github.com/openai/grade-school-math)
and [MATH](https://github.com/hendrycks/math) training subsets and the solutions
are synthetically generated by allowing the Mixtral model to use a mix of text reasoning and
code blocks executed by a Python interpreter.
The dataset is split into train and validation subsets that we used in the ablation experiments.
Together, the two subsets cover the full training sets of GSM8K and MATH.
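A minimal sketch of loading both subsets and merging them back into the full training data. It assumes the Hugging Face `datasets` library is installed and that the Hub copy exposes splits named `train` and `validation`; the function name `load_full_training_data` is hypothetical.

```python
def load_full_training_data(repo_id: str = "nvidia/OpenMathInstruct-1"):
    """Load both splits from the Hugging Face Hub and concatenate them.

    Assumptions: the `datasets` library is installed, and the Hub copy
    exposes splits named "train" and "validation".
    """
    # Imported lazily so that defining this function does not require `datasets`.
    from datasets import concatenate_datasets, load_dataset

    splits = load_dataset(repo_id)  # downloads the data on first use
    return concatenate_datasets([splits["train"], splits["validation"]])
```

Calling `load_full_training_data()` triggers a download of the whole dataset, so it is best run once and cached.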
The OpenMathInstruct-1 dataset contains the following fields:
- **question**: original question from either GSM8K or MATH training set.
- **generated_solution**: the synthetically generated solution that uses a mix of text reasoning and code blocks.
- **expected_answer**: the ground-truth answer provided in the original dataset.
- **predicted_answer**: the answer predicted by the Mixtral model in the corresponding solution (extracted from `\boxed{}`).
- **error_message**: `<not_executed>` if code was not used. Otherwise it is empty or contains a Python exception
from the corresponding code block. A `timeout` string indicates that the code block took longer than 10 seconds to
execute. In the current dataset version we always stop generation after any error or timeout.
- **is_correct**: whether the final answer was considered correct by our grading script.
- **dataset**: `gsm8k` or `math`.
- **generation_type**: `without_reference_solution` or `masked_reference_solution`.
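The `predicted_answer` field is described as extracted from `\boxed{}`. The sketch below shows one way such an extraction could work; it is not the authors' grading script. Taking the last `\boxed{}` occurrence and matching nested braces are assumptions, and `extract_boxed` is a hypothetical helper name.

```python
from typing import Optional


def extract_boxed(solution: str) -> Optional[str]:
    r"""Return the contents of the last \boxed{...} in `solution`, or None."""
    marker = r"\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None  # no boxed answer present
    begin = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    for i in range(begin, len(solution)):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return solution[begin:i]
    return None  # braces never balanced


print(extract_boxed(r"So the result is \boxed{\frac{1}{2}}."))
```

Brace counting matters because answers such as `\frac{1}{2}` contain braces of their own, which a simple non-greedy regular expression would cut short.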
We also release the masked solutions used to produce the `generation_type="masked_reference_solution"`
portion of the dataset ([GSM8K-Masked](https://huggingface.co/datasets/nvidia/OpenMath-GSM8K-masked),
[MATH-Masked](https://huggingface.co/datasets/nvidia/OpenMath-MATH-masked)).
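Selecting a slice of the data by `dataset`, `generation_type`, and `is_correct` can be sketched as follows. The field names come from the list above; the value types in `Record`, the `select` helper, and the sample rows are assumptions made for illustration, and the rows are invented rather than taken from the dataset.

```python
from typing import Iterable, Iterator, List, Optional, TypedDict


class Record(TypedDict):
    """One row, using the documented field names (value types are assumed)."""
    question: str
    generated_solution: str
    expected_answer: str
    predicted_answer: str
    error_message: str
    is_correct: bool
    dataset: str          # "gsm8k" or "math"
    generation_type: str  # "without_reference_solution" or "masked_reference_solution"


def select(
    records: Iterable[Record],
    *,
    dataset: Optional[str] = None,
    generation_type: Optional[str] = None,
    only_correct: bool = True,
) -> Iterator[Record]:
    """Yield records matching the filters; a filter set to None is skipped."""
    for rec in records:
        if only_correct and not rec["is_correct"]:
            continue
        if dataset is not None and rec["dataset"] != dataset:
            continue
        if generation_type is not None and rec["generation_type"] != generation_type:
            continue
        yield rec


def _row(question, expected, predicted, dataset, generation_type) -> Record:
    """Build an invented sample row; correctness here is plain string equality."""
    return Record(
        question=question,
        generated_solution=f"The answer is \\boxed{{{predicted}}}.",
        expected_answer=expected,
        predicted_answer=predicted,
        error_message="",
        is_correct=(expected == predicted),
        dataset=dataset,
        generation_type=generation_type,
    )


# Made-up rows for illustration only.
SAMPLE: List[Record] = [
    _row("What is 2 + 3?", "5", "5", "gsm8k", "masked_reference_solution"),
    _row("What is 7 * 6?", "42", "41", "gsm8k", "without_reference_solution"),
    _row("Solve x + 1 = 3.", "2", "2", "math", "without_reference_solution"),
]

masked = list(select(SAMPLE, generation_type="masked_reference_solution"))
print(len(masked), masked[0]["question"])
```

Keeping `only_correct=True` as the default reflects a common fine-tuning choice of training only on solutions whose final answer was graded correct; pass `only_correct=False` to inspect failures as well.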
See our [paper](https://arxiv.org/abs/2402.10176) to learn more details!
## OpenMath models
To demonstrate the quality of this dataset, we release a series of OpenMath models
trained on this data (a combination of train and validation splits to allow comparison with prior work).
| model | GSM8K (greedy) | MATH (greedy) | GSM8K (majority@50) | MATH (majority@50) |
| ---- | ---- | ---- | ---- | ---- |
| OpenMath-CodeLlama-7B (nemo \| HF) | 75.9 | 43.6 | 84.8 | 55.6 |
| OpenMath-Mistral-7B (nemo \| HF) | 80.2 | 44.5 | 86.9 | 57.2 |
| OpenMath-CodeLlama-13B (nemo \| HF) | 78.8 | 45.5 | 86.8 | 57.6 |
| OpenMath-CodeLlama-34B (nemo \| HF) | 80.7 | 48.3 | 88.0 | 60.2 |
| OpenMath-Llama2-70B (nemo \| HF) | 84.7 | 46.3 | 90.1 | 58.3 |
| OpenMath-CodeLlama-70B (nemo \| HF) | 84.6 | 50.7 | 90.8 | 60.4 |
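This card does not define majority@50. A common reading, assumed here, is that 50 solutions are sampled per problem and the most frequent final answer is the one that gets scored. A minimal sketch of that voting step follows; `majority_answer` is a hypothetical helper, and any answer normalization a real grader would apply is left out.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_answer(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-None answer; ties go to the earliest seen."""
    # Samples that produced no answer (None) do not get a vote.
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Five sampled answers for one problem (invented values).
print(majority_answer(["42", "41", "42", None, "40"]))
```

Greedy decoding, by contrast, scores the single solution produced without sampling, which is why the two settings are reported in separate columns.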
The pipeline we used to produce the data and models is fully open-sourced!
- [Code](https://github.com/Kipok/NeMo-Skills)
- [Models](https://huggingface.co/collections/nvidia/openmath-65c5619de2ba059be0775014)
- [Dataset](https://huggingface.co/datasets/nvidia/OpenMathInstruct-1)
## Reproducing our results
We provide [all instructions](https://github.com/Kipok/NeMo-Skills/blob/main/docs/reproducing-results.md)
to fully reproduce our results, including data generation.
## Generating similar datasets
To generate similar datasets for other tasks or to learn more about our code, read through the docs below.
- [NeMo-Skills Pipeline](https://github.com/Kipok/NeMo-Skills)
- [Generating synthetic data](https://github.com/Kipok/NeMo-Skills/blob/main/docs/synthetic-data-generation.md)
- [Finetuning models](https://github.com/Kipok/NeMo-Skills/blob/main/docs/finetuning.md)
- [Evaluating models](https://github.com/Kipok/NeMo-Skills/blob/main/docs/evaluation.md)
## Citation
If you find our work useful, please consider citing us!
```bibtex
@article{toshniwal2024openmath,
title = {OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset},
author = {Shubham Toshniwal and Ivan Moshkov and Sean Narenthiran and Daria Gitman and Fei Jia and Igor Gitman},
year = {2024},
journal = {arXiv preprint arXiv: Arxiv-2402.10176}
}
```
## License
The use of this dataset is governed by the [NVIDIA License](LICENSE), which permits commercial usage.
Provider: maas
Created: 2025-01-20



