mgsm
ModelScope community · updated 2026-05-16 · indexed 2025-11-22
Download link: https://modelscope.cn/datasets/evalscope/mgsm
# Dataset Card for MGSM
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
## Dataset Description
- **Homepage:** https://openai.com/blog/grade-school-math/
- **Repository:** https://github.com/openai/grade-school-math
- **Paper:** https://arxiv.org/abs/2110.14168
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]
### Dataset Summary
Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems, proposed in the paper [Language models are multilingual chain-of-thought reasoners](http://arxiv.org/abs/2210.03057).
The same 250 problems from [GSM8K](https://arxiv.org/abs/2110.14168) were each translated by human annotators into 10 languages. The 10 languages are:
- Spanish
- French
- German
- Russian
- Chinese
- Japanese
- Thai
- Swahili
- Bengali
- Telugu
GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade-school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
You can find the inputs and targets for each of the ten languages (and English) as `.tsv` files.
We also include few-shot exemplars, manually translated into each language, in `exemplars.py`.
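As a minimal sketch of reading one of these `.tsv` files, the snippet below assumes each row holds a question and its integer target separated by a tab, with no header row. The column layout, the file name `mgsm_en.tsv`, and the helper `read_mgsm_tsv` are assumptions for illustration, not details stated in this card.

```python
import csv
import io


def read_mgsm_tsv(handle):
    """Parse tab-separated rows of (question, numeric target) into dicts.

    Assumption: two columns per row, integer targets, and no header line.
    """
    rows = []
    # QUOTE_NONE keeps apostrophes and quotation marks in questions untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for record in reader:
        if len(record) != 2:
            continue  # skip blank or malformed lines
        question, target = record
        # Targets may be written with thousands separators, e.g. "1,000".
        rows.append({"question": question,
                     "answer_number": int(target.replace(",", ""))})
    return rows


# In-memory stand-in for a file such as a hypothetical `mgsm_en.tsv`.
sample = io.StringIO(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\t11\n"
)
examples = read_mgsm_tsv(sample)
print(examples[0]["answer_number"])
```

To read a real file, the same function could be called on `open(path, encoding="utf-8", newline="")` instead of the in-memory sample.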
### Supported Tasks and Leaderboards
[Needs More Information]
### Languages
The same 250 problems from [GSM8K](https://arxiv.org/abs/2110.14168) were each translated by human annotators into 10 languages. The 10 languages are:
- Spanish
- French
- German
- Russian
- Chinese
- Japanese
- Thai
- Swahili
- Bengali
- Telugu
## Dataset Structure
### Data Instances
Each instance in the train split contains:
- a string for the grade-school level math question
- a string for the corresponding answer with chain-of-thought steps.
- the numeric solution to the question
- the equation solution to the question
```python
{'question': 'Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?',
'answer': 'Step-by-Step Answer: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.',
'answer_number': 11,
'equation_solution': '5 + 6 = 11.'}
```
Each instance in the test split contains:
- a string for the grade-school level math question
- the numeric solution to the question
```python
{'question': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
'answer': None,
'answer_number': 18,
'equation_solution': None}
```
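The two instance shapes above suggest one way to assemble a few-shot chain-of-thought prompt: train exemplars already carry their `Question:` and `Step-by-Step Answer:` prefixes, while test questions do not. The following is a sketch under that assumption; the function name `build_few_shot_prompt` and the blank-line separator are hypothetical choices, and the English prefixes would need to be swapped for other languages.

```python
def build_few_shot_prompt(exemplars, test_question):
    """Join train exemplars and one test question into a single prompt string."""
    blocks = []
    for ex in exemplars:
        # Train questions and answers already start with their own prefixes.
        blocks.append(f"{ex['question']}\n{ex['answer']}")
    # Test questions carry no prefix, so add the ones used by the exemplars.
    blocks.append(f"Question: {test_question}\nStep-by-Step Answer:")
    return "\n\n".join(blocks)


train_example = {
    "question": "Question: Roger has 5 tennis balls. He buys 2 more cans of "
                "tennis balls. Each can has 3 tennis balls. How many tennis "
                "balls does he have now?",
    "answer": "Step-by-Step Answer: Roger started with 5 balls. 2 cans of 3 "
              "tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.",
}
test_question = (
    "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every "
    "morning and bakes muffins for her friends every day with four. She sells "
    "the remainder at the farmers' market daily for $2 per fresh duck egg. "
    "How much in dollars does she make every day at the farmers' market?"
)

prompt = build_few_shot_prompt([train_example], test_question)
print(prompt)
```

Ending the prompt on the answer prefix leaves the model to continue with its own reasoning steps and final answer.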
### Data Fields
The data fields are the same across the `train` and `test` splits.
- `question`: The text of a grade-school math problem.
- `answer`: The full solution to the `question`, with step-by-step reasoning ending in the final numeric solution. `None` in the test split.
- `answer_number`: The numeric solution to the `question`.
- `equation_solution`: The equation that produces the solution to the `question`. `None` in the test split.
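Because `answer_number` is the only target available in the test split, scoring usually comes down to comparing it with a number pulled from the model's output. The sketch below takes the last number in the output as the prediction; that heuristic, the regular expression, and the helper names are assumptions for illustration, not an official scoring script.

```python
import re


def extract_final_number(text):
    """Return the last number that appears in `text`, or None if there is none."""
    # Digits with optional thousands separators and an optional decimal part.
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    value = float(matches[-1].replace(",", ""))
    # Return whole numbers as int so they compare cleanly with `answer_number`.
    return int(value) if value.is_integer() else value


def is_correct(model_output, answer_number):
    """Exact-match check of the extracted number against `answer_number`."""
    predicted = extract_final_number(model_output)
    return predicted is not None and predicted == answer_number


print(extract_final_number("5 + 6 = 11. The answer is 11."))
print(is_correct("She makes $18 every day.", 18))
```

A last-number heuristic can misfire when the output keeps talking after the answer, so a stricter pattern tied to a phrase such as "The answer is" may be preferable in practice.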
### Data Splits
- The train split contains 8 few-shot exemplars per language, each manually translated into that language.
- The test split contains the same 250 GSM8K problems, translated by human annotators into each of the 10 languages.
| name |train|test |
|--------|----:|---------:|
|en | 8 | 250 |
|es | 8 | 250 |
|fr | 8 | 250 |
|de | 8 | 250 |
|ru | 8 | 250 |
|zh | 8 | 250 |
|ja | 8 | 250 |
|th | 8 | 250 |
|sw | 8 | 250 |
|bn | 8 | 250 |
|te | 8 | 250 |
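As a quick worked check of the table, the totals across all eleven language configurations can be computed directly. The constant names are illustrative only.

```python
# Language codes and per-language split sizes as listed in the table.
LANGUAGES = ["en", "es", "fr", "de", "ru", "zh", "ja", "th", "sw", "bn", "te"]
TRAIN_PER_LANGUAGE = 8
TEST_PER_LANGUAGE = 250

total_train = len(LANGUAGES) * TRAIN_PER_LANGUAGE
total_test = len(LANGUAGES) * TEST_PER_LANGUAGE
print(total_train, total_test)  # 88 2750
```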
## Dataset Creation
### Curation Rationale
[Needs More Information]
### Source Data
#### Initial Data Collection and Normalization
From the [GSM8K paper](https://arxiv.org/abs/2110.14168):
> We initially collected a starting set of a thousand problems and natural language solutions by hiring freelance contractors on Upwork (upwork.com). We then worked with Surge AI (surgehq.ai), an NLP data labeling platform, to scale up our data collection. After collecting the full dataset, we asked workers to re-solve all problems, with no workers re-solving problems they originally wrote. We checked whether their final answers agreed with the original solutions, and any problems that produced disagreements were either repaired or discarded. We then performed another round of agreement checks on a smaller subset of problems, finding that 1.7% of problems still produce disagreements among contractors. We estimate this to be the fraction of problems that contain breaking errors or ambiguities. It is possible that a larger percentage of problems contain subtle errors.
#### Who are the source language producers?
[Needs More Information]
### Annotations
#### Annotation process
[Needs More Information]
#### Who are the annotators?
Surge AI (surgehq.ai)
### Personal and Sensitive Information
[Needs More Information]
## Considerations for Using the Data
### Social Impact of Dataset
[Needs More Information]
### Discussion of Biases
[Needs More Information]
### Other Known Limitations
[Needs More Information]
## Additional Information
### Dataset Curators
[Needs More Information]
### Licensing Information
The GSM8K dataset is licensed under the [MIT License](https://opensource.org/licenses/MIT).
### Citation Information
```bibtex
@article{cobbe2021gsm8k,
title={Training Verifiers to Solve Math Word Problems},
author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
journal={arXiv preprint arXiv:2110.14168},
year={2021}
}
@misc{shi2022language,
title={Language Models are Multilingual Chain-of-Thought Reasoners},
author={Freda Shi and Mirac Suzgun and Markus Freitag and Xuezhi Wang and Suraj Srivats and Soroush Vosoughi and Hyung Won Chung and Yi Tay and Sebastian Ruder and Denny Zhou and Dipanjan Das and Jason Wei},
year={2022},
eprint={2210.03057},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Contributions
Thanks to [@juletx](https://github.com/juletx) for adding this dataset.
Provider: maas
Created: 2025-11-18



