---
annotations_creators:
- found
language_creators:
- found
- expert-generated
language:
- en
- es
- fr
- de
- ru
- zh
- ja
- th
- sw
- bn
- te
license:
- cc-by-sa-4.0
multilinguality:
- multilingual
size_categories:
- 1K<n<10K
source_datasets:
- extended|gsm8k
task_categories:
- text2text-generation
task_ids: []
paperswithcode_id: multi-task-language-understanding-on-mgsm
pretty_name: Multilingual Grade School Math Benchmark (MGSM)
tags:
- math-word-problems
dataset_info:
- config_name: en
  features:
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: answer_number
    dtype: int32
  - name: equation_solution
    dtype: string
  splits:
  - name: train
    num_bytes: 3963202
    num_examples: 8
  - name: test
    num_bytes: 713732
    num_examples: 250
  download_size: 4915944
  dataset_size: 4676934
- config_name: es
  features:
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: answer_number
    dtype: int32
  - name: equation_solution
    dtype: string
  splits:
  - name: train
    num_bytes: 3963202
    num_examples: 8
  - name: test
    num_bytes: 713732
    num_examples: 250
  download_size: 4915944
  dataset_size: 4676934
---
# Dataset Card for MGSM
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
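- **Homepage:** https://openai.com/blog/grade-school-math/ (GSM8K, the source dataset)
- **Repository:** https://github.com/openai/grade-school-math (GSM8K, the source dataset)
- **Paper:** https://arxiv.org/abs/2210.03057 (MGSM), https://arxiv.org/abs/2110.14168 (GSM8K)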
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]
### Dataset Summary
Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems, proposed in the paper [Language Models are Multilingual Chain-of-Thought Reasoners](https://arxiv.org/abs/2210.03057).
The same 250 problems from [GSM8K](https://arxiv.org/abs/2110.14168) are each translated by human annotators into 10 languages. The 10 languages are:
- Spanish
- French
- German
- Russian
- Chinese
- Japanese
- Thai
- Swahili
- Bengali
- Telugu
GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
The inputs and targets for each of the ten languages (and English) are available as `.tsv` files.
Few-shot exemplars, manually translated into each language, are included in `exemplars.py`.
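As a rough illustration of working with the `.tsv` files, the sketch below parses tab-separated rows into dictionaries shaped like the test-split instances. The column layout (question, then numeric answer, no header row) is an assumption rather than something this card specifies, and `read_mgsm_tsv` is a hypothetical helper name. The sample rows reuse the two questions shown in this card purely for illustration.

```python
import csv
import io

# Illustrative rows only. Assumed layout: "question<TAB>numeric answer",
# one pair per line, no header row.
SAMPLE_TSV = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\t11\n"
    "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every "
    "morning and bakes muffins for her friends every day with four. She sells "
    "the remainder at the farmers' market daily for $2 per fresh duck egg. How "
    "much in dollars does she make every day at the farmers' market?\t18\n"
)


def read_mgsm_tsv(handle):
    """Parse tab-separated rows into dicts that mirror the test-split fields."""
    # QUOTE_NONE keeps apostrophes and quotation marks in questions untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for question, number in reader:
        rows.append({
            "question": question,
            "answer": None,
            # Drop thousands separators before converting, in case a file uses them.
            "answer_number": int(number.replace(",", "")),
            "equation_solution": None,
        })
    return rows


examples = read_mgsm_tsv(io.StringIO(SAMPLE_TSV))
print(len(examples), examples[0]["answer_number"], examples[1]["answer_number"])
```

For a file on disk, the same function can be given a handle opened with `encoding="utf-8"` and `newline=""`.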
### Supported Tasks and Leaderboards
[Needs More Information]
### Languages
MGSM covers English plus the 10 languages into which human annotators translated the problems. Each language is a separate configuration, named by its language code:
- English (`en`)
- Spanish (`es`)
- French (`fr`)
- German (`de`)
- Russian (`ru`)
- Chinese (`zh`)
- Japanese (`ja`)
- Thai (`th`)
- Swahili (`sw`)
- Bengali (`bn`)
- Telugu (`te`)
## Dataset Structure
### Data Instances
Each instance in the train split contains:
- a string for the grade-school level math question
- a string for the corresponding answer with chain-of-thought steps.
- the numeric solution to the question
- the equation solution to the question
```python
{'question': 'Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?',
'answer': 'Step-by-Step Answer: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.',
'answer_number': 11,
'equation_solution': '5 + 6 = 11.'}
```
Each instance in the test split contains the following; its `answer` and `equation_solution` fields are `None`:
- a string for the grade-school level math question
- the numeric solution to the question
```python
{'question': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
'answer': None,
'answer_number': 18,
'equation_solution': None}
```
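One way the two instance types can be combined is to use train exemplars as few-shot demonstrations in front of a test question. The sketch below is a minimal illustration under that assumption, not the prompt format used in the paper. `build_prompt` is a hypothetical helper, and the `Question:` / `Step-by-Step Answer:` cues are taken from the English train instance above.

```python
def build_prompt(exemplars, test_question):
    """Join worked exemplars and append the unanswered test question."""
    # Train instances already carry their "Question:" and
    # "Step-by-Step Answer:" prefixes, so they are used verbatim.
    blocks = [f"{ex['question']}\n{ex['answer']}" for ex in exemplars]
    # The test question has no prefix, so add one plus an open answer cue.
    blocks.append(f"Question: {test_question}\nStep-by-Step Answer:")
    return "\n\n".join(blocks)


exemplars = [{
    "question": "Question: Roger has 5 tennis balls. He buys 2 more cans of "
                "tennis balls. Each can has 3 tennis balls. How many tennis "
                "balls does he have now?",
    "answer": "Step-by-Step Answer: Roger started with 5 balls. 2 cans of 3 "
              "tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.",
    "answer_number": 11,
    "equation_solution": "5 + 6 = 11.",
}]

test_question = (
    "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every "
    "morning and bakes muffins for her friends every day with four. She sells "
    "the remainder at the farmers' market daily for $2 per fresh duck egg. How "
    "much in dollars does she make every day at the farmers' market?"
)

prompt = build_prompt(exemplars, test_question)
print(prompt)
```

For non-English configurations the prefixes would presumably need to match the translated exemplars of that language.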
### Data Fields
The `train` and `test` splits share the same fields.
- `question`: The question string for a grade-school math problem.
- `answer`: The full solution string for the `question`, with chain-of-thought reasoning steps and the final numeric solution. `None` in the `test` split.
- `answer_number`: The numeric solution to the `question`.
- `equation_solution`: The equation that yields the solution to the `question`. `None` in the `test` split.
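Because the test split only carries `answer_number`, scoring a model usually means pulling a number out of its free-text output and comparing it with that field. The sketch below shows one common heuristic, taking the last number in the output. This card does not describe the scoring procedure used in the paper, so treat it as an assumption. `extract_last_number` and `is_correct` are hypothetical helper names.

```python
import re

# Optional sign, digits with optional thousands separators, optional decimals.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_last_number(text):
    """Return the last number found in `text` as a float, or None if absent."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(model_output, answer_number):
    """Compare the last number in the model output with `answer_number`."""
    predicted = extract_last_number(model_output)
    return predicted is not None and predicted == float(answer_number)


output = (
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis "
    "balls. 5 + 6 = 11. The answer is 11."
)
print(is_correct(output, 11))
```

The heuristic assumes ASCII digits, so outputs written with other numeral systems would need to be normalized first.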
### Data Splits
- The train split includes 8 few-shot exemplars per language, manually translated into each language.
- The test split includes the same 250 problems from GSM8K, translated by human annotators into 10 languages.
| name |train|test |
|--------|----:|---------:|
|en | 8 | 250 |
|es | 8 | 250 |
|fr | 8 | 250 |
|de | 8 | 250 |
|ru | 8 | 250 |
|zh | 8 | 250 |
|ja | 8 | 250 |
|th | 8 | 250 |
|sw | 8 | 250 |
|bn | 8 | 250 |
|te | 8 | 250 |
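As a quick sanity check on the table, the totals across all configurations can be computed directly. The configuration codes and per-split counts below are copied from the table, and the variable names are illustrative.

```python
# Configuration codes and per-configuration counts, as listed in the table above.
CONFIGS = ["en", "es", "fr", "de", "ru", "zh", "ja", "th", "sw", "bn", "te"]
TRAIN_PER_CONFIG = 8
TEST_PER_CONFIG = 250

total_train = len(CONFIGS) * TRAIN_PER_CONFIG
total_test = len(CONFIGS) * TEST_PER_CONFIG
total = total_train + total_test

print(total_train, total_test, total)  # 88 2750 2838
```

The overall total falls inside the `1K<n<10K` size category declared in the card metadata.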
## Dataset Creation
### Curation Rationale
[Needs More Information]
### Source Data
#### Initial Data Collection and Normalization
From the GSM8K paper:
> We initially collected a starting set of a thousand problems and natural language solutions by hiring freelance contractors on Upwork (upwork.com). We then worked with Surge AI (surgehq.ai), an NLP data labeling platform, to scale up our data collection. After collecting the full dataset, we asked workers to re-solve all problems, with no workers re-solving problems they originally wrote. We checked whether their final answers agreed with the original solutions, and any problems that produced disagreements were either repaired or discarded. We then performed another round of agreement checks on a smaller subset of problems, finding that 1.7% of problems still produce disagreements among contractors. We estimate this to be the fraction of problems that contain breaking errors or ambiguities. It is possible that a larger percentage of problems contain subtle errors.
#### Who are the source language producers?
[Needs More Information]
### Annotations
#### Annotation process
[Needs More Information]
#### Who are the annotators?
Surge AI (surgehq.ai)
### Personal and Sensitive Information
[Needs More Information]
## Considerations for Using the Data
### Social Impact of Dataset
[Needs More Information]
### Discussion of Biases
[Needs More Information]
### Other Known Limitations
[Needs More Information]
## Additional Information
### Dataset Curators
[Needs More Information]
### Licensing Information
The GSM8K dataset is licensed under the [MIT License](https://opensource.org/licenses/MIT). The metadata of this card lists MGSM itself under `cc-by-sa-4.0` (CC BY-SA 4.0).
### Citation Information
```bibtex
@article{cobbe2021gsm8k,
title={Training Verifiers to Solve Math Word Problems},
author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
journal={arXiv preprint arXiv:2110.14168},
year={2021}
}
@misc{shi2022language,
title={Language Models are Multilingual Chain-of-Thought Reasoners},
author={Freda Shi and Mirac Suzgun and Markus Freitag and Xuezhi Wang and Suraj Srivats and Soroush Vosoughi and Hyung Won Chung and Yi Tay and Sebastian Ruder and Denny Zhou and Dipanjan Das and Jason Wei},
year={2022},
eprint={2210.03057},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Contributions
Thanks to [@juletx](https://github.com/juletx) for adding this dataset.