orca-math-word-problems-200k
收藏魔搭社区2026-01-08 更新2024-05-15 收录
下载链接:
https://modelscope.cn/datasets/microsoft/orca-math-word-problems-200k
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Card
<!-- Provide a quick summary of the dataset. -->
This dataset contains ~200K grade school math word problems. All the answers in this dataset is generated using Azure GPT4-Turbo. Please refer to [Orca-Math: Unlocking the potential of
SLMs in Grade School Math](https://arxiv.org/pdf/2402.14830.pdf) for details about the dataset construction.
### Dataset Description
- **Curated by:** Microsoft
- **Language(s) (NLP):** English
- **License:** MIT
### Dataset Sources
<!-- Provide the basic links for the dataset. -->
- **Repository:** [microsoft/orca-math-word-problems-200k](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k)
- **Paper:** [Orca-Math: Unlocking the potential of
SLMs in Grade School Math](https://arxiv.org/pdf/2402.14830.pdf)
### Direct Use
<!-- This section describes suitable use cases for the dataset. -->
This dataset has been designed to enhance the mathematical abilities of language models. It aims to provide a robust foundation for language models to excel in mathematical problem-solving.
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
This dataset is not intended for use in educational systems or organizations.
## Dataset Structure
### Data Instances
A typical data entry in the dataset consists of a question and its corresponding answer. Below is an example from the dataset:
```python
{'question': 'In a highly contested election having multiple candidates, Mr. Jackson, one of the losing candidates, received 3,485,782 votes, which accounted for precisely 38.7 percent of all votes. To have achieved a victory, he would have needed to secure at least 51 percent of all votes. Approximately, what percent of the remaining unsecured votes would Mr. Jackson have needed to accumulate to reach this victory threshold?',
'answer': "First, let's find out the total number of votes cast in the election. Since Mr. Jackson received 38.7% of all votes, and that amounted to 3,485,782 votes, we can set up the following equation to find the total number of votes (T):\n\n0.387 * T = 3,485,782\n\nNow, solve for T:\n\nT = 3,485,782 / 0.387\nT ≈ 9,000,467 votes (total number of votes cast)\n\nTo win, Mr. Jackson would have needed 51% of the total votes. Let's calculate that amount:\n\n0.51 * T = 0.51 * 9,000,467\n0.51 * T ≈ 4,590,238 votes needed to win\n\nNow, let's find out how many more votes Mr. Jackson needed to reach this winning threshold:\n\nVotes needed to win - Votes Mr. Jackson received = Additional votes needed\n4,590,238 - 3,485,782 = 1,104,456 additional votes needed\n\nNow, let's find out what percentage of the remaining unsecured votes this number represents. The remaining unsecured votes are the votes that were not for Mr. Jackson, which is 100% - 38.7% = 61.3% of the total votes.\n\n61.3% of the total votes is the remaining unsecured votes:\n\n0.613 * T = 0.613 * 9,000,467\n0.613 * T ≈ 5,514,686 votes were unsecured\n\nNow, we'll calculate the percentage of these unsecured votes that the additional votes needed represent:\n\n(Additional votes needed / Unsecured votes) * 100 = Percentage of unsecured votes needed\n(1,104,456 / 5,514,686) * 100 ≈ 20.03%\n\nSo, Mr. Jackson would have needed approximately 20.03% of the remaining unsecured votes to reach the victory threshold of 51%."}
```
### Data Fields
The dataset comprises the following fields:
- `question`: a string containing the question to be answered.
- `answer`: a string containing the answer to the corresponding question.
### Data Splits
The dataset is split into a training set. The number of rows in each split is as follows:
- `train`: 200,035 rows
The `DatasetDict` structure for the dataset is as follows:
```python
DatasetDict({
'train': Dataset({
features: ['question', 'answer'],
num_rows: 200035
})
})
```
Each split in the `DatasetDict` contains a `Dataset` object with the specified features and number of rows.
## Dataset Creation
Please refer to [Orca-Math: Unlocking the potential of
SLMs in Grade School Math](https://arxiv.org/pdf/2402.14830.pdf) for details about the dataset construction.
### Source Data
- [Lila](https://huggingface.co/datasets/allenai/lila)
- [DMath](https://arxiv.org/ftp/arxiv/papers/2106/2106.15772.pdf)
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
Please refer to [Orca-Math: Unlocking the potential of
SLMs in Grade School Math](https://arxiv.org/pdf/2402.14830.pdf) for details about the dataset construction.
#### Who are the source data producers?
<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->
Microsoft
#### Annotation process
<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->
We expanded a seed set of questions using Azure GPT-4 Trubo. The answers to those questions are generated using Azure GPT-4 Trubo.
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
None
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
This dataset is in English and contains only math word problems.
## Citation
If you find this work useful in your method, you can cite the paper as below:
```
@misc{mitra2024orcamath,
title={Orca-Math: Unlocking the potential of SLMs in Grade School Math},
author={Arindam Mitra and Hamed Khanpour and Corby Rosset and Ahmed Awadallah},
year={2024},
eprint={2402.14830},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Dataset Card Contact
[Arindam Mitra](armitra@microsoft.com)
# 数据集卡片
<!-- 请提供数据集的简要概述。 -->
本数据集收录约20万道中小学数学应用题,所有答案均通过Azure GPT-4 Turbo生成。有关数据集构建的详细信息,请参阅论文《Orca-Math: Unlocking the potential of SLMs in Grade School Math》(https://arxiv.org/pdf/2402.14830.pdf)。
### 数据集概述
- **整理方**:微软(Microsoft)
- **自然语言语种**:英语
- **授权协议**:MIT
### 数据集来源
<!-- 请提供数据集的基础链接。 -->
- **代码仓库**:[microsoft/orca-math-word-problems-200k](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k)
- **论文**:[Orca-Math: Unlocking the potential of SLMs in Grade School Math](https://arxiv.org/pdf/2402.14830.pdf)
### 适用场景
<!-- 本节描述数据集的合适使用场景。 -->
本数据集旨在提升语言模型的数学解题能力,为语言模型在数学问题求解领域的优异表现提供坚实基础。
### 不适用场景
<!-- 本节说明误用、恶意使用,以及本数据集无法良好适配的使用场景。 -->
本数据集不适用于教育系统或教育机构。
## 数据集结构
### 数据示例
数据集中的典型条目由一道题目及其对应答案组成。以下为数据集中的一则示例:
python
{'question': 'In a highly contested election having multiple candidates, Mr. Jackson, one of the losing candidates, received 3,485,782 votes, which accounted for precisely 38.7 percent of all votes. To have achieved a victory, he would have needed to secure at least 51 percent of all votes. Approximately, what percent of the remaining unsecured votes would Mr. Jackson have needed to accumulate to reach this victory threshold?',
'answer': "First, let's find out the total number of votes cast in the election. Since Mr. Jackson received 38.7% of all votes, and that amounted to 3,485,782 votes, we can set up the following equation to find the total number of votes (T):
0.387 * T = 3,485,782
Now, solve for T:
T = 3,485,782 / 0.387
T ≈ 9,000,467 votes (total number of votes cast)
To win, Mr. Jackson would have needed 51% of the total votes. Let's calculate that amount:
0.51 * T = 0.51 * 9,000,467
0.51 * T ≈ 4,590,238 votes needed to win
Now, let's find out how many more votes Mr. Jackson needed to reach this winning threshold:
Votes needed to win - Votes Mr. Jackson received = Additional votes needed
4,590,238 - 3,485,782 = 1,104,456 additional votes needed
Now, let's find out what percentage of the remaining unsecured votes this number represents. The remaining unsecured votes are the votes that were not for Mr. Jackson, which is 100% - 38.7% = 61.3% of the total votes.
61.3% of the total votes is the remaining unsecured votes:
0.613 * T = 0.613 * 9,000,467
0.613 * T ≈ 5,514,686 votes were unsecured
Now, we'll calculate the percentage of these unsecured votes that the additional votes needed represent:
(Additional votes needed / Unsecured votes) * 100 = Percentage of unsecured votes needed
(1,104,456 / 5,514,686) * 100 ≈ 20.03%
So, Mr. Jackson would have needed approximately 20.03% of the remaining unsecured votes to reach the victory threshold of 51%."}
### 数据字段
本数据集包含以下字段:
- `question`:字符串类型,包含待解答的数学问题
- `answer`:字符串类型,包含对应问题的完整解答
### 数据拆分
本数据集仅划分为训练集,各拆分的样本量如下:
- `train`:200035条数据
本数据集的`DatasetDict`结构如下:
python
DatasetDict({
'train': Dataset({
features: ['question', 'answer'],
num_rows: 200035
})
})
`DatasetDict`中的每个拆分均包含一个`Dataset`对象,包含上述指定的字段与样本数量。
## 数据集构建
有关数据集构建的详细细节,请参阅论文《Orca-Math: Unlocking the potential of SLMs in Grade School Math》(https://arxiv.org/pdf/2402.14830.pdf)。
### 源数据来源
- [Lila](https://huggingface.co/datasets/allenai/lila)
- [DMath](https://arxiv.org/ftp/arxiv/papers/2106/2106.15772.pdf)
#### 数据收集与处理流程
有关数据集构建的详细细节,请参阅上述论文。
#### 源数据生产者
原始数据的生产者为微软。
#### 标注流程
我们通过Azure GPT-4 Turbo扩充了初始种子问题集,并使用Azure GPT-4 Turbo生成所有题目的答案。
#### 个人与敏感信息
本数据集未包含任何个人、敏感或隐私信息。
## 偏差、风险与局限
本数据集仅收录英语数学应用题,存在相应的技术与社会技术局限。
## 引用格式
若您的研究中使用了本数据集,请按以下格式引用该论文:
@misc{mitra2024orcamath,
title={Orca-Math: Unlocking the potential of SLMs in Grade School Math},
author={Arindam Mitra and Hamed Khanpour and Corby Rosset and Ahmed Awadallah},
year={2024},
eprint={2402.14830},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
## 数据集卡片联系人
[Arindam Mitra](armitra@microsoft.com)
提供机构:
maas
创建时间:
2025-07-22



