orca-math-word-problems-200k

Name: orca-math-word-problems-200k
Creator: maas
Published: 2026-01-08 10:03:13
License: 暂无描述

魔搭社区2026-01-08 更新2024-05-15 收录

下载链接：

https://modelscope.cn/datasets/microsoft/orca-math-word-problems-200k

下载链接

链接失效反馈

官方服务：

资源简介：

# Dataset Card  This dataset contains ~200K grade school math word problems. All the answers in this dataset is generated using Azure GPT4-Turbo. Please refer to [Orca-Math: Unlocking the potential of SLMs in Grade School Math](https://arxiv.org/pdf/2402.14830.pdf) for details about the dataset construction. ### Dataset Description - **Curated by:** Microsoft - **Language(s) (NLP):** English - **License:** MIT ### Dataset Sources  - **Repository:** [microsoft/orca-math-word-problems-200k](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k) - **Paper:** [Orca-Math: Unlocking the potential of SLMs in Grade School Math](https://arxiv.org/pdf/2402.14830.pdf) ### Direct Use  This dataset has been designed to enhance the mathematical abilities of language models. It aims to provide a robust foundation for language models to excel in mathematical problem-solving. ### Out-of-Scope Use  This dataset is not intended for use in educational systems or organizations. ## Dataset Structure ### Data Instances A typical data entry in the dataset consists of a question and its corresponding answer. Below is an example from the dataset: ```python {'question': 'In a highly contested election having multiple candidates, Mr. Jackson, one of the losing candidates, received 3,485,782 votes, which accounted for precisely 38.7 percent of all votes. To have achieved a victory, he would have needed to secure at least 51 percent of all votes. Approximately, what percent of the remaining unsecured votes would Mr. Jackson have needed to accumulate to reach this victory threshold?', 'answer': "First, let's find out the total number of votes cast in the election. Since Mr. Jackson received 38.7% of all votes, and that amounted to 3,485,782 votes, we can set up the following equation to find the total number of votes (T):\n\n0.387 * T = 3,485,782\n\nNow, solve for T:\n\nT = 3,485,782 / 0.387\nT ≈ 9,000,467 votes (total number of votes cast)\n\nTo win, Mr. Jackson would have needed 51% of the total votes. Let's calculate that amount:\n\n0.51 * T = 0.51 * 9,000,467\n0.51 * T ≈ 4,590,238 votes needed to win\n\nNow, let's find out how many more votes Mr. Jackson needed to reach this winning threshold:\n\nVotes needed to win - Votes Mr. Jackson received = Additional votes needed\n4,590,238 - 3,485,782 = 1,104,456 additional votes needed\n\nNow, let's find out what percentage of the remaining unsecured votes this number represents. The remaining unsecured votes are the votes that were not for Mr. Jackson, which is 100% - 38.7% = 61.3% of the total votes.\n\n61.3% of the total votes is the remaining unsecured votes:\n\n0.613 * T = 0.613 * 9,000,467\n0.613 * T ≈ 5,514,686 votes were unsecured\n\nNow, we'll calculate the percentage of these unsecured votes that the additional votes needed represent:\n\n(Additional votes needed / Unsecured votes) * 100 = Percentage of unsecured votes needed\n(1,104,456 / 5,514,686) * 100 ≈ 20.03%\n\nSo, Mr. Jackson would have needed approximately 20.03% of the remaining unsecured votes to reach the victory threshold of 51%."} ``` ### Data Fields The dataset comprises the following fields: - `question`: a string containing the question to be answered. - `answer`: a string containing the answer to the corresponding question. ### Data Splits The dataset is split into a training set. The number of rows in each split is as follows: - `train`: 200,035 rows The `DatasetDict` structure for the dataset is as follows: ```python DatasetDict({ 'train': Dataset({ features: ['question', 'answer'], num_rows: 200035 }) }) ``` Each split in the `DatasetDict` contains a `Dataset` object with the specified features and number of rows. ## Dataset Creation Please refer to [Orca-Math: Unlocking the potential of SLMs in Grade School Math](https://arxiv.org/pdf/2402.14830.pdf) for details about the dataset construction. ### Source Data - [Lila](https://huggingface.co/datasets/allenai/lila) - [DMath](https://arxiv.org/ftp/arxiv/papers/2106/2106.15772.pdf) #### Data Collection and Processing  Please refer to [Orca-Math: Unlocking the potential of SLMs in Grade School Math](https://arxiv.org/pdf/2402.14830.pdf) for details about the dataset construction. #### Who are the source data producers?  Microsoft #### Annotation process  We expanded a seed set of questions using Azure GPT-4 Trubo. The answers to those questions are generated using Azure GPT-4 Trubo. #### Personal and Sensitive Information  None ## Bias, Risks, and Limitations  This dataset is in English and contains only math word problems. ## Citation If you find this work useful in your method, you can cite the paper as below: ``` @misc{mitra2024orcamath, title={Orca-Math: Unlocking the potential of SLMs in Grade School Math}, author={Arindam Mitra and Hamed Khanpour and Corby Rosset and Ahmed Awadallah}, year={2024}, eprint={2402.14830}, archivePrefix={arXiv}, primaryClass={cs.CL} } ``` ## Dataset Card Contact [Arindam Mitra](armitra@microsoft.com)

# 数据集卡片  本数据集收录约20万道中小学数学应用题，所有答案均通过Azure GPT-4 Turbo生成。有关数据集构建的详细信息，请参阅论文《Orca-Math: Unlocking the potential of SLMs in Grade School Math》（https://arxiv.org/pdf/2402.14830.pdf）。 ### 数据集概述 - **整理方**：微软（Microsoft） - **自然语言语种**：英语 - **授权协议**：MIT ### 数据集来源  - **代码仓库**：[microsoft/orca-math-word-problems-200k](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k) - **论文**：[Orca-Math: Unlocking the potential of SLMs in Grade School Math](https://arxiv.org/pdf/2402.14830.pdf) ### 适用场景  本数据集旨在提升语言模型的数学解题能力，为语言模型在数学问题求解领域的优异表现提供坚实基础。 ### 不适用场景  本数据集不适用于教育系统或教育机构。 ## 数据集结构 ### 数据示例数据集中的典型条目由一道题目及其对应答案组成。以下为数据集中的一则示例： python {'question': 'In a highly contested election having multiple candidates, Mr. Jackson, one of the losing candidates, received 3,485,782 votes, which accounted for precisely 38.7 percent of all votes. To have achieved a victory, he would have needed to secure at least 51 percent of all votes. Approximately, what percent of the remaining unsecured votes would Mr. Jackson have needed to accumulate to reach this victory threshold?', 'answer': "First, let's find out the total number of votes cast in the election. Since Mr. Jackson received 38.7% of all votes, and that amounted to 3,485,782 votes, we can set up the following equation to find the total number of votes (T): 0.387 * T = 3,485,782 Now, solve for T: T = 3,485,782 / 0.387 T ≈ 9,000,467 votes (total number of votes cast) To win, Mr. Jackson would have needed 51% of the total votes. Let's calculate that amount: 0.51 * T = 0.51 * 9,000,467 0.51 * T ≈ 4,590,238 votes needed to win Now, let's find out how many more votes Mr. Jackson needed to reach this winning threshold: Votes needed to win - Votes Mr. Jackson received = Additional votes needed 4,590,238 - 3,485,782 = 1,104,456 additional votes needed Now, let's find out what percentage of the remaining unsecured votes this number represents. The remaining unsecured votes are the votes that were not for Mr. Jackson, which is 100% - 38.7% = 61.3% of the total votes. 61.3% of the total votes is the remaining unsecured votes: 0.613 * T = 0.613 * 9,000,467 0.613 * T ≈ 5,514,686 votes were unsecured Now, we'll calculate the percentage of these unsecured votes that the additional votes needed represent: (Additional votes needed / Unsecured votes) * 100 = Percentage of unsecured votes needed (1,104,456 / 5,514,686) * 100 ≈ 20.03% So, Mr. Jackson would have needed approximately 20.03% of the remaining unsecured votes to reach the victory threshold of 51%."} ### 数据字段本数据集包含以下字段： - `question`：字符串类型，包含待解答的数学问题 - `answer`：字符串类型，包含对应问题的完整解答 ### 数据拆分本数据集仅划分为训练集，各拆分的样本量如下： - `train`：200035条数据本数据集的`DatasetDict`结构如下： python DatasetDict({ 'train': Dataset({ features: ['question', 'answer'], num_rows: 200035 }) }) `DatasetDict`中的每个拆分均包含一个`Dataset`对象，包含上述指定的字段与样本数量。 ## 数据集构建有关数据集构建的详细细节，请参阅论文《Orca-Math: Unlocking the potential of SLMs in Grade School Math》（https://arxiv.org/pdf/2402.14830.pdf）。 ### 源数据来源 - [Lila](https://huggingface.co/datasets/allenai/lila) - [DMath](https://arxiv.org/ftp/arxiv/papers/2106/2106.15772.pdf) #### 数据收集与处理流程有关数据集构建的详细细节，请参阅上述论文。 #### 源数据生产者原始数据的生产者为微软。 #### 标注流程我们通过Azure GPT-4 Turbo扩充了初始种子问题集，并使用Azure GPT-4 Turbo生成所有题目的答案。 #### 个人与敏感信息本数据集未包含任何个人、敏感或隐私信息。 ## 偏差、风险与局限本数据集仅收录英语数学应用题，存在相应的技术与社会技术局限。 ## 引用格式若您的研究中使用了本数据集，请按以下格式引用该论文： @misc{mitra2024orcamath, title={Orca-Math: Unlocking the potential of SLMs in Grade School Math}, author={Arindam Mitra and Hamed Khanpour and Corby Rosset and Ahmed Awadallah}, year={2024}, eprint={2402.14830}, archivePrefix={arXiv}, primaryClass={cs.CL} } ## 数据集卡片联系人 [Arindam Mitra](armitra@microsoft.com)

提供机构：

maas

创建时间：

2025-07-22

搜集汇总

数据集介绍