Math-Shepherd

Name: Math-Shepherd
Creator: maas
Published: 2026-01-06 16:17:45
License: not specified

Source: ModelScope community (updated 2026-01-06, indexed 2024-11-23)

Download link:

https://modelscope.cn/datasets/AI-ModelScope/Math-Shepherd

Description:

# Dataset Card for Math-Shepherd

Project Page: [Math-Shepherd](https://rain-motion-6ec.notion.site/Math-Shepherd-A-Label-Free-Step-by-Step-Verifier-for-LLMs-in-Mathematical-Reasoning-41b6e73c860840e08697d347f8889bac#08e86c6d44c4452ba0b78c7aaea5f4f7)

Paper: https://arxiv.org/pdf/2312.08935.pdf

# Data Loading

```
from datasets import load_dataset
dataset = load_dataset("peiyi9979/Math-Shepherd")
```

# Data Instance

Every instance consists of three data fields: "input," "label," and "task".

1. "input": problem + step-by-step solution, e.g.,
```
If Buzz bought a pizza with 78 slices at a restaurant and then decided to share it with the waiter in the ratio of 5:8, with Buzz's ratio being 5, what's twenty less the number of slices of pizza that the waiter ate?
Step 1: The total ratio representing the pizza is 5+8 = <<5+8=13>>13. ки
Step 2: The waiter ate 13 x 8 / 13 = <<13*8/13=6>>6 slices of the pizza. ки
Step 3: Buzz ate 78 - 6 = <<78-6=72>>72 slices of the pizza. ки
Step 4: The waiter ate 20 less than the number of slices that Buzz ate which is 72 - 20 = 52. ки
Step 5: The waiter ate 52 slices of the pizza. The answer is: 52 ки
```

2. "label": problem + step-by-step solution with automatic label, e.g.,
```
If Buzz bought a pizza with 78 slices at a restaurant and then decided to share it with the waiter in the ratio of 5:8, with Buzz's ratio being 5, what's twenty less the number of slices of pizza that the waiter ate?
Step 1: The total ratio representing the pizza is 5+8 = <<5+8=13>>13. +
Step 2: The waiter ate 13 x 8 / 13 = <<13*8/13=6>>6 slices of the pizza. -
Step 3: Buzz ate 78 - 6 = <<78-6=72>>72 slices of the pizza. -
Step 4: The waiter ate 20 less than the number of slices that Buzz ate which is 72 - 20 = 52. -
Step 5: The waiter ate 52 slices of the pizza. The answer is: 52 -
```

3. "task": `GSM8K` or `MATH`.

NOTE:

"`ки`" serves as a unique token denoting the position for predicting the step score.

"`+`" signifies a good step, as it has the potential to lead towards the correct answer.

"`-`" denotes a bad step.

When we train PRMs, we only compute the loss of the positions of `ки`.

# Models:

We utilized internal code for step-wise PPO training, which cannot be open-sourced. We hope for your understanding. We provide the checkpoints of SFT, PRM, and RL models to help everyone reproduce our results.

- Mistral-7b-sft: https://huggingface.co/peiyi9979/mistral-7b-sft
- Mistral-7b-prm: https://huggingface.co/peiyi9979/math-shepherd-mistral-7b-prm
- Mistral-7b-rl: https://huggingface.co/peiyi9979/math-shepherd-mistral-7b-rl

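The relationship between the "input" and "label" fields can be illustrated with a small sketch. It assumes that "label" is the same string as "input" with every `ки` tag replaced by a single `+` or `-` character, which is how the example instance in the card reads; the card does not state this as a guarantee. The helper names `split_steps` and `step_score_mask` and the toy instance are hypothetical, and the whitespace tokenization is only a stand-in for a real tokenizer, which may split `ки` differently.

```python
STEP_TAG = "ки"  # step-score position marker described in the card


def split_steps(input_text, label_text):
    """Pair each tagged segment of `input` with its '+'/'-' mark from `label`.

    Assumption: `label` equals `input` with every step tag replaced by a
    single '+' or '-' character, as in the card's example instance.
    """
    segments = input_text.split(STEP_TAG)[:-1]  # drop text after the last tag
    pairs, pos = [], 0
    for seg in segments:
        end = pos + len(seg)
        # The label string should repeat the segment, then carry one mark.
        if label_text[pos:end] != seg or label_text[end:end + 1] not in ("+", "-"):
            raise ValueError(f"input and label diverge near offset {pos}")
        pairs.append((seg.strip(), label_text[end]))
        pos = end + 1  # skip the one-character mark
    return pairs


def step_score_mask(tokens):
    """Return True only at step-tag positions, where a PRM loss would apply."""
    return [tok == STEP_TAG for tok in tokens]


# Toy instance written for illustration; it is not taken from the dataset.
example_input = "What is 2+3? Step 1: 2+3 = 5. ки Step 2: The answer is: 6 ки"
example_label = "What is 2+3? Step 1: 2+3 = 5. + Step 2: The answer is: 6 -"

for text, mark in split_steps(example_input, example_label):
    print(mark, text)

# Whitespace tokens stand in for real tokenizer output here.
print(step_score_mask(example_input.split()))
```

Note that the first segment includes the problem statement together with Step 1, since the tag only marks where each step ends. The mask mirrors the card's remark that the loss is computed only at the `ки` positions; in an actual training setup the same idea would be applied to token IDs rather than whitespace-separated words.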

Provider:

maas

Created:

2024-11-20
