HanningZhang/ER-PRM-Data

GitHub, updated 2024-11-04, indexed 2024-11-28

Download link:

https://github.com/hanningzhang/prm


Resource description:

This dataset is designed for training reward models for mathematical reasoning. Each entry has a text field, which holds a math problem together with its step-by-step solution, and a value field, which holds a list of reward values. The value list has exactly one entry per step marker in the text, and the model is trained to predict these process rewards.

Created:

2024-10-31

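Original information summary

Training dataset for mathematical reasoning reward models

Overview

This dataset is used to train reward models for mathematical reasoning, similar to the Math-Shepherd reward model.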

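Dataset preparation

The dataset is provided in JSON format with the following fields:

text: the text describing the mathematical reasoning problem.
value: a list of reward values, one per step.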

Example

```json
[
  {
    "text": "Janet pays $40/hour for 3 hours per week of clarinet lessons..",
    "value": [0.8, 0.6, 0.3]
  },
  {
    "text": "Val cuts a single watermelon into 40 slices..",
    "value": [0.5, 0.7, 0.2]
  }
]
```

Full example

```json
{
  "text": "Janet pays $40/hour for 3 hours per week of clarinet lessons and $28/hour for 5 hours a week of piano lessons. How much more does she spend on piano lessons than clarinet lessons in a year? Step 1: Janet spends 3 hours + 5 hours = <<3+5=8>>8 hours per week on music lessons. киStep 2: She spends 40 * 3 = <<40*3=120>>120 on clarinet lessons per week. киStep 3: She spends 28 * 5 = <<28*5=140>>140 on piano lessons per week. киStep 4: Janet spends 120 + 140 = <<120+140=260>>260 on music lessons per week. киStep 5: She spends 260 * 52 = <<260*52=13520>>13520 on music lessons in a year. The answer is: 13520 ки",
  "value": [1.0, 0.8, 0.6, 0.7, 1.0]
}
```

Dataset delimiter

Each step is terminated by the ки tag, which serves as the step delimiter.
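To make the delimiter convention concrete, the following minimal sketch splits an entry on the ки tag and pairs each step with its reward. The helper name `pair_steps_with_rewards` and the toy entry are hypothetical and are not part of the repository. Note that the first piece also carries the question text, because the question precedes Step 1 in the same string.

```python
STEP_TAG = "ки"  # step delimiter used by the dataset


def pair_steps_with_rewards(entry):
    """Return (step_text, reward) pairs for one dataset entry (hypothetical helper)."""
    text, values = entry["text"], entry["value"]
    # Every step ends with the tag, so the piece after the last tag is empty
    # (or trailing whitespace) and is dropped.
    pieces = text.split(STEP_TAG)
    steps = [piece.strip() for piece in pieces[:-1]]
    if len(steps) != len(values):
        raise ValueError(f"{len(steps)} step tags but {len(values)} reward values")
    return list(zip(steps, values))


# Toy entry invented for illustration; it is not taken from the dataset.
entry = {
    "text": "What is 2 + 3 * 4? Step 1: 3 * 4 = 12. ки Step 2: 2 + 12 = 14. The answer is: 14 ки",
    "value": [0.9, 1.0],
}
pairs = pair_steps_with_rewards(entry)
for step, reward in pairs:
    print(f"{reward:.1f}  {step}")
```
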

Dataset sample

A sample dataset, sample_dataset.json, is provided for reference.

Full dataset

The full dataset is available as HanningZhang/ER-PRM-Data on Hugging Face.

Training code

Run the following bash file to train:

```bash
bash kl_math_reward_ce.sh
```

Training uses the cross-entropy loss.
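The source states only that training uses cross-entropy. As an illustration of how a per-step reward in [0, 1] could act as a training target, here is a minimal sketch that treats the reward as a soft label over the two candidate tokens `+` and `-`. Restricting the softmax to two logits and using soft labels are both assumptions made for this sketch; the actual training script may differ. The function names are hypothetical.

```python
import math


def soft_label_cross_entropy(plus_logit, minus_logit, reward):
    """Cross-entropy between the target (reward, 1 - reward) and the
    softmax over the '+' and '-' logits (assumed formulation)."""
    # Numerically stable log-softmax over the two candidate logits.
    m = max(plus_logit, minus_logit)
    log_z = m + math.log(math.exp(plus_logit - m) + math.exp(minus_logit - m))
    log_p_plus = plus_logit - log_z
    log_p_minus = minus_logit - log_z
    return -(reward * log_p_plus + (1.0 - reward) * log_p_minus)


def sequence_loss(step_logits, rewards):
    """Mean loss over the step-tag positions of one sequence.

    step_logits is a list of (plus_logit, minus_logit) pairs, one per step tag.
    """
    losses = [
        soft_label_cross_entropy(plus, minus, reward)
        for (plus, minus), reward in zip(step_logits, rewards)
    ]
    return sum(losses) / len(losses)


# Toy logits and rewards invented for illustration.
loss = sequence_loss([(2.0, -1.0), (0.0, 0.0)], [1.0, 0.5])
print(loss)
```

With equal logits and a reward of 0.5, the per-step loss equals log 2, which is the entropy of the target itself, so the loss cannot go lower for that step.
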

Evaluation

The evaluation strategy is similar to Math-Shepherd's. A sample evaluation script is provided:

```python
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
import torch

good_token = '+'
bad_token = '-'
step_tag = 'ки'

path = "HanningZhang/Llama3.1-Math-PRM"
tokenizer = AutoTokenizer.from_pretrained(path)
# Encode with a leading space and keep the last id, which skips any special prefix token.
plus_tag_id = tokenizer.encode(' +')[-1]
minus_tag_id = tokenizer.encode(' -')[-1]

candidate_tokens = [plus_tag_id, minus_tag_id]
step_tag_id = tokenizer.encode(f"{step_tag}")[-1] # 12902
model = AutoModelForCausalLM.from_pretrained(path).eval()

question = """Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?"""
output1 = """Step 1: Janet's ducks lay 16 eggs per day. ки\nStep 2: She eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left. ки\nStep 3: She bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left. ки\nStep 4: She sells the remainder at the farmers' market daily for $2 per fresh duck egg, so she makes 9 * $2 = $18 every day at the farmers' market. The answer is: 18 ки""" # 18 is right
output2 = """Step 1: Janet's ducks lay 16 eggs per day. ки\nStep 2: She eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left. ки\nStep 3: She bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left. ки\nStep 4: She sells the remainder at the farmers' market daily for $2 per fresh duck egg, so she makes 9 * $2 = $17 every day at the farmers' market. The answer is: 17 ки""" # 17 is wrong

for output in [output1, output2]:
    input_for_prm = f"{question} {output}"
    input_id = torch.tensor([tokenizer.encode(input_for_prm)])

    with torch.no_grad():
        # Keep only the logits of the two candidate tokens at every position.
        logits = model(input_id).logits[:, :, candidate_tokens]
        # Softmax over the two candidates; index 0 is the probability of '+'.
        scores = logits.softmax(dim=-1)[:, :, 0]
        # Read the score at each step-tag position.
        step_scores = scores[input_id == step_tag_id]
        print(step_scores)
```
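The scoring logic in the evaluation script can be followed without loading a model. The sketch below reproduces it in plain Python on made-up numbers: at each position whose token is the step tag, it takes a two-way softmax over the `+` and `-` logits and keeps the probability of `+`. The token ids, the logits, and the value of `STEP_TAG_ID` are invented for illustration, and `step_scores_from_logits` is a hypothetical name.

```python
import math


def step_scores_from_logits(token_ids, logits, step_tag_id):
    """Return P('+') at every step-tag position.

    logits[i] is a (plus_logit, minus_logit) pair for position i.
    """
    scores = []
    for token_id, (plus, minus) in zip(token_ids, logits):
        if token_id != step_tag_id:
            continue  # only step-tag positions carry a step score
        # A softmax over two logits reduces to a sigmoid of their difference.
        scores.append(1.0 / (1.0 + math.exp(minus - plus)))
    return scores


STEP_TAG_ID = 99  # hypothetical id, not the real tokenizer's id
token_ids = [5, 7, 99, 8, 99]
logits = [(0.0, 0.0), (0.1, 0.3), (3.0, -1.0), (0.0, 0.0), (-2.0, 2.0)]

scores = step_scores_from_logits(token_ids, logits, STEP_TAG_ID)
print([round(s, 3) for s in scores])
```

In this toy input the first step tag strongly favours `+` and the second strongly favours `-`, which mirrors how the real script would separate a correct step from an incorrect one.
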

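Collected summary

Dataset introduction

Construction

To build ER-PRM-Data, the researchers assembled text containing mathematical reasoning steps and assigned a reward value to each step. Each sample consists of a text describing a math problem and a list of numeric values, and the length of the list equals the number of step markers in the text. The dataset is intended for training reward models that predict and evaluate every step of a mathematical reasoning process.

Features

ER-PRM-Data is characterized by its structured data format and explicit reward signal. Each sample contains detailed reasoning steps together with their reward values, which gives the model a well-defined training target. The step markers are also unambiguous, so the model can locate and process steps easily, which helps training efficiency and model accuracy.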

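Usage

To use ER-PRM-Data, prepare a JSON file in the specified format, containing the mathematical reasoning text and the corresponding list of reward values. Then run the provided training script, which optimizes a cross-entropy loss. After training, use the sample evaluation code to test the model and check its performance on mathematical reasoning tasks.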
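Before training on a custom file, it can help to check that every entry follows the format described above. The following is a minimal sketch of such a check; `validate_entries` is a hypothetical helper, not part of the repository, and the inline JSON is invented for illustration. In practice the same function could be applied to the parsed contents of a file such as sample_dataset.json.

```python
import json

STEP_TAG = "ки"  # step delimiter used by the dataset


def validate_entries(entries):
    """Return a list of (index, problem) tuples; an empty list means the data is well-formed."""
    problems = []
    for i, entry in enumerate(entries):
        text, value = entry.get("text"), entry.get("value")
        if not isinstance(text, str) or not isinstance(value, list):
            problems.append((i, "missing or mistyped 'text'/'value' field"))
            continue
        n_tags = text.count(STEP_TAG)
        if n_tags != len(value):
            problems.append((i, f"{n_tags} step tags vs {len(value)} values"))
    return problems


# Two toy entries: the first is well-formed, the second has one value too many.
raw = json.dumps(
    [
        {"text": "Q? Step 1: a. ки Step 2: b. ки", "value": [0.7, 0.4]},
        {"text": "Q? Step 1: a. ки", "value": [0.7, 0.4]},
    ],
    ensure_ascii=False,
)
entries = json.loads(raw)
problems = validate_entries(entries)
print(problems)
```
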

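Background and challenges

Background

ER-PRM-Data was created by the HanningZhang team and focuses on reward-model training for mathematical reasoning. Its core research question is how to train a model to predict and evaluate the reward of each step in a mathematical reasoning process, and thereby improve the efficiency and accuracy of math problem solving. The dataset is also meant to advance applications of artificial intelligence in education, particularly automated tutoring and assessment systems.

Current challenges

The main challenges in building ER-PRM-Data are as follows. First, each step's reward value must accurately reflect the correctness and logical soundness of the reasoning, which is complex and requires careful tuning. Second, the dataset needs sufficient scale and diversity, covering many types of math problems and solution steps, so that trained models generalize. Third, annotation is itself demanding and requires domain expertise and time.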

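Common scenarios

Classic use case

ER-PRM-Data is mainly used to train reward models for mathematical reasoning. By providing text with step-by-step reasoning and the corresponding reward values, the dataset helps a model learn to evaluate the steps of a solution. For example, one sample works out Janet's yearly spending on music lessons: each step is delimited by 'ки' and has an associated reward value, which guides the model toward correct reasoning paths.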

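Derived work

The release of ER-PRM-Data has prompted related research, particularly at the intersection of mathematics education and artificial intelligence. For example, researchers have built new evaluation models for mathematical reasoning on top of the dataset, further improving accuracy and robustness. The dataset has also been used to explore how artificial intelligence can be applied more effectively in educational settings, advancing personalized learning and intelligent tutoring systems.

Recent research on the dataset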