HanningZhang/ER-PRM-Data
Training Dataset for Mathematical Reasoning Reward Models
Overview
This dataset is used to train reward models for mathematical reasoning, similar to the Math-Shepherd reward model.
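Dataset Preparation
The dataset is provided in JSON format. Each record contains the following fields:
text: Text describing the math reasoning problem (in the complete example, the problem is followed by its step-by-step solution).
value: A list of reward values, one for each step.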
Example
    [
      {
        "text": "Janet pays $40/hour for 3 hours per week of clarinet lessons..",
        "value": [0.8, 0.6, 0.3]
      },
      {
        "text": "Val cuts a single watermelon into 40 slices..",
        "value": [0.5, 0.7, 0.2]
      }
    ]
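The record layout above can be sanity-checked with a few lines of Python. This is a minimal sketch rather than part of the dataset tooling: the helper name `validate_records` is hypothetical, and the requirement that every reward lies in [0, 1] is an assumption inferred from the examples, not a documented guarantee.

```python
import json

def validate_records(records):
    """Return (index, problem) pairs for records that break the expected layout.

    Hypothetical helper. The [0, 1] range check is an assumption inferred
    from the examples, not something the dataset documents.
    """
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec.get("text"), str):
            problems.append((i, "text must be a string"))
        values = rec.get("value")
        if not isinstance(values, list) or not values:
            problems.append((i, "value must be a non-empty list"))
            continue
        if not all(isinstance(v, (int, float)) and 0.0 <= v <= 1.0 for v in values):
            problems.append((i, "each reward must be a number in [0, 1]"))
    return problems

# The two abbreviated records from the example above.
raw = """[
  {"text": "Janet pays $40/hour for 3 hours per week of clarinet lessons..", "value": [0.8, 0.6, 0.3]},
  {"text": "Val cuts a single watermelon into 40 slices..", "value": [0.5, 0.7, 0.2]}
]"""
records = json.loads(raw)
print(validate_records(records))  # an empty list means no problems were found
```

The same function could be applied to the contents of a local JSON file after loading it with `json.load`.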
Complete Example
    {
      "text": "Janet pays $40/hour for 3 hours per week of clarinet lessons and $28/hour for 5 hours a week of piano lessons. How much more does she spend on piano lessons than clarinet lessons in a year? Step 1: Janet spends 3 hours + 5 hours = <<3+5=8>>8 hours per week on music lessons. киStep 2: She spends 40 * 3 = <<40*3=120>>120 on clarinet lessons per week. киStep 3: She spends 28 * 5 = <<28*5=140>>140 on piano lessons per week. киStep 4: Janet spends 120 + 140 = <<120+140=260>>260 on music lessons per week. киStep 5: She spends 260 * 52 = <<260*52=13520>>13520 on music lessons in a year. The answer is: 13520 ки",
      "value": [1.0, 0.8, 0.6, 0.7, 1.0]
    }
Step Separator
Each step ends with the ки tag, which serves as the separator between steps.
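To illustrate how the separator lines up with the reward list, the sketch below splits a text on ки and pairs each segment with its reward. The helper name `split_steps` and the toy problem are made up for illustration. The first segment also carries the question, because the question and Step 1 share the text before the first tag.

```python
STEP_TAG = "ки"

def split_steps(text, values):
    """Pair each ки-terminated segment of `text` with its reward (hypothetical helper)."""
    segments = text.split(STEP_TAG)
    # The text ends with the tag, so the final piece is empty or whitespace.
    if segments and not segments[-1].strip():
        segments = segments[:-1]
    if len(segments) != len(values):
        raise ValueError(f"{len(segments)} steps but {len(values)} rewards")
    return [(seg.strip(), v) for seg, v in zip(segments, values)]

# Toy record, invented for this sketch (not taken from the dataset).
text = "What is 2 + 3 + 4? Step 1: 2 + 3 = 5. киStep 2: 5 + 4 = 9. The answer is: 9 ки"
pairs = split_steps(text, [0.9, 1.0])
for step, reward in pairs:
    print(reward, step)
```

Raising on a length mismatch makes it easy to spot records where the number of tags and the number of rewards disagree.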
Dataset Sample
A sample dataset, sample_dataset.json, is provided for reference.
Full Dataset
The full dataset is available as HanningZhang/ER-PRM-Data on Hugging Face.
Training Code
Run the following bash script to start training:
    bash kl_math_reward_ce.sh
Training uses the cross-entropy loss.
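The training script is not reproduced here, so the following is only one plausible reading of a cross-entropy loss with soft per-step rewards, not the repository's implementation. It assumes that the reward of a step is used as the target probability of the + token in a two-way softmax over the + and - logits at the step tag. The function name `soft_label_cross_entropy` and the example numbers are hypothetical.

```python
import math

def soft_label_cross_entropy(plus_logit, minus_logit, reward):
    """Cross-entropy between softmax([plus, minus]) and the soft target (reward, 1 - reward).

    Assumption for illustration only: the actual loss in the training
    script may be defined differently.
    """
    # Numerically stable log-softmax over the two candidate logits.
    m = max(plus_logit, minus_logit)
    log_z = m + math.log(math.exp(plus_logit - m) + math.exp(minus_logit - m))
    log_p_plus = plus_logit - log_z
    log_p_minus = minus_logit - log_z
    return -(reward * log_p_plus + (1.0 - reward) * log_p_minus)

# A confident "+" prediction is penalized less when the target reward is high.
print(soft_label_cross_entropy(2.0, -1.0, 0.8))
print(soft_label_cross_entropy(2.0, -1.0, 0.2))
```

With equal logits and a reward of 0.5 the loss equals log 2, the value expected from an uninformed two-way prediction.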
Evaluation
The evaluation strategy is similar to that of Math-Shepherd. A sample evaluation script is provided:
    from transformers import AutoTokenizer
    from transformers import AutoModelForCausalLM
    import torch

    good_token = "+"
    bad_token = "-"
    step_tag = "ки"

    path = "HanningZhang/Llama3.1-Math-PRM"
    tokenizer = AutoTokenizer.from_pretrained(path)
    # Token ids of " +" and " -", the two labels the PRM chooses between.
    plus_tag_id = tokenizer.encode(" +")[-1]
    minus_tag_id = tokenizer.encode(" -")[-1]

    candidate_tokens = [plus_tag_id, minus_tag_id]
    step_tag_id = tokenizer.encode(f"{step_tag}")[-1]  # 12902
    model = AutoModelForCausalLM.from_pretrained(path).eval()

    question = """Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers market?"""
    output1 = """Step 1: Janet's ducks lay 16 eggs per day. киStep 2: She eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left. киStep 3: She bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left. киStep 4: She sells the remainder at the farmers market daily for $2 per fresh duck egg, so she makes 9 * $2 = $18 every day at the farmers market. The answer is: 18 ки"""  # 18 is right
    output2 = """Step 1: Janet's ducks lay 16 eggs per day. киStep 2: She eats three for breakfast every morning, so she has 16 - 3 = 13 eggs left. киStep 3: She bakes muffins for her friends every day with four eggs, so she has 13 - 4 = 9 eggs left. киStep 4: She sells the remainder at the farmers market daily for $2 per fresh duck egg, so she makes 9 * $2 = $17 every day at the farmers market. The answer is: 17 ки"""  # 17 is wrong

    for output in [output1, output2]:
        input_for_prm = f"{question} {output}"
        input_id = torch.tensor([tokenizer.encode(input_for_prm)])

        with torch.no_grad():
            # Keep only the logits of the two candidate tokens at every position.
            logits = model(input_id).logits[:, :, candidate_tokens]
            # Probability of "+" relative to "-" at every position.
            scores = logits.softmax(dim=-1)[:, :, 0]
            # Read the score at each step-tag position: one score per step.
            step_scores = scores[input_id == step_tag_id]
            print(step_scores)
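The script above prints one score per step. To rank candidate solutions, those scores are usually reduced to a single number per solution. The sketch below shows two common reductions, the minimum and the product of the step scores. Which reduction this repository uses is not stated here, so treat the choice as an assumption. The helper name `solution_score` and the score values are invented for illustration.

```python
import math

def solution_score(step_scores, how="min"):
    """Reduce per-step scores to one solution-level score (hypothetical helper)."""
    scores = [float(s) for s in step_scores]
    if not scores:
        raise ValueError("need at least one step score")
    if how == "min":
        return min(scores)  # a solution is only as good as its weakest step
    if how == "prod":
        return math.prod(scores)  # treats the steps as independent successes
    raise ValueError(f"unknown reduction: {how}")

# Made-up step scores for two candidate solutions to the same question.
candidates = {
    "output1": [0.99, 0.98, 0.97, 0.95],
    "output2": [0.99, 0.98, 0.97, 0.10],
}
best = max(candidates, key=lambda name: solution_score(candidates[name]))
print(best)
```

Since the helper converts each entry with `float`, a one-dimensional tensor of step scores can be passed to it as `step_scores.tolist()`.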




