sibasmarakp/Qwen2.5-Math-7B-Instruct-Qwen2.5-Math-PRM-7B-best_of_n-completions
Source: Hugging Face · Updated 2026-03-28 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/sibasmarakp/Qwen2.5-Math-7B-Instruct-Qwen2.5-Math-PRM-7B-best_of_n-completions
Resource overview:
---
dataset_info:
- config_name: minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
features:
- name: problem
dtype: string
- name: answer
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 5567372
num_examples: 272
download_size: 5227662
dataset_size: 5567372
- config_name: minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last
features:
- name: problem
dtype: string
- name: answer
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 5575761
num_examples: 272
download_size: 5209226
dataset_size: 5575761
- config_name: minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last
features:
- name: problem
dtype: string
- name: answer
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 5523999
num_examples: 272
download_size: 5142967
dataset_size: 5523999
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
features:
- name: problem
dtype: string
- name: solution
dtype: string
- name: answer
dtype: string
- name: subject
dtype: string
- name: level
dtype: int64
- name: unique_id
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 8692078
num_examples: 500
download_size: 8003652
dataset_size: 8692078
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2225
dataset_size: 128
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last
features:
- name: problem
dtype: string
- name: solution
dtype: string
- name: answer
dtype: string
- name: subject
dtype: string
- name: level
dtype: int64
- name: unique_id
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 8716071
num_examples: 500
download_size: 8034556
dataset_size: 8716071
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2219
dataset_size: 128
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last
features:
- name: problem
dtype: string
- name: solution
dtype: string
- name: answer
dtype: string
- name: subject
dtype: string
- name: level
dtype: int64
- name: unique_id
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 8634089
num_examples: 500
download_size: 7942429
dataset_size: 8634089
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2221
dataset_size: 128
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
features:
- name: id
dtype: int64
- name: problem
dtype: string
- name: solution
list: string
- name: answer
list: string
- name: context
dtype: 'null'
- name: image_1
dtype: 'null'
- name: image_2
dtype: 'null'
- name: image_3
dtype: 'null'
- name: image_4
dtype: 'null'
- name: image_5
dtype: 'null'
- name: image_6
dtype: 'null'
- name: image_7
dtype: 'null'
- name: image_8
dtype: 'null'
- name: image_9
dtype: 'null'
- name: modality
dtype: string
- name: difficulty
dtype: string
- name: is_multiple_answer
dtype: bool
- name: unit
dtype: string
- name: answer_type
dtype: string
- name: error
dtype: string
- name: question_type
dtype: string
- name: subfield
dtype: string
- name: subject
dtype: string
- name: language
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 18274797
num_examples: 674
download_size: 17197166
dataset_size: 18274797
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2231
dataset_size: 128
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last
features:
- name: id
dtype: int64
- name: problem
dtype: string
- name: solution
list: string
- name: answer
list: string
- name: context
dtype: 'null'
- name: image_1
dtype: 'null'
- name: image_2
dtype: 'null'
- name: image_3
dtype: 'null'
- name: image_4
dtype: 'null'
- name: image_5
dtype: 'null'
- name: image_6
dtype: 'null'
- name: image_7
dtype: 'null'
- name: image_8
dtype: 'null'
- name: image_9
dtype: 'null'
- name: modality
dtype: string
- name: difficulty
dtype: string
- name: is_multiple_answer
dtype: bool
- name: unit
dtype: string
- name: answer_type
dtype: string
- name: error
dtype: string
- name: question_type
dtype: string
- name: subfield
dtype: string
- name: subject
dtype: string
- name: language
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 18174947
num_examples: 674
download_size: 17100163
dataset_size: 18174947
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2223
dataset_size: 128
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last
features:
- name: id
dtype: int64
- name: problem
dtype: string
- name: solution
list: string
- name: answer
list: string
- name: context
dtype: 'null'
- name: image_1
dtype: 'null'
- name: image_2
dtype: 'null'
- name: image_3
dtype: 'null'
- name: image_4
dtype: 'null'
- name: image_5
dtype: 'null'
- name: image_6
dtype: 'null'
- name: image_7
dtype: 'null'
- name: image_8
dtype: 'null'
- name: image_9
dtype: 'null'
- name: modality
dtype: string
- name: difficulty
dtype: string
- name: is_multiple_answer
dtype: bool
- name: unit
dtype: string
- name: answer_type
dtype: string
- name: error
dtype: string
- name: question_type
dtype: string
- name: subfield
dtype: string
- name: subject
dtype: string
- name: language
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 18429551
num_examples: 674
download_size: 17355301
dataset_size: 18429551
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2231
dataset_size: 128
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
features:
- name: problem
dtype: string
- name: answer
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 5567372
num_examples: 272
download_size: 5227662
dataset_size: 5567372
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2229
dataset_size: 128
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last
features:
- name: problem
dtype: string
- name: answer
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 5575761
num_examples: 272
download_size: 5209226
dataset_size: 5575761
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2226
dataset_size: 128
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last
features:
- name: problem
dtype: string
- name: answer
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 5523999
num_examples: 272
download_size: 5142967
dataset_size: 5523999
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2227
dataset_size: 128
configs:
- config_name: minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
data_files:
- split: train
path: minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last/train-*
- config_name: minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last
data_files:
- split: train
path: minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last/train-*
- config_name: minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last
data_files:
- split: train
path: minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last/train-*
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
data_files:
- split: train
path: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last/train-*
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals/train-*
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last
data_files:
- split: train
path: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last/train-*
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals/train-*
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last
data_files:
- split: train
path: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last/train-*
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals/train-*
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
data_files:
- split: train
path: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last/train-*
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals/train-*
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last
data_files:
- split: train
path: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last/train-*
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals/train-*
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last
data_files:
- split: train
path: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last/train-*
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals/train-*
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
data_files:
- split: train
path: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last/train-*
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals/train-*
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last
data_files:
- split: train
path: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last/train-*
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals/train-*
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last
data_files:
- split: train
path: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last/train-*
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals/train-*
---
Provider:
sibasmarakp
Collected summary
Dataset introduction

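Construction
Improving the problem-solving ability of large language models is an active research topic in mathematical reasoning. This dataset uses Qwen2.5-Math-7B-Instruct to generate multiple completions per problem on several math benchmarks, then scores and aggregates them with Qwen2.5-Math-PRM-7B. The problems come from MinervaMath, MATH500, and OlympiadBench. Each problem is sampled with temperature 0.7 and top-p 0.8 to produce n = 8 candidate solutions (per the `T-0.7`, `top_p-0.8`, and `n-8` parts of the config names), the reward model assigns each candidate a quality score, and the final prediction is chosen under several aggregation strategies.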
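The selection step described above can be sketched as follows. This is a minimal illustration, not the pipeline that produced the dataset: it assumes that `agg_strategy-last` means taking the last per-step score of each completion, and that the `naive`, `weighted`, and `maj` predictions correspond to highest-scoring completion, score-weighted vote, and plain majority vote. The function names and the toy answers are hypothetical.

```python
from collections import Counter, defaultdict


def aggregate_last(step_scores):
    """Collapse per-step scores into one score per completion (assumed 'last' strategy)."""
    return [steps[-1] for steps in step_scores]


def pick_naive(answers, agg_scores):
    """Best-of-n: return the answer of the single highest-scoring completion."""
    best = max(range(len(answers)), key=lambda i: agg_scores[i])
    return answers[best]


def pick_weighted(answers, agg_scores):
    """Weighted vote: sum scores over identical answers and return the heaviest."""
    totals = defaultdict(float)
    for answer, score in zip(answers, agg_scores):
        totals[answer] += score
    return max(totals, key=totals.get)


def pick_majority(answers):
    """Majority vote: return the most frequent answer, ignoring scores."""
    return Counter(answers).most_common(1)[0][0]


# Toy example: three completions with illustrative final answers and step scores.
answers = ["42", "41", "41"]
step_scores = [[0.9, 0.95], [0.8, 0.5], [0.7, 0.5]]
agg = aggregate_last(step_scores)
print(pick_naive(answers, agg), pick_weighted(answers, agg), pick_majority(answers))
```

In the toy example the single best-scored completion disagrees with the other two, so naive selection and the two voting schemes return different answers, which is the kind of disagreement the `pred_*@n` columns make it possible to study.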
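Features
The dataset's defining feature is that it records, in structured form, the multiple reasoning paths a model explored and how each was evaluated. Each sample holds the original problem and reference answer together with the generated completions (`completions`), their score lists (`scores`, `agg_scores`), and the predictions aggregated by weighted voting, majority voting, and naive selection (`pred_weighted@n`, `pred_maj@n`, `pred_naive@n`). This supports analysis of model uncertainty, answer consistency, and the effectiveness of the scoring mechanism, and gives experimental material for studying ensembling and self-improvement in mathematical reasoning.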
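Usage
Researchers can use the dataset to analyze how different aggregation strategies affect final-answer accuracy, or to examine how reward-model scores relate to answer correctness. The completions and scores can be used directly to train or evaluate new answer-selection or ensembling algorithms. Loading a specific config gives access to a particular benchmark and seed; the `pred_weighted@n` and `pred_maj@n` fields allow comparison across sample budgets, and the `--evals` configs supply accuracy metrics (`acc_naive`, `acc_weighted`, `acc_maj`) for quantitative study of model behavior.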
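A rough sketch of loading one config and scoring a prediction column is given below. The config-name pattern and column names are taken from the dataset card; the helper names are hypothetical, and `load_config` assumes the Hugging Face `datasets` package and network access. The accuracy function uses plain string equality, which is an assumption: stored predictions may need normalising first, proper evaluation normally relies on a math-equivalence checker, and the OlympiadBench configs store `answer` as a list rather than a string.

```python
REPO = "sibasmarakp/Qwen2.5-Math-7B-Instruct-Qwen2.5-Math-PRM-7B-best_of_n-completions"


def config_name(benchmark, seed, evals=False):
    """Build a config name following the pattern used in the dataset card."""
    name = f"{benchmark}--T-0.7--top_p-0.8--n-8--seed-{seed}--agg_strategy-last"
    return name + "--evals" if evals else name


def load_config(benchmark, seed, evals=False):
    """Download one config; needs the `datasets` package and network access."""
    from datasets import load_dataset  # lazy import so the other helpers work offline

    return load_dataset(REPO, config_name(benchmark, seed, evals), split="train")


def exact_match_accuracy(rows, strategy, n):
    """Fraction of rows whose pred_{strategy}@{n} equals the reference answer.

    Assumption: whitespace-stripped string equality stands in for a real
    math-equivalence check.
    """
    column = f"pred_{strategy}@{n}"
    rows = list(rows)
    hits = sum(str(r[column]).strip() == str(r["answer"]).strip() for r in rows)
    return hits / len(rows) if rows else 0.0


# Offline toy rows (made up) that mimic the column layout.
toy_rows = [
    {"answer": "4", "pred_maj@4": "4", "pred_weighted@4": "4"},
    {"answer": "7", "pred_maj@4": "5", "pred_weighted@4": "7"},
]
print(exact_match_accuracy(toy_rows, "maj", 4))
print(exact_match_accuracy(toy_rows, "weighted", 4))
```

With network access, `load_config("rebuttal-MATH500", seed=0)` would return the per-problem rows, and passing `evals=True` would select the matching `--evals` config instead.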
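Background and challenges
Background overview
Qwen2.5-Math-7B-Instruct-Qwen2.5-Math-PRM-7B-best_of_n-completions is a dataset for evaluating the mathematical reasoning of large language models, built to examine systematically how a model performs on hard math problems. It is published on Hugging Face by the user sibasmarakp and draws on the Qwen2.5-Math models and on the MinervaMath, MATH500, and OlympiadBench benchmarks. Multiple candidate answers are generated per problem and ranked with a process reward model (PRM), and the central research question is how to improve the accuracy and reliability of model output. The dataset reflects a shift in mathematical-reasoning evaluation from matching a single answer to selecting the best among multiple reasoning paths, and it provides data for studying the interpretability and robustness of a model's decision process.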
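Current challenges
The dataset targets a core problem in mathematical reasoning: unstable model output and a high error rate, which it addresses by generating multiple answers and using a scoring mechanism to select the final prediction. Building such a dataset involves several difficulties. First, the problems span algebra, geometry, olympiad-level questions, and other areas, so diversity and difficulty stratification must be maintained. Second, sampling multiple candidates has to balance exploration against logical consistency and avoid invalid or duplicate outputs. Third, the reward model depends on high-quality training supervision, which is costly to obtain and can introduce bias. Finally, comparing aggregation strategies such as weighted voting and majority voting requires checking robustness across several seeds (the configs cover seeds 0, 1, and 2) before drawing conclusions.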
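The seed-robustness check mentioned above can be approximated by reporting the mean and spread of an accuracy figure across seeds. The sketch below is illustrative only: the helper name is hypothetical and the numbers are placeholders, not results from the dataset.

```python
from statistics import mean, stdev


def summarize_across_seeds(acc_by_seed):
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    values = list(acc_by_seed.values())
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


# Placeholder accuracies keyed by seed; for illustration only.
toy = {0: 0.50, 1: 0.54, 2: 0.52}
avg, spread = summarize_across_seeds(toy)
print(f"{avg:.3f} +/- {spread:.3f}")
```

With only three seeds the standard deviation is a coarse estimate, so small gaps between strategies are best read as suggestive.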
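Common scenarios
Classic use cases
In mathematical reasoning and large language model evaluation, the dataset supplies sampled completions and their scores for many math problems, a resource for studying model behavior on hard mathematical tasks. Its typical use is to evaluate and compare aggregation strategies (weighted voting, majority voting, and naive selection) for their effect on accuracy and robustness, particularly on difficult problems such as those in MATH500 and olympiad competitions.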
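One way to make that comparison concrete is to read the rows of an `--evals` config, which per the dataset card carry the columns `n`, `acc_naive`, `acc_weighted`, and `acc_maj`, and report the strongest strategy at each sample budget. The function name and the toy values below are hypothetical, and the assumption that `n` takes the values 1, 2, 4, and 8 is inferred from the `pred_*@n` column names.

```python
def best_strategy_per_n(eval_rows):
    """Map each sample budget n to the (strategy, accuracy) pair with the highest accuracy."""
    strategies = ("naive", "weighted", "maj")
    result = {}
    for row in eval_rows:
        scores = {s: row[f"acc_{s}"] for s in strategies}
        winner = max(scores, key=scores.get)  # ties resolve to the first strategy listed
        result[row["n"]] = (winner, scores[winner])
    return result


# Placeholder rows shaped like an `--evals` config; the values are made up.
toy_evals = [
    {"n": 1, "acc_naive": 0.40, "acc_weighted": 0.40, "acc_maj": 0.40},
    {"n": 8, "acc_naive": 0.45, "acc_weighted": 0.50, "acc_maj": 0.48},
]
print(best_strategy_per_n(toy_evals))
```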
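Derived related work
Work related to this dataset centers on ensembling methods and output post-processing strategies for mathematical reasoning models. For example, building on aggregation strategies such as weighted voting (`pred_weighted`) and majority voting (`pred_maj`), studies in this area explore how multiple sampled reasoning paths can improve performance on benchmarks such as MATH, MinervaMath, and olympiad-level problem sets. The same line of work connects to model self-improvement and to verification of reasoning steps.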
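Recent research on the dataset
Latest research directions
Evaluating and improving the mathematical reasoning of large language models is an active research area. This dataset focuses on analyzing the reasoning of the Qwen2.5-Math models, combining problem sets such as MATH500 and OlympiadBench to examine performance on hard math problems. Current work concentrates on strategies for aggregating reasoning paths, including comparisons of weighted voting and majority voting, with the aim of improving the accuracy and stability of model output. This work refines mathematical-reasoning benchmarks, provides empirical grounding for model self-improvement and iteration, and is relevant to applications at the intersection of education technology and artificial intelligence.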
The content above was collected and summarized by 遇见数据集.



