sibasmarakp/Qwen2.5-Math-7B-Instruct-Qwen2.5-Math-PRM-7B-best_of_n-completions

Hugging Face · Updated 2026-03-28 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/sibasmarakp/Qwen2.5-Math-7B-Instruct-Qwen2.5-Math-PRM-7B-best_of_n-completions
Resource description:
Dataset card metadata (dataset_info and configs), condensed from the YAML front matter.

Config names follow the pattern <benchmark>--T-0.7--top_p-0.8--n-8--seed-<seed>--agg_strategy-last, with an optional --evals suffix. Every config has a single train split whose data files are at <config_name>/train-*, and dataset_size equals num_bytes for every config.

Completion configs (train split):
benchmark | seed | num_examples | num_bytes | download_size
minervamath | 0 | 272 | 5567372 | 5227662
minervamath | 1 | 272 | 5575761 | 5209226
minervamath | 2 | 272 | 5523999 | 5142967
rebuttal-MATH500 | 0 | 500 | 8692078 | 8003652
rebuttal-MATH500 | 1 | 500 | 8716071 | 8034556
rebuttal-MATH500 | 2 | 500 | 8634089 | 7942429
rebuttal-OlympiadBench | 0 | 674 | 18274797 | 17197166
rebuttal-OlympiadBench | 1 | 674 | 18174947 | 17100163
rebuttal-OlympiadBench | 2 | 674 | 18429551 | 17355301
rebuttal-minervamath | 0 | 272 | 5567372 | 5227662
rebuttal-minervamath | 1 | 272 | 5575761 | 5209226
rebuttal-minervamath | 2 | 272 | 5523999 | 5142967

Evaluation configs (--evals suffix; each has num_examples 4 and num_bytes 128):
benchmark | seed | download_size
rebuttal-MATH500 | 0 | 2225
rebuttal-MATH500 | 1 | 2219
rebuttal-MATH500 | 2 | 2221
rebuttal-OlympiadBench | 0 | 2231
rebuttal-OlympiadBench | 1 | 2223
rebuttal-OlympiadBench | 2 | 2231
rebuttal-minervamath | 0 | 2229
rebuttal-minervamath | 1 | 2226
rebuttal-minervamath | 2 | 2227

Features shared by all completion configs:
completions: list of string
scores: list of list of float64
pred: string
completion_tokens: list of int64
agg_scores: list of float64
pred_weighted@k, pred_maj@k, pred_naive@k for k in 1, 2, 4, 8: string

Additional features by benchmark:
minervamath and rebuttal-minervamath: problem and answer (both string)
rebuttal-MATH500: problem, solution, answer, and subject (all string), level (int64), unique_id (string)
rebuttal-OlympiadBench: id (int64), problem (string), solution (list of string), answer (list of string), context ('null'), image_1 through image_9 ('null'), modality and difficulty (string), is_multiple_answer (bool), unit, answer_type, error, question_type, subfield, subject, and language (all string)

Features of the evaluation configs: n (int64), acc_naive, acc_weighted, and acc_maj (all float64)
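The config names in the metadata above encode the run settings, so they can be split into fields before choosing which config to load. The following is a minimal sketch, not part of the dataset card: `RunConfig` and `parse_config_name` are hypothetical names, and reading `T` as the sampling temperature and `n` as the number of completions per problem is an interpretation of the names rather than something the card states.

```python
import re
from typing import NamedTuple


class RunConfig(NamedTuple):
    """Fields recovered from one config name (hypothetical helper type)."""
    benchmark: str
    temperature: float   # assumed meaning of "T"
    top_p: float
    n: int               # assumed: completions sampled per problem
    seed: int
    agg_strategy: str
    is_evals: bool


# Pattern mirrors the names listed in the card, e.g.
# "rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals"
_CONFIG_PATTERN = re.compile(
    r"^(?P<benchmark>.+?)--T-(?P<T>[\d.]+)--top_p-(?P<top_p>[\d.]+)"
    r"--n-(?P<n>\d+)--seed-(?P<seed>\d+)"
    r"--agg_strategy-(?P<agg>[^-]+)(?P<evals>--evals)?$"
)


def parse_config_name(name: str) -> RunConfig:
    """Split a config name into its settings; raise if it does not fit the pattern."""
    match = _CONFIG_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized config name: {name!r}")
    return RunConfig(
        benchmark=match["benchmark"],
        temperature=float(match["T"]),
        top_p=float(match["top_p"]),
        n=int(match["n"]),
        seed=int(match["seed"]),
        agg_strategy=match["agg"],
        is_evals=match["evals"] is not None,
    )


cfg = parse_config_name(
    "rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals"
)
print(cfg.benchmark, cfg.seed, cfg.is_evals)  # rebuttal-MATH500 1 True
```

Parsing the names this way makes it straightforward to group configs by benchmark or seed without hard-coding all 21 strings.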
Provider:
sibasmarakp
Collected summary
Dataset introduction
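Construction
Improving how well large language models solve problems is an active research topic in mathematical reasoning. This dataset uses Qwen2.5-Math-7B-Instruct to generate multiple completions per problem on several math benchmarks, and Qwen2.5-Math-PRM-7B to score and aggregate them. Concretely, for each problem in MinervaMath, MATH500, and OlympiadBench, the generator samples several candidate solutions with temperature sampling and top-p truncation (the config names record T-0.7, top_p-0.8, and n-8), the reward model assigns each solution a quality score, and a final prediction is selected under each aggregation strategy.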
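The selection step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code that produced the dataset: the function names are hypothetical, the `last` aggregation is assumed to mean "use the final step score of each completion", and the naive, majority, and weighted rules follow the usual best-of-n definitions that the `pred_naive`, `pred_maj`, and `pred_weighted` field names suggest. Extracting a final answer from each completion is omitted; the answers are taken as already-extracted strings.

```python
from collections import Counter, defaultdict


def aggregate_last(step_scores):
    """Reduce each completion's per-step scores to its final step score
    (assumed meaning of agg_strategy "last")."""
    return [steps[-1] for steps in step_scores]


def pick_naive(answers, agg_scores):
    """Return the answer of the single highest-scoring completion."""
    best = max(range(len(answers)), key=lambda i: agg_scores[i])
    return answers[best]


def pick_majority(answers):
    """Return the most frequent answer, ignoring scores."""
    return Counter(answers).most_common(1)[0][0]


def pick_weighted(answers, agg_scores):
    """Sum scores per distinct answer and return the answer with the largest total."""
    totals = defaultdict(float)
    for answer, score in zip(answers, agg_scores):
        totals[answer] += score
    return max(totals, key=totals.get)


# Toy data, invented for illustration: six completions for one problem.
answers = ["12", "12", "12", "15", "15", "18"]
step_scores = [[0.6, 0.1], [0.5, 0.1], [0.4, 0.1],
               [0.2, 0.4], [0.3, 0.4], [0.9, 0.7]]

agg = aggregate_last(step_scores)
print(pick_naive(answers, agg))     # 18: one completion with the top score
print(pick_majority(answers))       # 12: the most common answer
print(pick_weighted(answers, agg))  # 15: largest summed score
```

The toy example is chosen so that the three rules disagree, which is the situation where comparing them is informative.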
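Features
The dataset's defining feature is that it records, in structured form, the multiple reasoning paths the model explored and how each was evaluated. Each sample contains the original problem and reference answer, the generated completions (`completions`), the corresponding score lists (`scores`, `agg_scores`), and the predictions aggregated under the naive, weighted, and majority-vote strategies. This makes it possible to analyze model uncertainty, answer consistency, and the effectiveness of the scoring mechanism, and it provides experimental material for research on ensembling and self-improvement in mathematical reasoning.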
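Usage
Researchers can use the dataset in several ways, for example to analyze how different aggregation strategies affect final-answer accuracy, or to examine how reward-model scores relate to answer correctness. The completions and scores can be used directly to train or evaluate new answer-selection or ensemble methods. Loading a specific config gives access to the data for a particular benchmark and seed; fields such as `pred_weighted@n` and `pred_maj@n` allow comparison across sample sizes (n = 1, 2, 4, 8), and the `--evals` configs provide accuracy metrics (`acc_naive`, `acc_weighted`, `acc_maj`) for quantitative study of model behavior.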
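As a rough sketch of that workflow, the code below loads one config with the Hugging Face `datasets` library and compares the three `pred_*@n` columns against `answer`. The helper names are hypothetical, and exact string matching is a simplifying assumption: it will likely undercount answers that are mathematically equivalent but formatted differently, so the numbers need not match the `acc_*` values in the `--evals` configs. It also assumes `answer` is a string, which holds for the MinervaMath and MATH500 configs but not for OlympiadBench, where `answer` is a list.

```python
REPO_ID = "sibasmarakp/Qwen2.5-Math-7B-Instruct-Qwen2.5-Math-PRM-7B-best_of_n-completions"
STRATEGIES = ("naive", "weighted", "maj")


def load_rows(config_name):
    """Load the train split of one config (needs the `datasets` package and network access)."""
    from datasets import load_dataset  # imported lazily so the rest runs offline
    return load_dataset(REPO_ID, config_name, split="train")


def accuracy_by_strategy(rows, n):
    """Fraction of rows whose pred_<strategy>@n equals the reference answer.

    Uses exact string comparison after stripping whitespace (an assumption,
    not the dataset's own grading procedure).
    """
    correct = {s: 0 for s in STRATEGIES}
    total = 0
    for row in rows:
        total += 1
        reference = row["answer"].strip()
        for s in STRATEGIES:
            if row[f"pred_{s}@{n}"].strip() == reference:
                correct[s] += 1
    if total == 0:
        raise ValueError("no rows to evaluate")
    return {s: correct[s] / total for s in STRATEGIES}


# Offline toy rows, invented for illustration, with the same column names as the dataset.
toy_rows = [
    {"answer": "4", "pred_naive@8": "4", "pred_weighted@8": "4", "pred_maj@8": "5"},
    {"answer": "7", "pred_naive@8": "8", "pred_weighted@8": "7", "pred_maj@8": "7"},
]
print(accuracy_by_strategy(toy_rows, n=8))

# With network access, the same function can be applied to a real config:
# rows = load_rows("rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last")
# for n in (1, 2, 4, 8):
#     print(n, accuracy_by_strategy(rows, n))
```

Running the function for n = 1, 2, 4, 8 gives one accuracy curve per strategy, which is the comparison across sample sizes that the `@n` fields are meant to support.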
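Background and challenges
Background overview
Qwen2.5-Math-7B-Instruct-Qwen2.5-Math-PRM-7B-best_of_n-completions is a dataset for studying how large language models perform on hard mathematical problems. It is published on Hugging Face by sibasmarakp and is built from outputs of the Qwen team's Qwen2.5-Math models on the MinervaMath, MATH500, and OlympiadBench benchmarks. For each problem, multiple candidate answers are generated and then ranked with a process reward model (PRM); the central research question is how to improve the accuracy and reliability of model outputs. The dataset reflects a shift in math-reasoning evaluation from matching a single answer to selecting the best of several reasoning paths, and it supports research on the interpretability and robustness of that selection process.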
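Current challenges
The dataset targets a core problem in mathematical reasoning: unstable model outputs and high error rates, which it addresses by generating multiple answers and using scores to choose the final prediction. Building such a dataset involves several difficulties. First, the problems span algebra, geometry, olympiad-style questions, and other areas, so diversity and difficulty stratification have to be maintained. Second, sampling multiple candidates must balance exploratory diversity against logical consistency, avoiding invalid or duplicate outputs. Third, a reward model depends on high-quality labeled training data, which is expensive to produce and can introduce subjective bias. Finally, the aggregation strategies (weighted voting, majority voting, and so on) need to be evaluated under multiple seed settings so that conclusions are robust rather than artifacts of a single run.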
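The multi-seed check mentioned in the last point can be summarized as a mean and a sample standard deviation over seeds. The sketch below is an assumption about how one might do this, not a procedure documented for the dataset; `summarize_over_seeds` is a hypothetical name and the accuracy values are placeholders, not results from the dataset.

```python
from statistics import mean, stdev


def summarize_over_seeds(acc_by_seed):
    """Return (mean, sample standard deviation) of accuracies keyed by seed.

    With a single seed the spread is reported as 0.0, since a sample
    standard deviation needs at least two values.
    """
    values = list(acc_by_seed.values())
    if not values:
        raise ValueError("need at least one seed")
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


# Placeholder accuracies for seeds 0, 1, 2 (illustrative only).
placeholder = {0: 0.5, 1: 0.75, 2: 1.0}
avg, spread = summarize_over_seeds(placeholder)
print(f"mean={avg:.3f} std={spread:.3f}")  # mean=0.750 std=0.250
```

With only three seeds (0, 1, 2) per benchmark, this gives a rough indication of run-to-run spread rather than a formal significance test.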
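Common scenarios
Typical use cases
In mathematical reasoning and LLM evaluation, this dataset provides generated completions and their scores for many math problems, making it a useful resource for studying model behavior on hard mathematical tasks. Its typical use is to evaluate and compare aggregation strategies (weighted voting, majority voting, and naive selection) for how much they improve the accuracy and robustness of model outputs, particularly on difficult problems such as MATH500 and olympiad-level competition questions.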
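Derived and related work
Work related to this dataset centers on ensembling methods and output post-processing strategies for math-reasoning models. For example, building on aggregation strategies such as weighted voting (`pred_weighted`) and majority voting (`pred_maj`), follow-up studies examine how multiple generated reasoning paths can improve performance on benchmarks such as MATH, MinervaMath, and olympiad competitions. This line of work also bears on model self-improvement and the verification of reasoning processes.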
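Recent research on the dataset
Latest research directions
Evaluating and improving the capabilities of large language models is an active topic in mathematical reasoning. This dataset focuses on analyzing the reasoning of Qwen2.5-Math models across several problem sets, including MATH500 and OlympiadBench, to study their performance on hard math problems. Current research centers on strategies for aggregating reasoning paths, comparing methods such as weighted voting and majority voting with the aim of improving the accuracy and stability of model outputs. This work makes math-reasoning benchmarks more fine-grained and provides an empirical basis for model self-improvement and iteration, with potential relevance to applications at the intersection of education technology and AI.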
The content above was collected and summarized by 遇见数据集.