
sibasmarakp/Qwen2.5-Math-7B-Instruct-Skywork-o1-Open-PRM-Qwen-2.5-7B-best_of_n-completions

Source: Hugging Face. Updated 2026-03-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/sibasmarakp/Qwen2.5-Math-7B-Instruct-Skywork-o1-Open-PRM-Qwen-2.5-7B-best_of_n-completions
Resource description:
---
Dataset card metadata (dataset_info and configs), with repeated fields listed once. Every config has a single split, train, whose data files are at <config_name>/train-*. In every config, num_bytes equals dataset_size.

Leading features of the completion configs, by benchmark:
- minervamath and rebuttal-minervamath: problem (string), answer (string)
- rebuttal-MATH500: problem (string), solution (string), answer (string), subject (string), level (int64), unique_id (string)
- rebuttal-OlympiadBench: id (int64), problem (string), solution (list of string), answer (list of string), context ('null'), image_1 to image_9 ('null'), modality (string), difficulty (string), is_multiple_answer (bool), unit (string), answer_type (string), error (string), question_type (string), subfield (string), subject (string), language (string)

Features that follow in every completion config:
- completions: list of string
- scores: list of list of float64
- pred: string
- completion_tokens: list of int64
- agg_scores: list of float64
- pred_weighted@n, pred_maj@n, pred_naive@n for n = 1, 2, 4, 8: string

Features of every --evals config:
- n: int64
- acc_naive: float64
- acc_weighted: float64
- acc_maj: float64

Configs (config_name: num_examples, num_bytes, download_size):
- minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last: 272, 6115933, 5758424
- minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last: 272, 6068790, 5690949
- rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last: 500, 9756025, 9071631
- rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals: 4, 128, 2224
- rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last: 500, 9808457, 9121686
- rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals: 4, 128, 2225
- rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last: 500, 9708698, 9020325
- rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals: 4, 128, 2224
- rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last: 674, 20072189, 18986230
- rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals: 4, 128, 2231
- rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last: 674, 20005954, 56740641
- rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals: 4, 128, 2231
- rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last: 674, 20311908, 57752121
- rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals: 4, 128, 2231
- rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last: 272, 6101975, 5765845
- rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals: 4, 128, 2221
- rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last: 272, 6115933, 5758424
- rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals: 4, 128, 2227
- rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last: 272, 6068790, 5690949
- rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals: 4, 128, 2229
---
Provided by:
sibasmarakp
Collected summary
Dataset introduction
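Construction
In mathematical reasoning, datasets are often built by systematically sampling the outputs of large language models. This dataset uses Qwen2.5-Math-7B-Instruct to generate 8 independent completions per problem, at temperature 0.7 and top-p 0.8, on the MinervaMath, MATH500, and OlympiadBench benchmarks. Generation is repeated under random seeds 0, 1, and 2 (the two configs without the rebuttal- prefix cover MinervaMath with seeds 1 and 2 only), which diversifies the sampled reasoning paths. The scores of each completion are reduced to a single value with the "last" aggregation strategy (the agg_scores field). Each record holds the original problem, the reference answer, the completions, and their scores.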
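The source does not define the "last" aggregation strategy. A minimal sketch, assuming it means "keep the score of the final reasoning step of each completion", so that `scores` (one list of step scores per completion) maps to `agg_scores` (one value per completion). The function names and the toy values are illustrative, not taken from the dataset.

```python
from typing import List


def aggregate_last(step_scores: List[float]) -> float:
    """Reduce one completion's per-step scores to its final step score.

    Assumption: "last" means the last element of the step-score list.
    """
    if not step_scores:
        raise ValueError("a completion needs at least one step score")
    return step_scores[-1]


def aggregate_row(scores: List[List[float]]) -> List[float]:
    """Apply the reduction to every completion of one problem."""
    return [aggregate_last(step_scores) for step_scores in scores]


# Toy input: three completions with 3, 2, and 1 scored steps.
toy_scores = [[0.9, 0.8, 0.2], [0.4, 0.7], [0.6]]
print(aggregate_row(toy_scores))  # [0.2, 0.7, 0.6]
```

Under this reading, a completion that starts well but ends on a low-scoring step is ranked by that final step alone.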
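Features
The dataset's core feature is its multi-level structure: it covers problems ranging from basic to competition level and provides several reasoning paths per problem. Each record carries the problem statement, the reference answer, the list of completions, a score matrix (one list of scores per completion), and predictions from several answer-selection methods. Specifically, it includes weighted voting (pred_weighted@n), majority voting (pred_maj@n), and naive selection (pred_naive@n), each computed for n = 1, 2, 4, and 8 completions. The companion --evals configs report the corresponding accuracies, which supports the study of how robust and consistent the model's decisions are.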
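The three selection methods can be sketched as follows. This is an illustration under stated assumptions, not the pipeline that produced the dataset: it assumes the final answer of each completion has already been extracted and normalized to a comparable string, that "naive" means picking the single highest-scoring completion, and that each method at n uses the first n completions. All function names and toy values are hypothetical.

```python
from collections import defaultdict
from typing import List


def _check(n: int, size: int) -> None:
    if not 1 <= n <= size:
        raise ValueError(f"n must be between 1 and {size}")


def pred_naive(answers: List[str], agg_scores: List[float], n: int) -> str:
    """Answer of the single highest-scoring completion among the first n."""
    _check(n, len(answers))
    best = max(range(n), key=lambda i: agg_scores[i])
    return answers[best]


def pred_maj(answers: List[str], n: int) -> str:
    """Most frequent answer among the first n (ties go to the first seen)."""
    _check(n, len(answers))
    counts = defaultdict(int)
    for answer in answers[:n]:
        counts[answer] += 1
    return max(counts, key=counts.get)


def pred_weighted(answers: List[str], agg_scores: List[float], n: int) -> str:
    """Answer with the largest summed score among the first n."""
    _check(n, len(answers))
    totals = defaultdict(float)
    for answer, score in zip(answers[:n], agg_scores[:n]):
        totals[answer] += score
    return max(totals, key=totals.get)


# Toy data: 8 extracted answers and their aggregated scores.
answers = ["4", "5", "5", "4", "7", "5", "4", "4"]
agg_scores = [0.30, 0.35, 0.30, 0.20, 0.90, 0.30, 0.10, 0.10]
for n in (1, 2, 4, 8):
    print(n, pred_naive(answers, agg_scores, n),
          pred_maj(answers, n), pred_weighted(answers, agg_scores, n))
# At n = 8 the three methods disagree: naive "7", majority "4", weighted "5".
```

The toy data is chosen so that the methods diverge: one confident outlier wins naive selection, the most common answer wins the majority vote, and a moderately common, moderately scored answer wins the weighted vote.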
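Usage
Researchers can use the dataset to study how large language models behave on mathematical reasoning tasks. Loading a specific config, for example "rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last", gives access to the full generation data for that benchmark. The data supports quality analysis of the completions, comparison of answer-selection methods by accuracy, and, through the accompanying --evals configs, direct lookup of accuracy at each sample count. This makes it a useful resource for evaluating and improving mathematical reasoning, and for studying uncertainty modeling and the effectiveness of ensemble methods.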
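A minimal sketch of loading one config with the Hugging Face `datasets` library. The helper names `config_name` and `load_config` are hypothetical; the repository id is taken from the page title, and the name pattern follows the config names listed in the metadata. It assumes the repository is reachable from the default Hub endpoint rather than the mirror linked above.

```python
REPO_ID = (
    "sibasmarakp/Qwen2.5-Math-7B-Instruct-Skywork-o1-Open-PRM-"
    "Qwen-2.5-7B-best_of_n-completions"
)


def config_name(benchmark: str, seed: int, *, rebuttal: bool = True,
                evals: bool = False, temperature: float = 0.7,
                top_p: float = 0.8, n: int = 8,
                agg_strategy: str = "last") -> str:
    """Build a config name following the pattern used in the metadata."""
    prefix = "rebuttal-" if rebuttal else ""
    name = (f"{prefix}{benchmark}--T-{temperature}--top_p-{top_p}"
            f"--n-{n}--seed-{seed}--agg_strategy-{agg_strategy}")
    return name + ("--evals" if evals else "")


def load_config(benchmark: str, seed: int, **kwargs):
    """Download one config's train split (requires network access)."""
    from datasets import load_dataset  # imported lazily; pip install datasets
    return load_dataset(REPO_ID, config_name(benchmark, seed, **kwargs),
                        split="train")


print(config_name("MATH500", 0))
# rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
```

For example, `load_config("MATH500", 0)` would fetch the 500 completion records for seed 0, and `load_config("MATH500", 0, evals=True)` the 4-row accuracy table for the same run.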
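Background and challenges
Background overview
Mathematical reasoning is a key dimension for assessing large language models. The Qwen2.5-Math-7B-Instruct-Skywork-o1-Open-PRM-Qwen-2.5-7B-best_of_n-completions dataset is intended for evaluating and improving model performance on complex math problems. It is published on Hugging Face by sibasmarakp. As its name indicates, the completions come from Qwen2.5-Math-7B-Instruct and are scored with the process reward model (PRM) Skywork-o1-Open-PRM-Qwen-2.5-7B under best-of-n sampling, which yields several solutions per problem. Its central research question is how multi-path reasoning and answer aggregation can improve accuracy and robustness on competition-level and advanced math problems. It also serves as a benchmark resource for model optimization.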
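Current challenges
The dataset targets a core challenge in mathematical reasoning: improving accuracy and generalization on complex, multi-step problems. Models must handle diverse subfields such as algebra, geometry, and number theory, and cope with the abstract logic of hard competition problems. Construction also raises data-quality issues, such as balancing the diversity of generated solutions against their correctness, and designing answer-aggregation strategies (for example weighted voting and majority voting) that reduce sampling noise. Dataset scale and compute cost are a further practical bottleneck, so performance has to be estimated reliably from a limited number of samples.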
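Common scenarios
Typical use cases
This dataset combines several math problem-solving tasks and serves as a typical example of evaluating the reasoning ability of large language models. Qwen2.5-Math-7B-Instruct generates multiple candidate answers per problem, and a scoring mechanism selects the best one, which gives a systematic test of model performance on complex problems. The problems range from basic exercises to Olympiad-level competition questions, giving researchers a standardized tool for measuring the depth of a model's mathematical reasoning.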
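Derived and related work
Work derived from this dataset focuses mainly on optimizing reasoning-aggregation algorithms and on cross-domain transfer. For example, researchers have proposed a dynamic threshold selection method based on its weighted scoring mechanism, improving generalization on open-ended math problems. Other work extends the evaluation framework to scientific reasoning tasks such as physics and chemistry, verifying that similar methods apply to structured problem solving and supporting the construction of cross-disciplinary evaluation systems.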
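Recent research on the dataset
Latest research directions
Evaluating and optimizing the mathematical reasoning of large language models is an active research focus. By combining the MinervaMath, MATH500, and OlympiadBench problem sets with repeated sampling and score aggregation, this dataset provides ample material for analyzing output quality. Current research examines how different selection methods (such as weighted voting and majority voting) affect prediction accuracy, in order to characterize the stability and generalization of model reasoning on complex math problems. This direction addresses the demand for interpretable and reliable models, and provides an empirical basis for later fine-tuning and algorithmic improvements.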
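Comparing selection methods across seeds can be sketched as below. The row layout (n, acc_naive, acc_weighted, acc_maj) follows the --evals schema in the metadata; the accuracy values are made-up placeholders, not results from the dataset, and treating accuracies as fractions in [0, 1] is an assumption. The function name `summarize` is hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

METHODS = ("acc_naive", "acc_weighted", "acc_maj")


def summarize(evals_by_seed: List[List[dict]]) -> Dict[int, Dict[str, Tuple[float, float]]]:
    """Return {n: {method: (mean, sample std)}} computed across seeds."""
    collected: Dict[int, Dict[str, List[float]]] = {}
    for rows in evals_by_seed:          # one evals table per seed
        for row in rows:                # one row per sample count n
            for method in METHODS:
                collected.setdefault(row["n"], {}).setdefault(method, []).append(row[method])
    return {
        n: {m: (mean(v), stdev(v) if len(v) > 1 else 0.0) for m, v in methods.items()}
        for n, methods in sorted(collected.items())
    }


# Placeholder tables for two seeds (NOT real results).
seed_a = [{"n": 1, "acc_naive": 0.5, "acc_weighted": 0.5, "acc_maj": 0.5},
          {"n": 8, "acc_naive": 0.6, "acc_weighted": 0.7, "acc_maj": 0.65}]
seed_b = [{"n": 1, "acc_naive": 0.5, "acc_weighted": 0.5, "acc_maj": 0.5},
          {"n": 8, "acc_naive": 0.8, "acc_weighted": 0.7, "acc_maj": 0.75}]

for n, methods in summarize([seed_a, seed_b]).items():
    for method, (avg, spread) in methods.items():
        print(f"n={n} {method}: {avg:.3f} +/- {spread:.3f}")
```

Reporting the spread across seeds alongside the mean helps separate a real difference between methods from sampling noise.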
The above content was collected and summarized by 遇见数据集.