sibasmarakp/Qwen2.5-Math-7B-Instruct-Qwen2.5-14B-Instruct-SupervisedPRM-T80-adapters-best_of_n-completions
Source: Hugging Face. Updated 2026-03-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/sibasmarakp/Qwen2.5-Math-7B-Instruct-Qwen2.5-14B-Instruct-SupervisedPRM-T80-adapters-best_of_n-completions
Resource description:
---
dataset_info:
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
features:
- name: problem
dtype: string
- name: solution
dtype: string
- name: answer
dtype: string
- name: subject
dtype: string
- name: level
dtype: int64
- name: unique_id
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 8907827
num_examples: 500
download_size: 8201919
dataset_size: 8907827
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2216
dataset_size: 128
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last
features:
- name: problem
dtype: string
- name: solution
dtype: string
- name: answer
dtype: string
- name: subject
dtype: string
- name: level
dtype: int64
- name: unique_id
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 8945754
num_examples: 500
download_size: 8242349
dataset_size: 8945754
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2215
dataset_size: 128
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last
features:
- name: problem
dtype: string
- name: solution
dtype: string
- name: answer
dtype: string
- name: subject
dtype: string
- name: level
dtype: int64
- name: unique_id
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 8890288
num_examples: 500
download_size: 8165910
dataset_size: 8890288
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2217
dataset_size: 128
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
features:
- name: id
dtype: int64
- name: problem
dtype: string
- name: solution
list: string
- name: answer
list: string
- name: context
dtype: 'null'
- name: image_1
dtype: 'null'
- name: image_2
dtype: 'null'
- name: image_3
dtype: 'null'
- name: image_4
dtype: 'null'
- name: image_5
dtype: 'null'
- name: image_6
dtype: 'null'
- name: image_7
dtype: 'null'
- name: image_8
dtype: 'null'
- name: image_9
dtype: 'null'
- name: modality
dtype: string
- name: difficulty
dtype: string
- name: is_multiple_answer
dtype: bool
- name: unit
dtype: string
- name: answer_type
dtype: string
- name: error
dtype: string
- name: question_type
dtype: string
- name: subfield
dtype: string
- name: subject
dtype: string
- name: language
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 19152794
num_examples: 674
download_size: 54117216
dataset_size: 19152794
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2223
dataset_size: 128
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last
features:
- name: id
dtype: int64
- name: problem
dtype: string
- name: solution
list: string
- name: answer
list: string
- name: context
dtype: 'null'
- name: image_1
dtype: 'null'
- name: image_2
dtype: 'null'
- name: image_3
dtype: 'null'
- name: image_4
dtype: 'null'
- name: image_5
dtype: 'null'
- name: image_6
dtype: 'null'
- name: image_7
dtype: 'null'
- name: image_8
dtype: 'null'
- name: image_9
dtype: 'null'
- name: modality
dtype: string
- name: difficulty
dtype: string
- name: is_multiple_answer
dtype: bool
- name: unit
dtype: string
- name: answer_type
dtype: string
- name: error
dtype: string
- name: question_type
dtype: string
- name: subfield
dtype: string
- name: subject
dtype: string
- name: language
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 19032707
num_examples: 674
download_size: 53854413
dataset_size: 19032707
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2231
dataset_size: 128
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last
features:
- name: id
dtype: int64
- name: problem
dtype: string
- name: solution
list: string
- name: answer
list: string
- name: context
dtype: 'null'
- name: image_1
dtype: 'null'
- name: image_2
dtype: 'null'
- name: image_3
dtype: 'null'
- name: image_4
dtype: 'null'
- name: image_5
dtype: 'null'
- name: image_6
dtype: 'null'
- name: image_7
dtype: 'null'
- name: image_8
dtype: 'null'
- name: image_9
dtype: 'null'
- name: modality
dtype: string
- name: difficulty
dtype: string
- name: is_multiple_answer
dtype: bool
- name: unit
dtype: string
- name: answer_type
dtype: string
- name: error
dtype: string
- name: question_type
dtype: string
- name: subfield
dtype: string
- name: subject
dtype: string
- name: language
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 19252700
num_examples: 674
download_size: 54393078
dataset_size: 19252700
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2231
dataset_size: 128
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
features:
- name: problem
dtype: string
- name: answer
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 5788223
num_examples: 272
download_size: 5428225
dataset_size: 5788223
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2227
dataset_size: 128
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last
features:
- name: problem
dtype: string
- name: answer
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 5816896
num_examples: 272
download_size: 5456758
dataset_size: 5816896
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2234
dataset_size: 128
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last
features:
- name: problem
dtype: string
- name: answer
dtype: string
- name: completions
list: string
- name: scores
list:
list: float64
- name: pred
dtype: string
- name: completion_tokens
list: int64
- name: agg_scores
list: float64
- name: pred_weighted@1
dtype: string
- name: pred_maj@1
dtype: string
- name: pred_naive@1
dtype: string
- name: pred_weighted@2
dtype: string
- name: pred_maj@2
dtype: string
- name: pred_naive@2
dtype: string
- name: pred_weighted@4
dtype: string
- name: pred_maj@4
dtype: string
- name: pred_naive@4
dtype: string
- name: pred_weighted@8
dtype: string
- name: pred_maj@8
dtype: string
- name: pred_naive@8
dtype: string
splits:
- name: train
num_bytes: 5795235
num_examples: 272
download_size: 5402120
dataset_size: 5795235
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals
features:
- name: n
dtype: int64
- name: acc_naive
dtype: float64
- name: acc_weighted
dtype: float64
- name: acc_maj
dtype: float64
splits:
- name: train
num_bytes: 128
num_examples: 4
download_size: 2228
dataset_size: 128
configs:
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
data_files:
- split: train
path: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last/train-*
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals/train-*
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last
data_files:
- split: train
path: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last/train-*
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals/train-*
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last
data_files:
- split: train
path: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last/train-*
- config_name: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals/train-*
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
data_files:
- split: train
path: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last/train-*
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals/train-*
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last
data_files:
- split: train
path: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last/train-*
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals/train-*
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last
data_files:
- split: train
path: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last/train-*
- config_name: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-OlympiadBench--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals/train-*
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
data_files:
- split: train
path: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last/train-*
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last--evals/train-*
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last
data_files:
- split: train
path: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last/train-*
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-1--agg_strategy-last--evals/train-*
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last
data_files:
- split: train
path: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last/train-*
- config_name: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals
data_files:
- split: train
path: rebuttal-minervamath--T-0.7--top_p-0.8--n-8--seed-2--agg_strategy-last--evals/train-*
---
Provider:
sibasmarakp
Collected summary
Dataset introduction

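Construction
Building a dataset for mathematical reasoning calls for careful sampling and evaluation. This dataset combines three well-known math benchmarks, MATH500, OlympiadBench, and minervamath, and samples 8 candidate solutions per problem at temperature 0.7 and top-p 0.8. Each solution is generated by Qwen2.5-Math-7B-Instruct and scored by a supervised process reward model (PRM), which the dataset name suggests is built on Qwen2.5-14B-Instruct with adapters. Sampling is repeated with three random seeds (0, 1, 2) to check robustness, and the "last" aggregation strategy reduces the scores of each completion to a single value. The result is structured data containing problems, completions, scores, and several kinds of predictions.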
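The "last" aggregation mentioned above can be sketched as follows. This is a minimal illustration, assuming that each inner list in `scores` holds the per-step PRM scores of one completion and that `agg_scores` keeps only the final step's score; the strategy name suggests this, but the source does not spell it out. The function names and the sample values are hypothetical.

```python
from typing import List


def aggregate_last(step_scores: List[float]) -> float:
    """Use the score of the final step as the completion-level score."""
    if not step_scores:
        raise ValueError("a completion needs at least one step score")
    return step_scores[-1]


def aggregate_row(scores: List[List[float]]) -> List[float]:
    """Aggregate every completion of one problem (one dataset row)."""
    return [aggregate_last(steps) for steps in scores]


# Hypothetical per-step scores for three completions of one problem.
example_scores = [[0.9, 0.8, 0.3], [0.6, 0.7, 0.95], [0.5]]
example_agg = aggregate_row(example_scores)
print(example_agg)  # [0.3, 0.95, 0.5]
```

Under this assumption a completion that starts well but ends on a weak step receives a low aggregated score, which is the behavior one would want if the last step carries the final answer.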
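Features
The dataset covers problems from secondary-school to competition level across subjects such as algebra and geometry. Every configuration stores the reference answer and the model-generated candidate completions; the MATH500 and OlympiadBench configurations also include reference solutions. Each completion has a list of scores from the supervised PRM (`scores`) that quantifies its quality, plus an aggregated score (`agg_scores`). Predictions under the weighted, majority-vote, and naive strategies are provided for n = 1, 2, 4, and 8 samples (`pred_weighted@n`, `pred_maj@n`, `pred_naive@n`), which supports studying how output diversity and accuracy change with the sampling budget.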
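The three prediction strategies can be illustrated with a short sketch. It assumes the usual best-of-n definitions: naive picks the answer of the single highest-scored completion, majority picks the most frequent answer, and weighted sums the aggregated scores per distinct answer. The source does not define them, so treat these definitions, the function names, and the sample data as assumptions.

```python
from collections import Counter, defaultdict
from typing import List


def pick_naive(answers: List[str], agg_scores: List[float]) -> str:
    """Answer of the single completion with the highest aggregated score."""
    best = max(range(len(answers)), key=lambda i: agg_scores[i])
    return answers[best]


def pick_majority(answers: List[str]) -> str:
    """Most frequent answer; scores are ignored."""
    return Counter(answers).most_common(1)[0][0]


def pick_weighted(answers: List[str], agg_scores: List[float]) -> str:
    """Answer whose completions have the largest total score."""
    totals = defaultdict(float)
    for answer, score in zip(answers, agg_scores):
        totals[answer] += score
    return max(totals, key=totals.get)


# Hypothetical extracted answers and aggregated scores for n = 6 completions.
answers = ["12", "15", "15", "12", "15", "9"]
agg = [0.8, 0.3, 0.1, 0.7, 0.2, 0.9]

print(pick_naive(answers, agg))     # 9
print(pick_majority(answers))       # 15
print(pick_weighted(answers, agg))  # 12
```

A `pred_*@n` value would then presumably come from applying the same function to the first n completions, for example `pick_weighted(answers[:2], agg[:2])`.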
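Usage
The dataset is intended for evaluating and comparing mathematical reasoning models. Researchers can load a specific configuration, for example the MATH500 or OlympiadBench samples, and read the problems, candidate completions, and scores directly. Analyzing the answers chosen by each aggregation strategy (`pred_weighted@n`, `pred_maj@n`, and so on) shows how well the best completion is selected. The `--evals` configurations contain precomputed accuracies (`acc_naive`, `acc_weighted`, `acc_maj`), so the strategies can be compared without recomputation. Because all configurations use a single sampling setting (temperature 0.7, top-p 0.8), studying the effect of temperature or top-p on diversity requires comparison with data sampled at other settings. The data can also serve as training material for improving reward models and inference strategies.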
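Loading one configuration might look like the sketch below. It assumes the Hugging Face `datasets` library is installed and that the repository is reachable on the Hub under the name shown in the page title; the helper names `config_name` and `load_config` are hypothetical. The configuration string follows the pattern visible in the card metadata.

```python
REPO_ID = (
    "sibasmarakp/Qwen2.5-Math-7B-Instruct-Qwen2.5-14B-Instruct-"
    "SupervisedPRM-T80-adapters-best_of_n-completions"
)


def config_name(benchmark: str, seed: int, evals: bool = False) -> str:
    """Build a configuration name following the pattern in the card metadata."""
    name = (
        f"rebuttal-{benchmark}--T-0.7--top_p-0.8--n-8"
        f"--seed-{seed}--agg_strategy-last"
    )
    return (name + "--evals") if evals else name


def load_config(benchmark: str, seed: int, evals: bool = False):
    """Download one configuration; needs network access and `datasets`."""
    from datasets import load_dataset  # imported lazily so the helpers work offline

    return load_dataset(REPO_ID, config_name(benchmark, seed, evals), split="train")


print(config_name("MATH500", 0))
# rebuttal-MATH500--T-0.7--top_p-0.8--n-8--seed-0--agg_strategy-last
```

Calling `load_config("MATH500", 0)` would return the 500-row completions table, and `load_config("MATH500", 0, evals=True)` the 4-row accuracy table, according to the split sizes listed in the metadata.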
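Background and challenges
Background overview
In the evaluation of mathematical reasoning in large language models (LLMs), the dataset Qwen2.5-Math-7B-Instruct-Qwen2.5-14B-Instruct-SupervisedPRM-T80-adapters-best_of_n-completions records how a model performs on hard math problems under best-of-n sampling. It is published on Hugging Face by the user sibasmarakp and relies on Qwen models for generation and scoring. The central question is how a supervised process reward model (Supervised PRM) combined with adapters can raise reasoning accuracy on MATH500, OlympiadBench, and minervamath. The dataset reflects a shift in math evaluation from matching a single answer toward combining multiple sampled solutions with reward-model scores, and it provides data for studying fine-tuning and ensembling strategies.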
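Current challenges
The dataset addresses the reliability and consistency of model-generated answers in mathematical reasoning, specifically how to select the best answer from several candidates. Four challenges arise in building it. First, math problems are highly structured and abstract, so a generated solution must be not only correct but also logically rigorous. Second, combining several benchmarks (such as MATH500 and OlympiadBench) requires unifying problem formats and scoring criteria so that the data stay consistent and comparable. Third, training a supervised PRM depends on high-quality labeled supervision, which is costly to obtain and prone to subjective bias. Fourth, adapters add flexibility, but balancing parameter efficiency against model performance remains an open problem.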
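Common scenarios
Classic use case
Evaluating and improving LLM performance is a core concern in mathematical reasoning research. By combining MATH500, OlympiadBench, and MinervaMath, the dataset provides multiple sampled completions per problem together with their scores. Its classic use is to evaluate and compare aggregation strategies (such as weighted voting and majority voting) by how much they improve answer accuracy, giving researchers a testbed for quantifying the diversity and reliability of model outputs.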
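Comparing strategies from the prediction columns could be sketched as below. The column names follow the schema in the card metadata, but exact string matching is a simplifying assumption (a real evaluation would likely check mathematical equivalence), and the rows are toy placeholders, not data from the dataset. In the OlympiadBench configurations `answer` is a list of strings, so this sketch applies only to the MATH500 and minervamath layouts.

```python
from typing import Dict, List


def accuracy(rows: List[Dict[str, str]], strategy: str, n: int) -> float:
    """Fraction of rows whose pred_<strategy>@<n> equals the reference answer."""
    if not rows:
        raise ValueError("no rows to evaluate")
    column = f"pred_{strategy}@{n}"
    correct = sum(1 for row in rows if row[column].strip() == row["answer"].strip())
    return correct / len(rows)


# Toy placeholder rows; the values are invented for illustration only.
toy_rows = [
    {"answer": "4", "pred_maj@8": "4", "pred_weighted@8": "4"},
    {"answer": "10", "pred_maj@8": "12", "pred_weighted@8": "10"},
    {"answer": "7", "pred_maj@8": "3", "pred_weighted@8": "3"},
    {"answer": "1", "pred_maj@8": "1", "pred_weighted@8": "1"},
]

print(accuracy(toy_rows, "maj", 8))       # 0.5
print(accuracy(toy_rows, "weighted", 8))  # 0.75
```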
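Derived and related work
Work related to this dataset concentrates on optimizing ensembling and voting strategies. Answer selection by weighted score aggregation is closely related to techniques such as Self-Consistency, and the dataset offers data for studying how generation diversity affects model performance, in line with ongoing work on strategies such as Best-of-N sampling for LLM reasoning tasks.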
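Recent research on the dataset
Latest research directions
Evaluating and improving LLM performance on mathematical reasoning is an active research topic. The dataset combines MATH500, OlympiadBench, and MinervaMath with a supervised process reward model and best-of-N sampling to examine generation quality and scoring on hard math problems. The main line of study compares how different aggregation strategies (weighted, majority vote, and naive selection) affect final prediction accuracy, with the aim of showing how well a model's own answer distribution agrees with external reward scores. This direction refines math reasoning benchmarks and provides empirical grounding for model calibration and trustworthy AI, which matters for reliability in scientific computing and educational applications.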
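Since each benchmark is sampled with three seeds, a natural way to compare strategies is the mean and standard deviation of accuracy across seeds. The sketch below assumes one accuracy value per seed has already been read from the `--evals` configurations at a fixed n; the numbers are placeholders invented for illustration, not results from the dataset, and `summarize` is a hypothetical helper.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize(per_seed_acc: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Map each strategy to (mean, sample standard deviation) across seeds."""
    return {name: (mean(vals), stdev(vals)) for name, vals in per_seed_acc.items()}


# Placeholder accuracies for seeds 0, 1, 2 at one fixed n (invented values).
placeholder = {
    "acc_naive": [0.50, 0.52, 0.54],
    "acc_weighted": [0.60, 0.62, 0.64],
    "acc_maj": [0.55, 0.57, 0.59],
}

summary = summarize(placeholder)
best = max(summary, key=lambda name: summary[name][0])
for name, (m, s) in summary.items():
    print(f"{name}: {m:.3f} +/- {s:.3f}")
print("highest mean:", best)
```

With only three seeds the standard deviation is a rough indicator, so small gaps between strategies should be read with caution.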
The above content was collected and summarized by 遇见数据集.