llm-compe-2025-kato/step2-evaluated-dataset-Qwen3-14B-cp40
Source: Hugging Face. Updated: 2025-08-21. Indexed: 2025-09-13.
Download link:
https://hf-mirror.com/datasets/llm-compe-2025-kato/step2-evaluated-dataset-Qwen3-14B-cp40
Description:
The Complete Evaluation Dataset (Rubric + LogP) contains chain-of-thought explanations evaluated with both a comprehensive rubric assessment and a LogP evaluation. It is sourced from llm-compe-2025-kato/step2-evaluated-dataset-Qwen3-14B-cp40 and has 58 samples in total: 53 were evaluated successfully with the Rubric method and 5 failed evaluation. The evaluation model is Qwen/Qwen3-32B. Each record includes the system prompt, the original question, the correct answer, the generated explanation, the detailed rubric evaluation results, a weighted rubric score (on a 0-1 scale), and a LogP evaluation score. The dataset can be used to train reward models, evaluate reasoning capabilities, study the relationship between rubric scores and LogP scores, and develop better evaluation metrics for mathematical reasoning.
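One of the listed uses, studying how rubric scores relate to LogP scores, can be sketched as follows. This is a minimal illustration under stated assumptions, not the dataset's documented schema: the field names `rubric_score` and `logp_score` are hypothetical, the sample values are invented for the demonstration, and failed rubric evaluations are assumed to appear as `None`.

```python
import math

# Invented toy records. The keys "rubric_score" and "logp_score" are
# hypothetical stand-ins; check the real column names before adapting this.
records = [
    {"rubric_score": 0.2, "logp_score": -3.0},
    {"rubric_score": 0.5, "logp_score": -2.0},
    {"rubric_score": None, "logp_score": -1.5},  # assumed shape of a failed rubric evaluation
    {"rubric_score": 0.8, "logp_score": -1.0},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def rubric_logp_correlation(rows):
    """Drop rows missing either score, then correlate the remaining pairs."""
    pairs = [
        (row["rubric_score"], row["logp_score"])
        for row in rows
        if row["rubric_score"] is not None and row["logp_score"] is not None
    ]
    if len(pairs) < 2:
        raise ValueError("need at least two fully scored rows")
    xs, ys = zip(*pairs)
    return len(pairs), pearson(xs, ys)


if __name__ == "__main__":
    used, r = rubric_logp_correlation(records)
    print(f"rows used: {used}, Pearson r: {r:.3f}")
```

Filtering before correlating matters here because the description reports 5 samples whose rubric evaluation failed; how those rows are actually encoded in the published files is not stated, so the `None` check may need adjusting.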
Provider:
llm-compe-2025-kato



