HFXM/LRM-Safety-evaluation-parsed
Source: Hugging Face. Updated 2026-04-26. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/HFXM/LRM-Safety-evaluation-parsed
Description:
This dataset contains data for evaluating the performance of multiple large language models (LLMs) on mathematical reasoning tasks. It is organized into multiple splits, one per model (e.g., DeepMath_Zero_7B, DeepSeek_R1, Qwen3_4B_Think, claude_haiku_4_5, gemini_3_flash_preview), and each split contains 41,215 examples. Each example includes a prompt index, a prompt ID, the scored model, the query, the model's generation, evaluation information, and chain-of-thought and answer score sequences produced by Claude and Gemini models. The dataset is designed to support comparison and analysis of mathematical reasoning capabilities across both open-source and closed-source models. Its total size is approximately 14.15 GB.
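As a rough illustration of how the per-example score sequences described above might be summarized, here is a minimal Python sketch. The field names `claude_cot_scores` and `claude_answer_scores` are hypothetical, since the listing names the fields only in prose, and the scores are assumed to be numeric. The `load_split` helper assumes the Hugging Face `datasets` package is installed and that the repository loads with its default configuration; neither is confirmed by this listing.

```python
from statistics import mean

# Hypothetical field names: the listing describes these fields in prose only,
# so check the actual column names before relying on them.
COT_KEY = "claude_cot_scores"
ANSWER_KEY = "claude_answer_scores"


def summarize_example(example, cot_key=COT_KEY, answer_key=ANSWER_KEY):
    """Return the mean chain-of-thought score and mean answer score of one example.

    Assumes both fields hold sequences of numbers; a missing or empty
    sequence yields None for that mean.
    """
    cot = example.get(cot_key) or []
    ans = example.get(answer_key) or []
    return {
        "mean_cot": mean(cot) if cot else None,
        "mean_answer": mean(ans) if ans else None,
    }


def load_split(split_name):
    """Load one model's split, e.g. "DeepSeek_R1" (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so the helper above works offline

    return load_dataset("HFXM/LRM-Safety-evaluation-parsed", split=split_name)


# In-memory record with made-up values, used only to show the helper's behavior.
sample = {COT_KEY: [1, 0, 1, 1], ANSWER_KEY: [1, 1]}
summary = summarize_example(sample)
print(summary)
```

Because each split corresponds to one model, applying `summarize_example` across a split loaded with `load_split` would give per-model averages that can be compared side by side, provided the real column names are substituted.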
Provider:
HFXM



