HFXM/LRM-Safety-evaluation-parsed

Name: HFXM/LRM-Safety-evaluation-parsed
Creator: HFXM
Published: 2026-04-26 07:05:12
License: Not specified

Source: Hugging Face. Updated: 2026-04-26. Indexed: 2026-05-03.

Download link:

https://hf-mirror.com/datasets/HFXM/LRM-Safety-evaluation-parsed

Description:

This dataset contains evaluation data for multiple large language models (LLMs) on mathematical reasoning tasks. It is organized into splits, one per evaluated model (e.g., DeepMath_Zero_7B, DeepSeek_R1, Qwen3_4B_Think, claude_haiku_4_5, gemini_3_flash_preview), and each split contains 41,215 examples. Each example includes a prompt index, a prompt ID, the scored model, the query, the model's generation, evaluation information, and chain-of-thought and answer score sequences based on Claude and Gemini models. The dataset is intended to support comparison and analysis of mathematical reasoning capabilities across open-source and closed-source models. Its total size is approximately 14.15 GB.
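As a rough illustration of how one split might be read and summarized, the sketch below streams a single per-model split with the Hugging Face `datasets` library and averages the final entry of a score sequence. The split name comes from the description above, but the column name `answer_scores` and the helper names are assumptions made for illustration; the actual column names should be checked against the dataset itself.

```python
from statistics import mean


def mean_final_score(records, score_field):
    """Average the last entry of each record's score sequence.

    Records whose sequence is missing or empty are skipped.
    Returns None when no record has a usable sequence.
    """
    finals = [r[score_field][-1] for r in records if r.get(score_field)]
    return mean(finals) if finals else None


def load_split(split_name, repo_id="HFXM/LRM-Safety-evaluation-parsed"):
    """Stream one per-model split instead of downloading the full dataset."""
    # Imported lazily so mean_final_score works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split_name, streaming=True)


# Hypothetical toy records shaped like the description above.
# "answer_scores" is an assumed column name, not a confirmed one.
toy_records = [
    {"query": "q1", "answer_scores": [0, 1, 1]},
    {"query": "q2", "answer_scores": [1, 0]},
    {"query": "q3", "answer_scores": []},
]
toy_mean = mean_final_score(toy_records, "answer_scores")
```

With network access, `load_split("DeepSeek_R1")` would return an iterable of examples that can be passed to `mean_final_score` once the real score column name is known. Streaming is used here because the full dataset is roughly 14.15 GB.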

Provider:

HFXM
