launch/thinkprm-1K-verification-cots

Name: launch/thinkprm-1K-verification-cots
Creator: launch
Published: 2026-04-18 23:46:17
License: No description available

Hugging Face · Updated 2026-04-18 · Indexed 2025-05-31

Download link:

https://hf-mirror.com/datasets/launch/thinkprm-1K-verification-cots

Description:

The ThinkPRM-Synthetic-Verification-1K dataset contains 1,000 high-quality synthetic verification chains-of-thought (CoTs) for training generative Process Reward Models (PRMs), as used in the paper "Process Reward Models that Think". It is meant as a data-efficient alternative to traditional PRM training, which typically requires extensive human annotation or expensive rollouts. Each instance consists of a math problem, a corresponding multi-step solution prefix (sourced from PRM800K), and a detailed verification CoT generated by the QwQ-32B-Preview model. The verification CoT critiques each step of the solution prefix and gives a step-level correctness judgment. To keep quality high, only chains whose step-level judgments all match the ground-truth human annotations in PRM800K were retained. The chains were also filtered for correct formatting and for length, to avoid problems such as the overthinking observed in unfiltered generations.
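The filtering rule described above (keep a chain only if every step-level judgment agrees with the human labels, and only if it is well-formed and not too long) can be sketched roughly as follows. This is an illustrative sketch, not the authors' pipeline: the `\boxed{correct}` / `\boxed{incorrect}` verdict markers, the function names `extract_verdicts` and `keep_chain`, and the `max_chars` threshold are all assumptions made for the example.

```python
import re
from typing import List

# ASSUMPTION: each step verdict appears in the CoT as \boxed{correct} or
# \boxed{incorrect}. The real dataset may use a different marker.
VERDICT_RE = re.compile(r"\\boxed\{(correct|incorrect)\}")


def extract_verdicts(cot: str) -> List[bool]:
    """Return the step-level verdicts found in a verification CoT, in order."""
    return [m.group(1) == "correct" for m in VERDICT_RE.finditer(cot)]


def keep_chain(cot: str, gold: List[bool], max_chars: int = 4000) -> bool:
    """Decide whether a synthetic verification CoT passes the filter.

    gold      -- human step labels for the solution prefix (True = correct)
    max_chars -- hypothetical length cap standing in for the length constraint
    """
    # Length constraint: drop very long (overthinking) chains.
    if len(cot) > max_chars:
        return False
    verdicts = extract_verdicts(cot)
    # Format constraint: exactly one verdict per labeled step.
    if len(verdicts) != len(gold):
        return False
    # Agreement constraint: every step-level judgment must match the label.
    return verdicts == gold


example_cot = (
    "Step 1 applies the distributive law properly. \\boxed{correct}\n"
    "Step 2 drops a minus sign. \\boxed{incorrect}"
)
print(keep_chain(example_cot, [True, False]))
print(keep_chain(example_cot, [True, True]))
```

In this sketch a chain is rejected as soon as any one check fails, so a single disagreeing step is enough to discard the whole chain, which mirrors the "all step-level judgments match" criterion in the description.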

Provider:

launch
