
PairJudge-432K

Source: ModelScope community. Updated 2025-10-09; indexed 2025-07-12.
Download link: https://modelscope.cn/datasets/THU-KEG/PairJudge-432K
# PAIRJUDGE-432K Dataset

**PAIRJUDGE-432K** is a large-scale dataset of 432K annotated pairwise judgments for training reward models on mathematical reasoning tasks. Each sample is a prompt–completion pair: the prompt contains a math problem and two candidate solutions, and the completion is a chain-of-thought reasoning text that evaluates the correctness of both responses.

- Paper: https://arxiv.org/abs/2501.13007
- Code: https://github.com/THU-KEG/PairJudgeRM
- Model: https://huggingface.co/THU-KEG/PairJudge-432K

## Overview

- **Purpose:** To provide high-quality annotated pairwise comparisons for training models such as PairJudge RM, enabling robust Best-of-N sampling through pairwise judgment.
- **Content:** 432K samples derived from the NuminaMath dataset and annotated with gemini-1.5-flash.
- **Format:** Each example is a prompt–completion pair:
  - **Prompt:** Contains a math problem *x* along with two candidate solutions (*y₁* and *y₂*).
  - **Completion:** A chain-of-thought reasoning text that verifies the correctness of each candidate and outputs the corresponding correctness labels (*c₁* and *c₂*).

## Dataset Format & Structure

The dataset is organized into prompt–completion pairs that follow a consistent template:

- **Prompt Template:**
  - **Question:** A math problem drawn from diverse sources (e.g., AMC/AIME, AoPS Forum, Chinese K-12, Olympiad problems).
  - **Responses:** Two candidate solutions generated by large language models.
- **Completion:**
  - A detailed chain-of-thought reasoning process that:
    - Verifies each step of the candidate solutions.
    - Checks mathematical accuracy, logical consistency, and completeness.
    - Provides final correctness judgments for both responses.

## Construction Pipeline

1. **Data Collection:**
   - Math problems were collected from the NuminaMath dataset, which includes problems from various educational and competitive sources.
2. **Candidate Generation:**
   - For each math problem, multiple candidate solutions were generated (using models such as Llama-3.1-8B-Instruct) to provide a diverse set of responses.
3. **Annotation via Gemini-1.5-Flash:**
   - The candidate solutions were then evaluated pairwise using gemini-1.5-flash.
   - A knockout tournament was run to record pairwise judgments, where each match compares two candidate solutions.
4. **Filtering & Finalization:**
   - Out of over 2.2M recorded comparisons, 1.3M high-quality judgments were retained.
   - Further filtering was applied to ensure adherence to the specified prompt template, resulting in the final 432K high-quality training samples.

## Dataset Statistics

The dataset aggregates problems from multiple sources. Although the original NuminaMath dataset includes nearly 860K problems, strict filtering for clarity, formatting, and response quality resulted in the final 432K examples. Key statistics include:

- **Total Samples:** 432K prompt–completion pairs.
- **Sources:** Problems from AMC/AIME, AoPS Forum, Chinese K-12, Olympiad, and other math problem datasets.
- **Annotations:** Each pair includes detailed chain-of-thought reasoning along with binary correctness labels for the two candidate responses.

## Usage

The PAIRJUDGE-432K dataset is well suited for:

- Training and fine-tuning reward models that perform pairwise comparisons.
- Enhancing Best-of-N sampling techniques in mathematical reasoning tasks.
- Research in model evaluation where transparent reasoning processes are required.
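To make the knockout tournament from the construction pipeline concrete, here is a minimal sketch of how pairwise correctness judgments could select one response out of N candidates while recording every match. It is an illustration under stated assumptions, not the authors' implementation: the `Judge` signature, the `knockout` and `toy_judge` names, the bye for an odd candidate, and the random tie-break are all hypothetical choices. A real judge would be a model call (for example a trained pairwise reward model) rather than the toy function used here.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical judge interface: given a question and two responses, return the
# correctness labels (c1, c2) for the first and second response.
Judge = Callable[[str, str, str], Tuple[bool, bool]]
Match = Tuple[str, str, bool, bool]


def knockout(
    question: str,
    candidates: List[str],
    judge: Judge,
    rng: Optional[random.Random] = None,
) -> Tuple[str, List[Match]]:
    """Run a single-elimination tournament and return (winner, recorded matches)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    rng = rng or random.Random(0)
    pool = list(candidates)
    matches: List[Match] = []
    while len(pool) > 1:
        rng.shuffle(pool)
        next_round: List[str] = []
        if len(pool) % 2 == 1:
            # Assumption: with an odd pool, one candidate advances without a match.
            next_round.append(pool.pop())
        for a, b in zip(pool[0::2], pool[1::2]):
            c1, c2 = judge(question, a, b)
            matches.append((a, b, c1, c2))
            if c1 and not c2:
                winner = a
            elif c2 and not c1:
                winner = b
            else:
                # Assumption: ties (both correct or both incorrect) are broken at random.
                winner = rng.choice([a, b])
            next_round.append(winner)
        pool = next_round
    return pool[0], matches


def toy_judge(question: str, a: str, b: str) -> Tuple[bool, bool]:
    """Stand-in judge for the demo: a response is 'correct' if it equals '4'."""
    return a == "4", b == "4"


winner, log = knockout("What is 2 + 2?", ["3", "4", "5", "4", "22"], toy_judge)
print(winner, len(log))
```

Each match eliminates exactly one candidate, so N candidates require N - 1 judgments. With a judge that never mislabels, a correct response cannot lose to an incorrect one, so a correct candidate survives whenever at least one exists.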
### Example Code

```python
from datasets import load_dataset

# Load the PAIRJUDGE-432K dataset from Hugging Face Datasets
dataset = load_dataset("THU-KEG/PairJudge-432K")

# View an example
print(dataset["train"][0])
```

## Citation

```
@article{liu2025PairJudge,
  title={PairJudge RM: Perform Best-of-N Sampling with Knockout Tournament},
  author={Liu, Yantao and Yao, Zijun and Min, Rui and Cao, Yixin and Hou, Lei and Li, Juanzi},
  journal={arXiv preprint arXiv:2501.13007},
  year={2025},
  note={in progress work},
  url={https://doi.org/10.48550/arXiv.2501.13007}
}
```

Provider: maas
Created: 2025-07-11