lmarena-ai/PPE-MATH-Best-of-K

Name: lmarena-ai/PPE-MATH-Best-of-K
Creator: lmarena-ai
Published: 2024-10-22 07:59:58
License: No description provided

Hugging Face | Updated 2024-10-22 | Indexed 2025-04-12

Download link:

https://hf-mirror.com/datasets/lmarena-ai/PPE-MATH-Best-of-K

Resource description:

---
dataset_info:
  features:
  - name: question_id
    dtype: string
  - name: problem
    dtype: string
  - name: level
    dtype: string
  - name: type
    dtype: string
  - name: solution
    dtype: string
  - name: sanitized_solution
    dtype: string
  - name: model_name
    dtype: string
  - name: prompt
    dtype: string
  - name: scores
    sequence: bool
  - name: parsed_outputs
    sequence: string
  - name: mean_score
    dtype: float64
  - name: response_1
    dtype: string
  - name: response_2
    dtype: string
  - name: response_3
    dtype: string
  - name: response_4
    dtype: string
  - name: response_5
    dtype: string
  - name: response_6
    dtype: string
  - name: response_7
    dtype: string
  - name: response_8
    dtype: string
  - name: response_9
    dtype: string
  - name: response_10
    dtype: string
  - name: response_11
    dtype: string
  - name: response_12
    dtype: string
  - name: response_13
    dtype: string
  - name: response_14
    dtype: string
  - name: response_15
    dtype: string
  - name: response_16
    dtype: string
  - name: response_17
    dtype: string
  - name: response_18
    dtype: string
  - name: response_19
    dtype: string
  - name: response_20
    dtype: string
  - name: response_21
    dtype: string
  - name: response_22
    dtype: string
  - name: response_23
    dtype: string
  - name: response_24
    dtype: string
  - name: response_25
    dtype: string
  - name: response_26
    dtype: string
  - name: response_27
    dtype: string
  - name: response_28
    dtype: string
  - name: response_29
    dtype: string
  - name: response_30
    dtype: string
  - name: response_31
    dtype: string
  - name: response_32
    dtype: string
  - name: conflict_pairs
    sequence:
      sequence: int64
  - name: sampled_conflict_pairs
    sequence:
      sequence: int64
  splits:
  - name: train
    num_bytes: 28121544
    num_examples: 512
  download_size: 12452688
  dataset_size: 28121544
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Overview

This contains the MATH correctness preference evaluation set for Preference Proxy Evaluations. The prompts are sampled from [MATH](https://huggingface.co/datasets/hendrycks/competition_math).

This dataset is meant for benchmarking and evaluation, not for training.

[Paper](https://arxiv.org/abs/2410.14872)

[Code](https://github.com/lmarena/PPE)

# License

User prompts are licensed under MIT, and model outputs are governed by the terms of use set by the respective model providers.

# Citation

```
@misc{frick2024evaluaterewardmodelsrlhf,
  title={How to Evaluate Reward Models for RLHF},
  author={Evan Frick and Tianle Li and Connor Chen and Wei-Lin Chiang and Anastasios N. Angelopoulos and Jiantao Jiao and Banghua Zhu and Joseph E. Gonzalez and Ion Stoica},
  year={2024},
  eprint={2410.14872},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2410.14872},
}
```
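
The schema stores the 32 candidate responses in separate columns (`response_1` through `response_32`), which is awkward to iterate over. A minimal sketch of gathering them into one list follows. It runs on a synthetic row with made-up values, and it assumes, without confirmation from the card, that `scores[i]` refers to `response_{i+1}`. The helper name `collect_responses` is hypothetical.

```python
from typing import Any, Dict, List

NUM_RESPONSES = 32  # response_1 .. response_32, per the schema in the card


def collect_responses(row: Dict[str, Any]) -> List[str]:
    """Gather the numbered response columns of one row into an ordered list."""
    return [row[f"response_{i}"] for i in range(1, NUM_RESPONSES + 1)]


# Synthetic stand-in for one dataset row; the values are invented for illustration.
example_row: Dict[str, Any] = {
    f"response_{i}": f"answer text {i}" for i in range(1, NUM_RESPONSES + 1)
}
# Assumption: scores[i] is the correctness flag of response_{i+1}.
example_row["scores"] = [i % 2 == 0 for i in range(NUM_RESPONSES)]

responses = collect_responses(example_row)
print(len(responses))  # 32
```

With the Hugging Face `datasets` library installed, real rows could presumably be obtained with `load_dataset("lmarena-ai/PPE-MATH-Best-of-K", split="train")` and passed to the same helper.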


Provided by:

lmarena-ai

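Summary

Dataset Introduction

Construction

The lmarena-ai/PPE-MATH-Best-of-K dataset is built on the MATH competition problem set and serves as the correctness preference evaluation set for Preference Proxy Evaluations (PPE). It pairs problems sampled from MATH with multiple model responses per problem, forming a collection intended for evaluation and benchmarking rather than training.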

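Features

The dataset contains prompts sampled from MATH and, for each row, 32 responses (response_1 through response_32) alongside a model_name field. In addition to the original problem statement and reference solution, each row carries per-response correctness flags (scores), the parsed answers (parsed_outputs), the mean of the correctness flags (mean_score), and two lists of index pairs (conflict_pairs and sampled_conflict_pairs). Together these give ample information for evaluating correctness preferences.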
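
To make the relationship between these fields concrete, here is a small sketch of how a mean score and a set of conflict pairs could be derived from a list of correctness flags. The definition of a conflict pair used here (two responses whose correctness flags differ) is an assumption, not something the dataset card states, and both function names are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def mean_score(scores: List[bool]) -> float:
    """Fraction of responses flagged as correct."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(scores) / len(scores)


def derive_conflict_pairs(scores: List[bool]) -> List[Tuple[int, int]]:
    """Index pairs (i, j), i < j, whose correctness flags differ.

    Assumed definition: the card does not spell out how conflict_pairs is built.
    """
    return [
        (i, j)
        for i, j in combinations(range(len(scores)), 2)
        if scores[i] != scores[j]
    ]


scores = [True, False, True, False]
print(mean_score(scores))             # 0.5
print(derive_conflict_pairs(scores))  # [(0, 1), (0, 3), (1, 2), (2, 3)]
```

Under this reading, sampled_conflict_pairs would be a subsample of such pairs, but that too is a guess based on the field name.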

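Usage

Researchers can use the problems, reference solutions, and model responses in the dataset to evaluate preference proxies such as reward models. The data ships as a single split named train (512 examples), but it is intended for evaluation rather than training. Users must comply with the applicable license terms and cite the source when using the dataset.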
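
One way such an evaluation could look is pairwise accuracy: for each pair consisting of one correct and one incorrect response, check whether the preference proxy scores the correct one higher. The sketch below is an illustration under assumptions, not the PPE authors' implementation. The `rewards` list stands in for the output of some reward model (it is not a dataset column), and the pairs are assumed to hold 0-based indices into `scores`.

```python
from typing import List, Sequence, Tuple


def pairwise_accuracy(
    rewards: Sequence[float],
    scores: Sequence[bool],
    pairs: Sequence[Tuple[int, int]],
) -> float:
    """Share of correct/incorrect pairs where the correct response gets the higher reward."""
    hits = 0
    total = 0
    for i, j in pairs:
        if scores[i] == scores[j]:
            continue  # both correct or both incorrect: no preference to test
        total += 1
        better, worse = (i, j) if scores[i] else (j, i)
        if rewards[better] > rewards[worse]:
            hits += 1
    if total == 0:
        raise ValueError("no correct/incorrect pairs to evaluate")
    return hits / total


# Invented values for illustration only.
scores = [True, False, True, False]
rewards = [0.9, 0.2, 0.1, 0.4]  # hypothetical reward-model outputs
pairs: List[Tuple[int, int]] = [(0, 1), (0, 3), (1, 2), (2, 3)]
print(pairwise_accuracy(rewards, scores, pairs))  # 0.5
```

Ties in reward count as misses in this sketch; a real evaluation might treat them differently.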

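Background and Challenges

Background Overview

In machine learning, and in natural language processing in particular, accurately assessing a model's ability to solve mathematical problems is essential. The lmarena-ai/PPE-MATH-Best-of-K dataset provides a benchmark for this purpose. It was created in 2024 by Evan Frick and colleagues, whose core research question is how to objectively evaluate reward models used in reinforcement learning from human feedback (RLHF). Its prompts are sampled from the MATH dataset, and each record includes the problem statement, difficulty level, problem type, reference solution, and correctness scores. The release supports research in this area and serves as a tool for evaluating how well language models solve mathematical problems.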

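Current Challenges

Building the lmarena-ai/PPE-MATH-Best-of-K dataset involves several challenges. First, the math problems must span a broad range of difficulty levels and types so that the set suits the evaluation of different models. Second, the correctness labels must be highly accurate for the evaluation results to be reliable. Third, the construction must balance the diversity of model responses against their correctness. As for the problem domain itself, the dataset aims to evaluate model ability on math problems, where the challenges include accurately capturing performance differences across difficulty levels and problem types, and using the evaluation results effectively to guide model improvement.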

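Common Scenarios

Typical Use Cases

The lmarena-ai/PPE-MATH-Best-of-K dataset is used mainly to evaluate and benchmark preference proxy models, measuring their performance through correctness preferences on math problems. A typical scenario arises during model development: given a math problem and its multiple candidate responses, a reward model is tested on whether it identifies and selects a correct response. Because the dataset is meant for evaluation rather than training, it measures this selection ability instead of teaching it.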
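
The selection scenario described above can be sketched as a best-of-K check: among the first K responses of a row, take the one the reward model scores highest and see whether it is flagged correct. This is an illustrative sketch, not the benchmark's official metric. The `rewards` key is hypothetical (the dataset has no such column), and the rows below are invented.

```python
from typing import Dict, List, Sequence


def best_of_k_correct(rewards: Sequence[float], scores: Sequence[bool], k: int) -> bool:
    """True if the highest-reward response among the first k is flagged correct."""
    if not 1 <= k <= len(rewards):
        raise ValueError("k must be between 1 and the number of responses")
    top = max(range(k), key=lambda i: rewards[i])  # first index wins on ties
    return bool(scores[top])


def best_of_k_accuracy(rows: List[Dict[str, list]], k: int) -> float:
    """Fraction of rows where best-of-k selection lands on a correct response."""
    if not rows:
        raise ValueError("rows must not be empty")
    hits = sum(best_of_k_correct(r["rewards"], r["scores"], k) for r in rows)
    return hits / len(rows)


# Invented rows; "rewards" holds hypothetical reward-model outputs.
rows = [
    {"rewards": [0.9, 0.2, 0.1, 0.4], "scores": [True, False, True, False]},
    {"rewards": [0.1, 0.8, 0.3, 0.2], "scores": [True, False, False, False]},
]
print(best_of_k_accuracy(rows, k=4))  # 0.5
print(best_of_k_accuracy(rows, k=1))  # 1.0
```

In the second row the reward model prefers an incorrect response once more candidates are considered, which is the kind of failure this style of evaluation is meant to surface.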

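Derived Related Work

Research based on the lmarena-ai/PPE-MATH-Best-of-K dataset has given rise to related follow-up work. These studies address not only improving model performance but also understanding and optimizing model preferences and decision processes, further extending the reach of artificial intelligence in mathematics education.

Recent Research on the Dataset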