lmarena-ai/PPE-MBPP-Plus-Best-of-K

Name: lmarena-ai/PPE-MBPP-Plus-Best-of-K
Creator: lmarena-ai
Published: 2024-10-22 08:00:06
License: No description available

Hugging Face · Updated 2024-10-22 · Indexed 2025-04-12

Download link:

https://hf-mirror.com/datasets/lmarena-ai/PPE-MBPP-Plus-Best-of-K

Resource overview:

---
dataset_info:
  features:
  - name: question_id
    dtype: string
  - name: task_id
    dtype: int64
  - name: code
    dtype: string
  - name: prompt
    dtype: string
  - name: source_file
    dtype: string
  - name: test_imports
    sequence: string
  - name: test_list
    sequence: string
  - name: test
    dtype: string
  - name: model_name
    dtype: string
  - name: scores
    sequence: bool
  - name: mean_score
    dtype: float64
  - name: response_1
    dtype: string
  - name: response_2
    dtype: string
  - name: response_3
    dtype: string
  - name: response_4
    dtype: string
  - name: response_5
    dtype: string
  - name: response_6
    dtype: string
  - name: response_7
    dtype: string
  - name: response_8
    dtype: string
  - name: response_9
    dtype: string
  - name: response_10
    dtype: string
  - name: response_11
    dtype: string
  - name: response_12
    dtype: string
  - name: response_13
    dtype: string
  - name: response_14
    dtype: string
  - name: response_15
    dtype: string
  - name: response_16
    dtype: string
  - name: response_17
    dtype: string
  - name: response_18
    dtype: string
  - name: response_19
    dtype: string
  - name: response_20
    dtype: string
  - name: response_21
    dtype: string
  - name: response_22
    dtype: string
  - name: response_23
    dtype: string
  - name: response_24
    dtype: string
  - name: response_25
    dtype: string
  - name: response_26
    dtype: string
  - name: response_27
    dtype: string
  - name: response_28
    dtype: string
  - name: response_29
    dtype: string
  - name: response_30
    dtype: string
  - name: response_31
    dtype: string
  - name: response_32
    dtype: string
  - name: conflict_pairs
    sequence:
      sequence: int64
  - name: sampled_conflict_pairs
    sequence:
      sequence: int64
  splits:
  - name: train
    num_bytes: 29926103
    num_examples: 507
  download_size: 10040281
  dataset_size: 29926103
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Overview

This contains the MBPP-Plus correctness preference evaluation set for Preference Proxy Evaluations. The prompts are sampled from [MBPP-Plus](https://huggingface.co/datasets/evalplus/mbppplus).

This dataset is meant for benchmarking and evaluation, not for training.

[Paper](https://arxiv.org/abs/2410.14872)

[Code](https://github.com/lmarena/PPE)

# License

User prompts are licensed under Apache-2.0, and model outputs are governed by the terms of use set by the respective model providers.

# Citation

```
@misc{frick2024evaluaterewardmodelsrlhf,
      title={How to Evaluate Reward Models for RLHF},
      author={Evan Frick and Tianle Li and Connor Chen and Wei-Lin Chiang and Anastasios N. Angelopoulos and Jiantao Jiao and Banghua Zhu and Joseph E. Gonzalez and Ion Stoica},
      year={2024},
      eprint={2410.14872},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.14872},
}
```


Provider:

lmarena-ai

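Collected summary

Dataset introduction

Construction

The lmarena-ai/PPE-MBPP-Plus-Best-of-K dataset was built to provide the MBPP-Plus correctness preference evaluation set for Preference Proxy Evaluations (PPE). Its prompts are sampled from MBPP-Plus. Each record contains a question ID, task ID, code, prompt, source file, test imports, test list, test code, model name, scores (a sequence of booleans), a mean score, and 32 response fields (`response_1` through `response_32`), together with conflict pairs and sampled conflict pairs.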

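Features

The dataset is designed for evaluation and is not intended for training. Its prompts are drawn from MBPP-Plus and are used to evaluate how well a preference proxy judges code correctness. Beyond the responses and their scores, each record also carries conflict pairs (`conflict_pairs`) and sampled conflict pairs (`sampled_conflict_pairs`), which give researchers a pairwise view in addition to the per-response view.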
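One way the conflict pairs could be used is to measure pairwise accuracy: how often a preference proxy gives the correct response a higher reward than the incorrect one. The sketch below is illustrative only. It assumes each entry of `conflict_pairs` holds two zero-based indices into the response list whose `scores` values differ; the dataset card does not document this layout, so check it against real rows first. The function name `pairwise_accuracy` and the `rewards` values are hypothetical.

```python
from typing import Sequence


def pairwise_accuracy(
    scores: Sequence[bool],
    rewards: Sequence[float],
    conflict_pairs: Sequence[Sequence[int]],
) -> float:
    """Fraction of conflict pairs where the correct response gets the higher reward.

    Assumption (not confirmed by the dataset card): each pair is two
    zero-based response indices with different correctness.
    """
    hits = 0
    total = 0
    for i, j in conflict_pairs:
        if scores[i] == scores[j]:
            continue  # not a conflict under the assumption above; skip it
        correct, wrong = (i, j) if scores[i] else (j, i)
        total += 1
        if rewards[correct] > rewards[wrong]:
            hits += 1
    return hits / total if total else float("nan")


# Synthetic values for illustration; real rows have 32 responses.
scores = [True, False, True, False]
rewards = [0.9, 0.2, 0.1, 0.4]  # hypothetical reward-model outputs
pairs = [[0, 1], [2, 3], [0, 3]]
accuracy = pairwise_accuracy(scores, rewards, pairs)
print(accuracy)
```

Ties in reward count as misses here; a different tie rule is an equally reasonable choice.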

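Usage

The lmarena-ai/PPE-MBPP-Plus-Best-of-K dataset can be downloaded and loaded through the Hugging Face platform. User prompts are licensed under Apache-2.0, while model outputs are governed by the terms of use of the respective model providers. Using the fields in the dataset, users can evaluate and analyze model performance, and use the results to inform improvements to a model's preference proxy ability.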
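A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the Hub is reachable. The helper names `load_ppe_mbpp_plus` and `collect_responses` are hypothetical; the column names come from the dataset card. The example row is synthetic so the helper can be tried without downloading anything.

```python
from typing import Any, Dict, List

NUM_RESPONSES = 32  # response_1 ... response_32, per the dataset card schema


def load_ppe_mbpp_plus(split: str = "train"):
    """Load the dataset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the helper below works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("lmarena-ai/PPE-MBPP-Plus-Best-of-K", split=split)


def collect_responses(row: Dict[str, Any]) -> List[str]:
    """Gather response_1..response_32 from one row into an ordered list."""
    return [row[f"response_{i}"] for i in range(1, NUM_RESPONSES + 1)]


# Synthetic row standing in for a real record (illustrative values only).
example_row = {
    f"response_{i}": f"def f(): return {i}" for i in range(1, NUM_RESPONSES + 1)
}
responses = collect_responses(example_row)
print(len(responses))
```

With network access, `collect_responses(load_ppe_mbpp_plus()[0])` should return the 32 responses of the first record in the same order.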

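Background and challenges

Background overview

The lmarena-ai/PPE-MBPP-Plus-Best-of-K dataset belongs to the area of preference proxy evaluation in machine learning. It was created in 2024 by Evan Frick et al., builds on the MBPP-Plus dataset, and is intended as a benchmark and evaluation tool for Preference Proxy Evaluations. Its core research question is how to accurately evaluate reward models used in reinforcement learning from human feedback (RLHF); the results are relevant to understanding how agents make decisions during learning.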

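Current challenges

The main challenges in building the dataset are: first, ensuring that the samples selected from MBPP-Plus adequately represent a range of preference situations; second, judging preferences over model outputs objectively and fairly, without introducing subjective bias; and third, providing enough scale and diversity to assess how well a model generalizes. For the domain problem it addresses, the dataset must make it possible to distinguish and quantify performance differences between models when they handle complex preferences.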

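Common scenarios

Classic use cases

In natural language processing, the lmarena-ai/PPE-MBPP-Plus-Best-of-K dataset is used to evaluate and compare preference proxy models. By providing prompts and corresponding responses that target preferences over code correctness, it serves as a resource for benchmarking and for evaluating model performance.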
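A best-of-K check is one evaluation the dataset name suggests: let a preference proxy pick its top-scoring response among K candidates and see whether that pick is correct. The sketch below is a hedged illustration rather than the PPE authors' implementation. It assumes `scores[i]` records whether `response_{i+1}` passed the tests (the card does not state this explicitly) and that a reward model supplies one scalar per response. The function name `best_of_k_correct` and all values are hypothetical.

```python
from typing import Sequence


def best_of_k_correct(
    scores: Sequence[bool], rewards: Sequence[float], k: int
) -> bool:
    """Return True if the highest-reward response among the first k is correct.

    Assumption: scores[i] is the correctness of the i-th response, aligned
    with rewards[i].
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of responses")
    # Index of the response the reward model prefers among the first k.
    best = max(range(k), key=lambda i: rewards[i])
    return bool(scores[best])


# Synthetic values for illustration; real rows have 32 responses.
scores = [False, True, False, True]
rewards = [0.3, 0.8, 0.9, 0.1]  # hypothetical reward-model outputs
picked_at_2 = best_of_k_correct(scores, rewards, k=2)
picked_at_4 = best_of_k_correct(scores, rewards, k=4)
print(picked_at_2, picked_at_4)
```

Averaging this boolean over all records and comparing it with the average `mean_score` would show whether the proxy beats picking a response at random, again under the alignment assumption above.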

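Academic problems addressed

The dataset addresses the problem of accurately evaluating reward models used in reinforcement learning. By giving researchers conflict pairs together with the corresponding correctness information, it lets them assess more objectively how accurately a model's preferences track correctness on code, which supports further research in this area.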

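Derived related work

The release of this dataset has supported research in related areas such as preference learning, code quality assessment, and the use of reinforcement learning models in software development. Subsequent studies have built on it to explore more complex model architectures and training strategies aimed at improving performance on code preference evaluation tasks.

The content above was collected and summarized by 遇见数据集.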
