multipref

Name: multipref
Creator: maas
Published: 2025-11-27 16:35:24
License: No description available

ModelScope Community. Updated: 2025-11-27. Indexed: 2025-05-31.

Download link:

https://modelscope.cn/datasets/allenai/multipref


Resource description:

# MultiPref - a multi-annotated and multi-aspect human preference dataset

- **Paper link:** [Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback](https://arxiv.org/abs/2410.19133)

## Dataset Summary

The MultiPref dataset (version 1.0) is a rich collection of 10k human preferences. It is:

- **Multi-annotated**: each instance is annotated multiple times (twice by normal crowdworkers and twice by domain experts), resulting in around 40k annotations.
- **Multi-aspect**: aside from their Overall preference, annotators choose their preferred response on a five-point Likert scale across three aspects: Helpfulness, Truthfulness, and Harmlessness. In addition, annotators also indicate the reason why they think one response is better than the other.

MultiPref can be used to train reward models for preference tuning or to analyze differences between different types of annotators.

## Configuration and Splits

The `default` configuration contains all the fine-grained annotations for MultiPref. Here's a note on some of the other subsets:

- `human_overall_binarized`: takes the majority `overall_pref` of all human annotators. If there's a tie, it chooses a random response. We also add a column named `tie_is_common` (`bool`) so you can filter out instances where we chose a random response.
- `gpt4_overall_binarized`: the `overall_pref` of GPT-4 (`gpt-4-turbo-2024-04-09`), obtained using the same annotation guidelines we used for human annotations.
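
The majority-vote binarization described for `human_overall_binarized` can be sketched roughly as follows. This is an illustrative reconstruction, not the dataset authors' code: the helper name `binarize_overall`, the choice to count "slightly" and "clearly" votes equally, and the choice to treat `Tie` votes as abstentions are all assumptions. The returned flag only marks that a random pick was needed; it is not claimed to match the semantics of the `tie_is_common` column.

```python
import random
from collections import Counter


def binarize_overall(annotations, rng=None):
    """Return (winner, was_random) for one comparison.

    `annotations` is a list of annotation dicts, each with an
    'overall_pref' label such as 'A-is-clearly-better' or 'Tie'.
    Hypothetical helper: the real pipeline may differ.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible

    votes = Counter()
    for ann in annotations:
        pref = ann["overall_pref"]
        if pref.startswith("A-"):
            votes["A"] += 1
        elif pref.startswith("B-"):
            votes["B"] += 1
        # Assumption: a 'Tie' vote counts for neither response.

    if votes["A"] != votes["B"]:
        winner = "A" if votes["A"] > votes["B"] else "B"
        return winner, False

    # No majority: fall back to a random response and flag it.
    return rng.choice(["A", "B"]), True


example = [
    {"overall_pref": "B-is-clearly-better"},
    {"overall_pref": "Tie"},
    {"overall_pref": "B-is-slightly-better"},
    {"overall_pref": "A-is-slightly-better"},
]
winner, was_random = binarize_overall(example)
```

Returning the flag alongside the winner makes it easy to drop randomly resolved comparisons before training a reward model.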

## Dataset Structure

### Data Instances

An example of a MultiPref instance (`original` configuration) is as follows:

```
{
  'comparison_id': 'd0be61f1d511417f96d22ed5d15a3a16',
  'prompt_id': '00d2d503f8134dba83c5d1215a4594cf',
  'text': 'who would win, a college championship level mens basketball team or a WNBA team?',
  'model_a': 'allenai/tulu-2-70b',
  'completion_a': "It's difficult to say definitively who would win in a hypothetical matchup between a college championship-level men's basketball team and a WNBA team, as it would depend on a variety of factors, including the specific teams involved and the style of play. However, in general, the level of competition in the WNBA is very high and the players are highly skilled, so it's likely that a WNBA team would be able to compete with a college championship-level men's team.",
  'model_b': 'gpt-4-turbo-2024-04-09',
  'completion_b': "Deciding who would win between a college championship-level men's basketball team and a WNBA team is speculative and depends on numerous factors, such as team chemistry, playing styles, coaching strategies, and individual player skills. However, we can consider some key aspects:\n\n1. **Physical Differences**: Generally, men's teams tend to have a physical advantage in terms of size, strength, and athleticism, due to inherent biological differences. This can influence the game significantly, especially in terms of speed, jumping ability, and physicality.\n\n2. **Skill and Experience**: WNBA players are professional athletes who have usually played at the college level themselves and have continued to refine their skills. They often have superior technical skills, tactical awareness, and experience compared to college players. WNBA players also usually have experience playing both nationally and internationally, which can contribute to a higher level of overall gameplay.\n\n3. **Coaching and Strategy**: Both teams would likely be well-coached, but WNBA coaches are managing athletes who are at the professional level and often have more experience in integrating complex strategies and adjustments during the game.\n\nGiven these factors, the physical advantages of a men's college team might be countered by the superior experience and skill level of a WNBA team. The outcome could vary depending on the specific teams involved and the conditions of the matchup (e.g., rules, refereeing style, etc.).\n\nIn essence, while a men's college team might exhibit greater physical prowess, a WNBA team's professional experience, strategic execution, and technical skills could very well tilt the scale in their favor in many scenarios. Ultimately, each game would likely be highly competitive and the winner could vary from one matchup to another.",
  'source': 'lmsys/chatbot_arena_conversations',
  'category': 'Open QA',
  'subject_study': 'Sports',
  'highest_level_degree': 'Social Sciences',
  'normal_worker_annotations': [...],
  'expert_worker_annotations': [...]
}
```

An annotation dictionary (as seen inside `normal_worker_annotations`) looks like the following:

```
{
  'helpful_pref': 'B-is-clearly-better',
  'helpful_checked_reasons': ['informative'],
  'helpful_own_reason': '',
  'helpful_confidence': 'absolutely-confident',
  'truthful_pref': 'Tie',
  'truthful_checked_reasons': [],
  'truthful_own_reason': '',
  'truthful_confidence': 'absolutely-confident',
  'harmless_pref': 'Tie',
  'harmless_checked_reasons': [],
  'harmless_own_reason': '',
  'harmless_confidence': 'absolutely-confident',
  'overall_pref': 'Tie',
  'overall_confidence': 'absolutely-confident',
  'evaluator': 'jovial_shirley',
  'time_spent': 283,
  'timestamp': '2024-05-21 19:18:25.735159'
}
```

### Data Fields

Each instance of MultiPref contains the following fields:

- **comparison_id** (`str`): the unique ID of the comparison. Some prompts are duplicated but with different responses. Can be used as the primary key.
- **prompt_id** (`str`): the unique ID of the prompt. Similar prompts will have the same ID.
- **text** (`str`): the text prompt.
- **completion_a** (`str`): the first model-generated response.
- **model_a** (`str`): the model used to generate `completion_a`.
- **completion_b** (`str`): the second model-generated response.
- **model_b** (`str`): the model used to generate `completion_b`.
- **source** (`str`): the dataset where the prompt was sourced from.
- **category** (`str`): the task category for the prompt. Follows [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)'s taxonomy.
- **subject_study** (`str`): the subject expertise required to answer the question.
- **highest_level_degree** (`str`): the degree subject required for an annotator to answer the question. We used this to route a specific prompt to the relevant domain expert.
- **normal_worker_annotations** (`list[dict[str, Any]]`): annotations of two different normal crowdworkers.
- **expert_worker_annotations** (`list[dict[str, Any]]`): annotations of two different expert crowdworkers.

Each annotation (`dict[str, Any]`) contains the following fields:

- **{helpful,truthful,harmless}_pref** (`str`): the annotator's preference for each aspect on a 5-point Likert scale: A-is-clearly-better, A-is-slightly-better, Tie, B-is-slightly-better, B-is-clearly-better.
- **{helpful,truthful,harmless}_checked_reasons** (`list[str]`): the reasons why an annotator chose one response over the other.
- **{helpful,truthful,harmless}_own_reason** (`str`): a freeform reason why the annotator chose a particular response.
- **{helpful,truthful,harmless}_confidence** (`str`): the confidence rating of the annotator.
- **overall_pref** (`str`): the overall preference after taking each aspect into consideration.
- **overall_confidence** (`str`): the overall confidence rating of the annotator.
- **evaluator** (`str`): unique annotator ID. We anonymized the actual ID using a [Docker-style names generator](https://github.com/shamrin/namesgenerator).
- **time_spent** (`int`): time spent annotating (in seconds).
- **timestamp** (`time`): the recorded timestamp when the annotator submitted their annotation.

## Dataset Creation

![image/png](https://cdn-uploads.huggingface.co/production/uploads/634e20a0c1ce28f1de920cc4/iAMo5k4B8VxpbYySLBubj.png)

### Source Data

The prompts were sourced from the following datasets:

| Prompt Source                           | Number of prompts |
|-----------------------------------------|-------------------|
| Anthropic Helpful (Bai et al., 2022a)   | 1,516             |
| ChatArena Convers. (Zheng et al., 2023) | 1,100             |
| ShareGPT (Chiang et al., 2023)          | 1,031             |
| Anthropic Harmless (Bai et al., 2022a)  | 856               |
| WildChat (Zhao et al., 2024)            | 820               |

For each prompt, we generate two responses from six different models: Tülu 2 7B and 70B (Wang et al., 2023a; Ivison et al., 2023), Llama 2 and 3 70B (Touvron et al., 2023; Dubey et al., 2024), GPT-3.5 (Ouyang et al., 2022), and GPT-4 (Achiam et al., 2023). Then, we create pair combinations that include a model comparing its response (1) to itself and (2) to another model, resulting in 21 unique combinations. Finally, we randomly choose two pairs from this set and include them in our annotation mix.

## Annotations

### Who are the annotators?

We recruit normal crowdworkers from Prolific who have an approval rate of at least 99%, are fluent in English, and have completed a Bachelor's degree. Expert crowdworkers, at minimum, should have a graduate degree. Aside from credential screening, we devise a ten-item (10) qualification test based on our annotation guidelines. Participants must score at least 90% to be included in the study.

In order to route annotation instances to relevant domain experts, we first classify each prompt into one of eleven (11) highest-level academic degrees based on Prolific's categorization.
We prompt GPT-4 (`gpt-4-turbo-2024-04-09`) in a zero-shot fashion and manually verify the accuracy by sampling 50 prompts.

| Domain                                       | # Annotators | Pass Rate | # Prompts |
|----------------------------------------------|--------------|-----------|-----------|
| Administration & Law                         | 16           | 36.5%     | 341       |
| Arts & Humanities                            | 32           | 43.0%     | 1,147     |
| Education                                    | 17           | 32.0%     | 353       |
| Engineering, manufacturing, and construction | 14           | 27.0%     | 315       |
| Health and Welfare                           | 22           | 23.0%     | 768       |
| History                                      | 11           | 44.0%     | 161       |
| Information and Communication Technologies   | 24           | 24.0%     | 668       |
| Journalism & Information Business            | 10           | 33.0%     | 222       |
| Mathematics and statistics                   | 13           | 32.5%     | 278       |
| Natural sciences                             | 17           | 41.5%     | 384       |
| Social Sciences                              | 23           | 27.1%     | 686       |
| **Expert Crowdworkers (Total)**              | **199**      | **33.0%** |           |
| **Normal Crowdworkers**                      | **90**       | **36.5%** |           |

### Annotation Process

We formulate the annotation task such that annotators specify not only their general preference, but also their preference across three aspects: helpfulness, truthfulness, and harmlessness. In addition, we ask them why they preferred one response over the other, given a set of well-defined attributes. Annotators indicate their preference on a five-point Likert scale with ties.

### Personal and Sensitive Information

The dataset does not contain any personal information about the crowdworkers but might contain information regarding their annotation activity (time spent annotating, time of submission). We anonymized their annotator IDs using a [Docker-style names generator](https://github.com/shamrin/namesgenerator).

## Additional Information

### Citation Information

Please cite the following if you're using MultiPref in your work:

```bib
@article{miranda2024hybrid,
  title = {{Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback}},
  author = {Lester James V. Miranda and Yizhong Wang and Yanai Elazar and Sachin Kumar and Valentina Pyatkin and Faeze Brahman and Noah A. Smith and Hannaneh Hajishirzi and Pradeep Dasigi},
  journal = {{arXiv}},
  year = {2024},
  month = oct,
  volume = {abs/2410.19133}
}
```
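
Returning to the pairing scheme described under Source Data: the count of 21 unique combinations follows from pairing six models with repetition allowed (6 self-pairs plus 15 cross-model pairs). The sketch below only illustrates that arithmetic. The display names in `MODELS`, the helper `sample_pairs`, and the choice to sample the two pairs without replacement are assumptions, not details confirmed by the dataset card.

```python
import random
from itertools import combinations_with_replacement

# Display names are illustrative labels for the six models named in the card.
MODELS = [
    "Tülu 2 7B",
    "Tülu 2 70B",
    "Llama 2 70B",
    "Llama 3 70B",
    "GPT-3.5",
    "GPT-4",
]

# Unordered pairs with repetition: a model vs. itself or vs. another model.
PAIRS = list(combinations_with_replacement(MODELS, 2))

self_pairs = [p for p in PAIRS if p[0] == p[1]]
cross_pairs = [p for p in PAIRS if p[0] != p[1]]


def sample_pairs(rng, k=2):
    """Pick k distinct model pairs for one prompt (hypothetical helper)."""
    return rng.sample(PAIRS, k)


chosen = sample_pairs(random.Random(0))
print(len(PAIRS), len(self_pairs), len(cross_pairs))  # 21 6 15
```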


Provider:

maas

Created:

2025-05-28
