
href_preference

ModelScope community · updated 2025-12-05 · indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/allenai/href_preference
Description:
# HREF: Human Reference-Guided Evaluation of Instruction Following in Language Models

<div align="left">

📑 [Paper](https://arxiv.org/abs/2412.15524) | 🤗 [Leaderboard](https://huggingface.co/spaces/allenai/href) | 📁 [Codebase](https://github.com/allenai/href)

</div>

HREF is an evaluation benchmark that measures language models' capacity to follow human instructions. This dataset contains the **human agreement set** of HREF: 1,752 pairs of language model outputs, along with preference data from 4 human annotators for each model pair. The dataset contains 438 human-written instruction and response pairs, drawn from a mix of the train and test splits of [No Robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots) and covering 8 categories (Coding and Chat are removed). We use this dataset to evaluate the automatic evaluation methods in our paper.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64dff1ddb5cc372803af964d/dSv3U11h936t_q-aiqbkV.png)

## Data Fields

- `category`: A category label of the instruction, following the instruction-tuning terminology. Full list: Brainstorm, Open QA, Closed QA, Extract, Generation, Rewrite, Summarize, Classify.
- `instruction`: A text written by human experts to be used as an input to a language model.
- `generator_a`: The name of the first language model whose response is to be compared.
- `output_a`: The response generated by `generator_a`.
- `generator_b`: The name of the second language model whose response is to be compared.
- `output_b`: The response generated by `generator_b`.
- `annotations`: A list of preference annotations by our human annotators. `1` means that `output_a` is preferred, `2` means that `output_b` is preferred, and `0` means a tie.
- `annotators`: A list of unique ids of the human annotators who created the corresponding `annotations`.
- `reference`: A response to the `instruction`, written by the same human expert who wrote the `instruction`.

## Model Pool

To ensure the diversity of the responses, we build a model pool of 32 LLMs, ranging in size from 7B to over 100B parameters and spanning more than 10 model families, whose responses are compared against those of the baseline model `Llama-3.1-405B-Instruct-FP8`. To ensure the quality of the responses and avoid looping repetitions in generation, we use a decoding temperature of 1.0 for all models.

## Annotation Collection

We hire annotators from [Prolific](https://www.prolific.com/) who are native English speakers from the U.S., the U.K., or Canada and who hold a bachelor's degree or above. We also require annotators to have an approval rate over 99% in the studies they have previously participated in. We launch a qualification study in which a participant must correctly annotate at least 9 out of 10 straightforwardly distinguishable model response pairs to pass. We assign the qualification task to 50 participants and recruit 16 of them as our final group of annotators. We set the pay at $16 per hour.

## Why HREF

| Benchmark | Size | Evaluation Method | Baseline Model | Judge Model | Task Oriented | Contamination Resistant | Contains Human Reference |
|-----------|------|-------------------|----------------|-------------|---------------|-------------------------|--------------------------|
| MT-Bench | 80 | Score | --- | gpt4 | ✓ | ✗ | ✗ |
| AlpacaEval 2.0 | 805 | PWC | gpt4-turbo | gpt4-turbo | ✗ | ✗ | ✗ |
| Chatbot Arena | --- | PWC | --- | Human | ✗ | ✓ | ✗ |
| Arena-Hard | 500 | PWC | gpt4-0314 | gpt4-turbo | ✗ | ✗ | ✗ |
| WildBench | 1,024 | Score/PWC | gpt4-turbo | three models | ✗ | ✗ | ✗ |
| **HREF** | 4,258 | PWC | Llama-3.1-405B-Instruct-FP8 | Llama-3.1-70B-Instruct | ✓ | ✓ | ✓ |

PWC stands for pairwise comparison.

- **Human Reference**: HREF uses human-written answers as references to provide more reliable evaluation than previous methods.
- **Large**: HREF has the largest evaluation set among similar benchmarks, making its evaluation more reliable.
- **Contamination-resistant**: HREF's evaluation set is hidden, and it uses public models for both the baseline model and the judge model, which makes it completely free of contamination.
- **Task Oriented**: Instead of instructions collected naturally from users, HREF contains instructions written specifically to target 8 distinct categories used in instruction tuning, which allows it to provide more insight into how to improve language models.

## Usage

```python
from datasets import load_dataset

# Load the human agreement set (single "train" split).
href_data = load_dataset("allenai/href_human_agreement", split="train")
```

## Citation

```
@article{lyu2024href,
  title={HREF: Human Response-Guided Evaluation of Instruction Following in Language Models},
  author={Xinxi Lyu and Yizhong Wang and Hannaneh Hajishirzi and Pradeep Dasigi},
  journal={arXiv preprint arXiv:2412.15524},
  year={2024}
}
```
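As an illustration of how the `annotations` field described under Data Fields might be consumed, the sketch below collapses the four human labels of one record into a single preference and measures how often a judge's label matches the annotators. The helper names `majority_preference` and `human_agreement_rate` are hypothetical, and the tie-breaking rule (falling back to `0` when two labels share the top count) is an assumption made for this example; it is not necessarily the agreement metric used in the paper.

```python
from collections import Counter


def majority_preference(annotations):
    """Return the most common label (0 = tie, 1 = output_a, 2 = output_b).

    Assumption for this sketch: if two labels share the highest count,
    the record is treated as a tie (0).
    """
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return 0
    return counts[0][0]


def human_agreement_rate(judge_label, annotations):
    """Fraction of human annotators whose label equals the judge's label."""
    return sum(a == judge_label for a in annotations) / len(annotations)


# A hand-written record in the shape of the `annotations` field.
example = {"annotations": [1, 1, 2, 0]}
print(majority_preference(example["annotations"]))      # 1
print(human_agreement_rate(1, example["annotations"]))  # 0.5
```

Both helpers work on plain lists, so they can be applied to each row of the loaded dataset without extra dependencies. They assume a non-empty `annotations` list.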

Provider:
maas
Created:
2025-05-28