google/IFEval

Name: google/IFEval
Creator: google
Published: 2024-08-14 08:21:56
License: Apache 2.0

Source: Hugging Face. Updated 2024-08-14; indexed 2025-04-08.

Download link:

https://hf-mirror.com/datasets/google/IFEval

Resource overview:

---
license: apache-2.0
task_categories:
- text-generation
language:
- en
pretty_name: IFEval
---

# Dataset Card for IFEval

## Dataset Description

- **Repository:** https://github.com/google-research/google-research/tree/master/instruction_following_eval
- **Paper:** https://huggingface.co/papers/2311.07911
- **Leaderboard:** https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
- **Point of Contact:** [Le Hou](lehou@google.com)

### Dataset Summary

This dataset contains the prompts used in the [Instruction-Following Eval (IFEval) benchmark](https://arxiv.org/abs/2311.07911) for large language models. It contains around 500 "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times" which can be verified by heuristics. To load the dataset, run:

```python
from datasets import load_dataset

ifeval = load_dataset("google/IFEval")
```

### Supported Tasks and Leaderboards

The IFEval dataset is designed for evaluating chat or instruction fine-tuned language models and is one of the core benchmarks used in the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard).

### Languages

The data in IFEval are in English (BCP-47 en).

## Dataset Structure

### Data Instances

An example of the `train` split looks as follows:

```
{
    "key": 1000,
    "prompt": 'Write a 300+ word summary of the wikipedia page "https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*.',
    "instruction_id_list": [
        "punctuation:no_comma",
        "detectable_format:number_highlighted_sections",
        "length_constraints:number_words",
    ],
    "kwargs": [
        {
            "num_highlights": None,
            "relation": None,
            "num_words": None,
            "num_placeholders": None,
            "prompt_to_repeat": None,
            "num_bullets": None,
            "section_spliter": None,
            "num_sections": None,
            "capital_relation": None,
            "capital_frequency": None,
            "keywords": None,
            "num_paragraphs": None,
            "language": None,
            "let_relation": None,
            "letter": None,
            "let_frequency": None,
            "end_phrase": None,
            "forbidden_words": None,
            "keyword": None,
            "frequency": None,
            "num_sentences": None,
            "postscript_marker": None,
            "first_word": None,
            "nth_paragraph": None,
        },
        {
            "num_highlights": 3,
            "relation": None,
            "num_words": None,
            "num_placeholders": None,
            "prompt_to_repeat": None,
            "num_bullets": None,
            "section_spliter": None,
            "num_sections": None,
            "capital_relation": None,
            "capital_frequency": None,
            "keywords": None,
            "num_paragraphs": None,
            "language": None,
            "let_relation": None,
            "letter": None,
            "let_frequency": None,
            "end_phrase": None,
            "forbidden_words": None,
            "keyword": None,
            "frequency": None,
            "num_sentences": None,
            "postscript_marker": None,
            "first_word": None,
            "nth_paragraph": None,
        },
        {
            "num_highlights": None,
            "relation": "at least",
            "num_words": 300,
            "num_placeholders": None,
            "prompt_to_repeat": None,
            "num_bullets": None,
            "section_spliter": None,
            "num_sections": None,
            "capital_relation": None,
            "capital_frequency": None,
            "keywords": None,
            "num_paragraphs": None,
            "language": None,
            "let_relation": None,
            "letter": None,
            "let_frequency": None,
            "end_phrase": None,
            "forbidden_words": None,
            "keyword": None,
            "frequency": None,
            "num_sentences": None,
            "postscript_marker": None,
            "first_word": None,
            "nth_paragraph": None,
        },
    ],
}
```

### Data Fields

The data fields are as follows:

* `key`: A unique ID for the prompt.
* `prompt`: Describes the task the model should perform.
* `instruction_id_list`: An array of verifiable instructions. See Table 1 of the paper for the full set with their descriptions.
* `kwargs`: An array of arguments used to specify each verifiable instruction in `instruction_id_list`.

### Data Splits

|               | train |
|---------------|------:|
| IFEval        |   541 |

### Licensing Information

The dataset is available under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).

### Citation Information

```
@misc{zhou2023instructionfollowingevaluationlargelanguage,
      title={Instruction-Following Evaluation for Large Language Models},
      author={Jeffrey Zhou and Tianjian Lu and Swaroop Mishra and Siddhartha Brahma and Sujoy Basu and Yi Luan and Denny Zhou and Le Hou},
      year={2023},
      eprint={2311.07911},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2311.07911},
}
```

Provider:

google

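Collected Summary

Dataset Introduction

Construction

Instruction following is one of the core dimensions for judging how useful a large language model is in practice. The Google research team built the IFEval (Instruction-Following Eval) benchmark to measure it objectively through verifiable instructions. The team first identified instructions with clear, checkable criteria, such as "write in more than 400 words" or "mention the keyword of AI at least 3 times", and grouped them into categories covering length constraints, format requirements, keyword mentions, punctuation use, and more. One or more of these instructions were then embedded in each concrete prompt to form an evaluation sample, giving around 500 prompts in total. The released dataset contains 541 examples in a single split named `train`. Each example consists of a unique identifier (`key`), the prompt text (`prompt`), a list of instruction IDs (`instruction_id_list`), and a matching list of argument dictionaries (`kwargs`). Each argument dictionary records the concrete constraint for its instruction, such as a target word count or a keyword list, which is what makes the instruction quantitatively verifiable.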
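The relationship between `instruction_id_list` and `kwargs` can be sketched as follows. This is a minimal illustration, assuming the two lists are aligned by position as in the example record on the dataset card; the helper name `active_instructions` is hypothetical, and the record is abbreviated to a few keys per dictionary.

```python
from typing import Any

# One record shaped like the `train` example on the dataset card.
# The prompt and the kwargs dictionaries are abbreviated for readability.
record: dict[str, Any] = {
    "key": 1000,
    "prompt": "Write a 300+ word summary (abbreviated)",
    "instruction_id_list": [
        "punctuation:no_comma",
        "detectable_format:number_highlighted_sections",
        "length_constraints:number_words",
    ],
    "kwargs": [
        {"num_highlights": None, "relation": None, "num_words": None},
        {"num_highlights": 3, "relation": None, "num_words": None},
        {"num_highlights": None, "relation": "at least", "num_words": 300},
    ],
}


def active_instructions(record: dict[str, Any]) -> list[tuple[str, dict[str, Any]]]:
    """Pair each instruction ID with only the arguments that are actually set."""
    pairs = []
    for instruction_id, kwargs in zip(record["instruction_id_list"], record["kwargs"]):
        # Every dictionary carries the full key set; unused arguments are None.
        set_kwargs = {name: value for name, value in kwargs.items() if value is not None}
        pairs.append((instruction_id, set_kwargs))
    return pairs


for instruction_id, kwargs in active_instructions(record):
    print(instruction_id, kwargs)
```

Dropping the `None` entries leaves an empty dictionary for `punctuation:no_comma`, `{"num_highlights": 3}` for the highlight instruction, and the relation and word count for the length instruction.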

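Features

The defining feature of IFEval is that it is designed to be verifiable. Unlike evaluation methods that depend on human judgment or on a model grading itself, every instruction in IFEval can be checked automatically with heuristic rules, for example by counting words, counting keyword occurrences, or running a format-detection function over the model output. This lowers evaluation cost considerably and removes much of the subjectivity, giving an objective and reproducible basis for comparing models. The dataset also covers a range of instruction types, from basic length control to more involved format requirements such as highlighting specific sections, so it probes generation under fine-grained constraints. As one of the core benchmarks of the Open LLM Leaderboard, IFEval has become a common yardstick for tracking how the instruction-following ability of large language models evolves.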
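To make the idea of heuristic verification concrete, here is a simplified sketch of the three kinds of checks mentioned above: a length statistic, a keyword count, and a format detector. These functions are hypothetical illustrations, not the checkers from the official repository; the whitespace-based word count, the regular expressions, and the `"less than"` relation are all assumptions (the dataset card example only shows `"at least"`).

```python
import re


def _compare(actual: int, relation: str, threshold: int) -> bool:
    """Apply a comparison relation; "less than" is an assumed second option."""
    if relation == "at least":
        return actual >= threshold
    if relation == "less than":
        return actual < threshold
    raise ValueError(f"unsupported relation: {relation!r}")


def check_number_words(response: str, relation: str, num_words: int) -> bool:
    """Length statistic: count whitespace-separated tokens."""
    return _compare(len(response.split()), relation, num_words)


def check_keyword_frequency(response: str, keyword: str, relation: str, frequency: int) -> bool:
    """Keyword count: case-insensitive whole-word matches."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    occurrences = len(re.findall(pattern, response, flags=re.IGNORECASE))
    return _compare(occurrences, relation, frequency)


def check_highlighted_sections(response: str, num_highlights: int) -> bool:
    """Format detection: count non-empty *highlighted* spans within a line."""
    highlights = re.findall(r"\*[^\n*]+\*", response)
    return len(highlights) >= num_highlights


response = "*Intro* AI matters. *Body* AI helps. *End* AI wins."
print(check_number_words(response, "at least", 9))
print(check_keyword_frequency(response, "AI", "at least", 3))
print(check_highlighted_sections(response, 3))
```

Each check returns a plain boolean, which is what allows the results to be aggregated without human review.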

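Usage

In practice, researchers can load IFEval through the Hugging Face `datasets` library with a single `load_dataset` call, which returns the 541 examples of the `train` split. Each example is stored as a dictionary containing the prompt text together with the corresponding instruction ID list and arguments. A typical workflow is to submit each prompt to the model under evaluation, collect the generated response, and then check every instruction against the response using the heuristic checks published in the accompanying repository (the instructions themselves are described in Table 1 of the paper). For an example that carries an "at least 300 words" instruction, the check is simply whether the word count of the response meets the threshold. The model's instruction-following performance is then summarized as the fraction of instructions satisfied. Because no extra annotation or human review is needed, the procedure supports large-scale automated evaluation and is a convenient tool for model iteration and comparison.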
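The workflow above can be sketched end to end. This is a minimal illustration under stated assumptions: `SAMPLES` is a toy stand-in for records from `load_dataset("google/IFEval")["train"]`, `stub_model` is a placeholder for a real model call, and the two checkers are simplified, hypothetical versions rather than the official ones. It reports two aggregates: the share of individual instructions followed, and the share of prompts whose instructions were all followed.

```python
from typing import Any, Callable


def no_comma(response: str, **_: Any) -> bool:
    """Simplified check: the response contains no comma."""
    return "," not in response


def number_words(response: str, relation: str = "at least", num_words: int = 0, **_: Any) -> bool:
    """Simplified check: whitespace word count against a threshold."""
    count = len(response.split())
    return count >= num_words if relation == "at least" else count < num_words


# Hypothetical registry mapping instruction IDs to checker functions.
CHECKERS: dict[str, Callable[..., bool]] = {
    "punctuation:no_comma": no_comma,
    "length_constraints:number_words": number_words,
}

# Toy records with the same fields as the dataset (illustrative content only).
SAMPLES = [
    {
        "key": 1,
        "prompt": "Describe the sea in at least 5 words without commas.",
        "instruction_id_list": ["punctuation:no_comma", "length_constraints:number_words"],
        "kwargs": [
            {"relation": None, "num_words": None},
            {"relation": "at least", "num_words": 5},
        ],
    },
    {
        "key": 2,
        "prompt": "Describe the sky in fewer than 4 words.",
        "instruction_id_list": ["length_constraints:number_words"],
        "kwargs": [{"relation": "less than", "num_words": 4}],
    },
]


def stub_model(prompt: str) -> str:
    """Placeholder for a real model call; always returns the same text."""
    return "The sea is wide and deep and blue"


def evaluate(samples: list[dict[str, Any]], generate: Callable[[str], str]) -> tuple[float, float]:
    """Return (instruction-level accuracy, prompt-level accuracy)."""
    followed = total = prompts_fully_followed = 0
    for sample in samples:
        response = generate(sample["prompt"])
        results = []
        for instruction_id, kwargs in zip(sample["instruction_id_list"], sample["kwargs"]):
            set_kwargs = {k: v for k, v in kwargs.items() if v is not None}
            results.append(CHECKERS[instruction_id](response, **set_kwargs))
        followed += sum(results)
        total += len(results)
        prompts_fully_followed += all(results)
    return followed / total, prompts_fully_followed / len(samples)


instruction_acc, prompt_acc = evaluate(SAMPLES, stub_model)
print(f"instruction-level: {instruction_acc:.2f}, prompt-level: {prompt_acc:.2f}")
```

With the stub response, the first prompt passes both of its checks and the second fails its length check, so two of three instructions and one of two prompts are counted as followed.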

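Background and Challenges

Background Overview

Instruction following is one of the core dimensions for judging the practical usefulness of a large language model (LLM), yet earlier benchmarks mostly focused on knowledge question answering or the quality of generated text and did not systematically test whether a model can carry out complex, multi-constraint instructions precisely. Against this background, Le Hou, Jeffrey Zhou, and colleagues at Google Research proposed IFEval (Instruction-Following Evaluation) in 2023. The dataset contains around 500 prompts built from verifiable instructions, each with constraints that heuristic rules can check automatically, such as word limits, keyword frequency, and format requirements. IFEval gave LLM instruction following a standardized, reproducible quantitative evaluation framework, and it was soon adopted as a core benchmark of the Open LLM Leaderboard, supporting research on the reliability and controllability of models in real human-computer interaction.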

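Current Challenges

The problem IFEval addresses is that existing evaluation methods struggle to quantify how closely a model complies with an instruction at a fine-grained level. When an instruction carries several constraints at once, models often miss one or drift from it. By designing instructions that can be verified automatically (such as "write in more than 400 words" or "mention the keyword of AI at least 3 times"), the dataset turns a subjective notion of instruction following into an objective binary judgment, which avoids the high cost and poor consistency of human evaluation. During construction, the main challenge was balancing coverage against ambiguity: each instruction's verification rule must be explicit enough to avoid misjudgment, while the instruction set must stay diverse enough to reflect the complex combinations found in real user requests. Two further technical difficulties were keeping instructions orthogonal to avoid redundancy, and making the heuristic checks robust to variations in how models format their output.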
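One way to make a heuristic check more tolerant of output formatting is to run it on several relaxed variants of the response and accept if any variant passes. The sketch below is an illustration of that idea, similar in spirit to the "loose" criterion described in the IFEval paper, not a copy of the official code; the function names and the exact set of transformations are assumptions.

```python
from typing import Callable, Iterator


def response_variants(response: str) -> Iterator[str]:
    """Yield the response plus relaxed variants of it.

    Variants drop the first line (for example an introductory phrase), the last
    line (for example a closing remark), or both, each with and without
    markdown asterisks.
    """
    lines = response.split("\n")
    candidates = (
        response,
        "\n".join(lines[1:]).strip(),
        "\n".join(lines[:-1]).strip(),
        "\n".join(lines[1:-1]).strip(),
    )
    for text in candidates:
        yield text
        yield text.replace("*", "")


def follows_loosely(response: str, check: Callable[[str], bool]) -> bool:
    """Accept if the check passes on any non-empty variant of the response."""
    return any(check(variant) for variant in response_variants(response) if variant.strip())


def no_comma(text: str) -> bool:
    """Simplified example check: the text contains no comma."""
    return "," not in text


response = "Sure, here is the summary:\nThe sea is wide and deep and blue"
print(no_comma(response))                   # strict check on the raw response
print(follows_loosely(response, no_comma))  # relaxed check across variants
```

Here the strict check fails only because of the comma in the introductory line, while the relaxed check passes once that line is dropped. A response whose body itself violates the instruction still fails under both.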

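Common Scenarios

Classic Use Case

In natural language processing, the ability of a large language model to follow instructions is a key measure of its capability. IFEval contains around 500 prompts built from verifiable instructions, such as "write in more than 400 words" or "mention the keyword of AI at least 3 times", which heuristic rules can check automatically. Its classic use is as a benchmark for evaluating how well instruction-tuned or chat-optimized language models follow fine-grained constraints. Researchers measure how closely a model complies with explicit conditions on word count, format, keyword frequency, and similar properties, which quantifies its ability to understand and execute instructions and allows fair comparison between models on platforms such as the Open LLM Leaderboard.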

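Derived and Related Work

IFEval has led to a range of follow-up work. Its core idea, evaluating models through verifiable instructions, was adopted by the Open LLM Leaderboard as a core benchmark and has been used in the iterative evaluation of model families such as Llama and Mistral. Researchers have built on it to create multilingual instruction-following test sets and to probe model robustness with adversarial samples. The dataset has also inspired constraint-based fine-tuning methods, such as using reinforcement learning to optimize instruction compliance, as well as frameworks that automatically generate verifiable instructions, adding to the toolbox for language model alignment research.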

Recent Research on the Dataset