
IFEval

ModelScope community · Updated 2026-05-16 · Indexed 2025-04-26
Download link:
https://modelscope.cn/datasets/google/IFEval
Resource description:
# Dataset Card for IFEval

<!-- Provide a quick summary of the dataset. -->

## Dataset Description

- **Repository:** https://github.com/google-research/google-research/tree/master/instruction_following_eval
- **Paper:** https://huggingface.co/papers/2311.07911
- **Leaderboard:** https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
- **Point of Contact:** [Le Hou](lehou@google.com)

### Dataset Summary

This dataset contains the prompts used in the [Instruction-Following Eval (IFEval) benchmark](https://arxiv.org/abs/2311.07911) for large language models. It contains around 500 "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times" which can be verified by heuristics. To load the dataset, run:

```python
from datasets import load_dataset

ifeval = load_dataset("google/IFEval")
```

### Supported Tasks and Leaderboards

The IFEval dataset is designed for evaluating chat or instruction fine-tuned language models and is one of the core benchmarks used in the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard).

### Languages

The data in IFEval are in English (BCP-47 en).

## Dataset Structure

### Data Instances

An example of the `train` split looks as follows:

```
{
    "key": 1000,
    "prompt": 'Write a 300+ word summary of the wikipedia page "https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*.',
    "instruction_id_list": [
        "punctuation:no_comma",
        "detectable_format:number_highlighted_sections",
        "length_constraints:number_words",
    ],
    "kwargs": [
        {
            "num_highlights": None, "relation": None, "num_words": None, "num_placeholders": None,
            "prompt_to_repeat": None, "num_bullets": None, "section_spliter": None, "num_sections": None,
            "capital_relation": None, "capital_frequency": None, "keywords": None, "num_paragraphs": None,
            "language": None, "let_relation": None, "letter": None, "let_frequency": None,
            "end_phrase": None, "forbidden_words": None, "keyword": None, "frequency": None,
            "num_sentences": None, "postscript_marker": None, "first_word": None, "nth_paragraph": None,
        },
        {
            "num_highlights": 3, "relation": None, "num_words": None, "num_placeholders": None,
            "prompt_to_repeat": None, "num_bullets": None, "section_spliter": None, "num_sections": None,
            "capital_relation": None, "capital_frequency": None, "keywords": None, "num_paragraphs": None,
            "language": None, "let_relation": None, "letter": None, "let_frequency": None,
            "end_phrase": None, "forbidden_words": None, "keyword": None, "frequency": None,
            "num_sentences": None, "postscript_marker": None, "first_word": None, "nth_paragraph": None,
        },
        {
            "num_highlights": None, "relation": "at least", "num_words": 300, "num_placeholders": None,
            "prompt_to_repeat": None, "num_bullets": None, "section_spliter": None, "num_sections": None,
            "capital_relation": None, "capital_frequency": None, "keywords": None, "num_paragraphs": None,
            "language": None, "let_relation": None, "letter": None, "let_frequency": None,
            "end_phrase": None, "forbidden_words": None, "keyword": None, "frequency": None,
            "num_sentences": None, "postscript_marker": None, "first_word": None, "nth_paragraph": None,
        },
    ],
}
```

### Data Fields

The data fields are as follows:

* `key`: A unique ID for the prompt.
* `prompt`: Describes the task the model should perform.
* `instruction_id_list`: An array of verifiable instructions. See Table 1 of the paper for the full set with their descriptions.
* `kwargs`: An array of arguments used to specify each verifiable instruction in `instruction_id_list`.

### Data Splits

|               | train |
|---------------|------:|
| IFEval        |   541 |

### Licensing Information

The dataset is available under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).

### Citation Information

```
@misc{zhou2023instructionfollowingevaluationlargelanguage,
      title={Instruction-Following Evaluation for Large Language Models},
      author={Jeffrey Zhou and Tianjian Lu and Swaroop Mishra and Siddhartha Brahma and Sujoy Basu and Yi Luan and Denny Zhou and Le Hou},
      year={2023},
      eprint={2311.07911},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2311.07911},
}
```
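### Example: pairing instructions with their arguments

The Data Fields section describes `instruction_id_list` and `kwargs` as parallel arrays, and the example instance shows that each `kwargs` entry carries every possible argument with unused ones set to `None`. The sketch below shows one way to pair the two arrays and apply simple heuristic checks to a model response. The helper names (`active_kwargs`, `pair_instructions`, `evaluate`, and the `check_*` functions) are hypothetical, and the checks are simplified assumptions for illustration, not the official IFEval implementation in the repository linked above. The record is a trimmed copy of the example instance with only three of the argument keys kept.

```python
import re


def active_kwargs(kwargs_list):
    """Drop the None placeholders so each dict keeps only the arguments that are set."""
    return [{k: v for k, v in kw.items() if v is not None} for kw in kwargs_list]


def pair_instructions(example):
    """Pair each instruction ID with the arguments at the same index in `kwargs`."""
    return list(zip(example["instruction_id_list"], active_kwargs(example["kwargs"])))


# Hypothetical, simplified checkers (assumptions, not the official IFEval code).
def check_no_comma(response):
    return "," not in response


def check_highlights(response, num_highlights):
    # Counts spans written like *highlighted section*.
    return len(re.findall(r"\*[^\n*]+\*", response)) >= num_highlights


def check_num_words(response, relation, num_words):
    # Only the "at least" relation shown in the example instance is handled here.
    if relation != "at least":
        raise ValueError(f"relation not handled in this sketch: {relation!r}")
    return len(response.split()) >= num_words


CHECKERS = {
    "punctuation:no_comma": check_no_comma,
    "detectable_format:number_highlighted_sections": check_highlights,
    "length_constraints:number_words": check_num_words,
}


def evaluate(example, response):
    """Return one boolean per instruction, in the order of `instruction_id_list`."""
    return [CHECKERS[iid](response, **kw) for iid, kw in pair_instructions(example)]


# Trimmed copy of the example instance (most None-valued keys omitted).
example = {
    "instruction_id_list": [
        "punctuation:no_comma",
        "detectable_format:number_highlighted_sections",
        "length_constraints:number_words",
    ],
    "kwargs": [
        {"num_highlights": None, "relation": None, "num_words": None},
        {"num_highlights": 3, "relation": None, "num_words": None},
        {"num_highlights": None, "relation": "at least", "num_words": 300},
    ],
}

if __name__ == "__main__":
    for instruction_id, args in pair_instructions(example):
        print(instruction_id, args)
```

With the dataset loaded as shown in the Dataset Summary, a record such as `ifeval["train"][0]` could be passed to `pair_instructions` in place of the trimmed `example`. Instruction IDs beyond the three covered here would need their own entries in `CHECKERS`.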

Provider:
maas
Created:
2025-04-21
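Background overview
IFEval is a dataset of about 500 verifiable instructions for evaluating the instruction-following ability of large language models. The instructions can be checked with heuristics, and the dataset is one of the core benchmarks of the Open LLM Leaderboard. The data is primarily in English, clearly structured, and suited to evaluating chat or instruction fine-tuned language models.
The above content was collected and summarized by 遇见数据集.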