IFEval
Source: ModelScope community (updated 2026-05-16, indexed 2025-04-26)
Download link: https://modelscope.cn/datasets/google/IFEval
# Dataset Card for IFEval
## Dataset Description
- **Repository:** https://github.com/google-research/google-research/tree/master/instruction_following_eval
- **Paper:** https://huggingface.co/papers/2311.07911
- **Leaderboard:** https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
- **Point of Contact:** [Le Hou](mailto:lehou@google.com)
### Dataset Summary
This dataset contains the prompts used in the [Instruction-Following Eval (IFEval) benchmark](https://arxiv.org/abs/2311.07911) for large language models. It contains around 500 prompts, each built from one or more "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times", which can be checked by heuristics. To load the dataset, run:
```python
from datasets import load_dataset
ifeval = load_dataset("google/IFEval")
```
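The two instructions quoted in the summary can be checked with simple heuristics. The following is a minimal sketch, not the benchmark's official checker: the function names are hypothetical, words are counted by splitting on whitespace, and keywords are matched as case-insensitive whole words, all of which are assumptions made for illustration.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-separated tokens (an assumed, simplified word definition)."""
    return len(text.split())


def has_more_words_than(text: str, threshold: int) -> bool:
    """Check an instruction such as "write in more than 400 words"."""
    return count_words(text) > threshold


def keyword_count(text: str, keyword: str) -> int:
    """Count case-insensitive whole-word occurrences of `keyword`."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def mentions_at_least(text: str, keyword: str, times: int) -> bool:
    """Check an instruction such as "mention the keyword AI at least 3 times"."""
    return keyword_count(text, keyword) >= times


if __name__ == "__main__":
    response = "AI is everywhere. Many people use AI daily, and AI tools keep improving."
    print(has_more_words_than(response, 400))
    print(mentions_at_least(response, "AI", 3))
```

Because both checks are deterministic string operations, they need no human or model judge, which is what makes such instructions "verifiable".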
### Supported Tasks and Leaderboards
The IFEval dataset is designed for evaluating chat or instruction fine-tuned language models and is one of the core benchmarks used in the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard).
### Languages
The data in IFEval are in English (BCP-47 en).
## Dataset Structure
### Data Instances
An example of the `train` split looks as follows:
```
{
    "key": 1000,
    "prompt": 'Write a 300+ word summary of the wikipedia page "https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*.',
    "instruction_id_list": [
        "punctuation:no_comma",
        "detectable_format:number_highlighted_sections",
        "length_constraints:number_words",
    ],
    "kwargs": [
        {
            "num_highlights": None,
            "relation": None,
            "num_words": None,
            "num_placeholders": None,
            "prompt_to_repeat": None,
            "num_bullets": None,
            "section_spliter": None,
            "num_sections": None,
            "capital_relation": None,
            "capital_frequency": None,
            "keywords": None,
            "num_paragraphs": None,
            "language": None,
            "let_relation": None,
            "letter": None,
            "let_frequency": None,
            "end_phrase": None,
            "forbidden_words": None,
            "keyword": None,
            "frequency": None,
            "num_sentences": None,
            "postscript_marker": None,
            "first_word": None,
            "nth_paragraph": None,
        },
        {
            "num_highlights": 3,
            "relation": None,
            "num_words": None,
            "num_placeholders": None,
            "prompt_to_repeat": None,
            "num_bullets": None,
            "section_spliter": None,
            "num_sections": None,
            "capital_relation": None,
            "capital_frequency": None,
            "keywords": None,
            "num_paragraphs": None,
            "language": None,
            "let_relation": None,
            "letter": None,
            "let_frequency": None,
            "end_phrase": None,
            "forbidden_words": None,
            "keyword": None,
            "frequency": None,
            "num_sentences": None,
            "postscript_marker": None,
            "first_word": None,
            "nth_paragraph": None,
        },
        {
            "num_highlights": None,
            "relation": "at least",
            "num_words": 300,
            "num_placeholders": None,
            "prompt_to_repeat": None,
            "num_bullets": None,
            "section_spliter": None,
            "num_sections": None,
            "capital_relation": None,
            "capital_frequency": None,
            "keywords": None,
            "num_paragraphs": None,
            "language": None,
            "let_relation": None,
            "letter": None,
            "let_frequency": None,
            "end_phrase": None,
            "forbidden_words": None,
            "keyword": None,
            "frequency": None,
            "num_sentences": None,
            "postscript_marker": None,
            "first_word": None,
            "nth_paragraph": None,
        },
    ],
}
```
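The instance above asks for no commas and at least 3 highlighted sections. As a rough illustration of how such constraints might be checked against a model response, here is a sketch under stated assumptions: the function names are hypothetical, a highlighted section is taken to be non-empty text wrapped in single asterisks on one line, and the logic is a simplification rather than the checker shipped in the benchmark repository.

```python
import re


def check_no_comma(response: str) -> bool:
    """Pass when the response contains no comma character."""
    return "," not in response


def count_highlighted_sections(response: str) -> int:
    """Count spans like *some title* (assumed definition of a highlight)."""
    spans = re.findall(r"\*[^\n*]+\*", response)
    # Ignore spans whose content is only whitespace, e.g. "* *".
    return sum(1 for span in spans if span.strip("*").strip())


def check_number_highlighted_sections(response: str, num_highlights: int) -> bool:
    """Pass when the response has at least `num_highlights` highlighted spans."""
    return count_highlighted_sections(response) >= num_highlights


if __name__ == "__main__":
    response = (
        "*Early life* He ruled Tripoli. "
        "*Regency* He governed Jerusalem. "
        "*Legacy* Historians debate his role."
    )
    print(check_no_comma(response))
    print(check_number_highlighted_sections(response, num_highlights=3))
```

The `num_highlights` argument corresponds to the only non-`None` value in the second `kwargs` entry of the instance.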
### Data Fields
The data fields are as follows:
* `key`: A unique ID for the prompt.
* `prompt`: Describes the task the model should perform.
* `instruction_id_list`: An array of IDs naming the verifiable instructions in the prompt. See Table 1 of the paper for the full set with their descriptions.
* `kwargs`: An array of argument dictionaries, one per entry in `instruction_id_list` and in the same order. Arguments that do not apply to an instruction are `None`.
### Data Splits
| | train |
|---------------|------:|
| IFEval | 541 |
### Licensing Information
The dataset is available under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).
### Citation Information
```
@misc{zhou2023instructionfollowingevaluationlargelanguage,
title={Instruction-Following Evaluation for Large Language Models},
author={Jeffrey Zhou and Tianjian Lu and Swaroop Mishra and Siddhartha Brahma and Sujoy Basu and Yi Luan and Denny Zhou and Le Hou},
year={2023},
eprint={2311.07911},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2311.07911},
}
```
Provider: maas
Created: 2025-04-21
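Background overview
IFEval is a dataset of roughly 500 prompts built from verifiable instructions, used to evaluate how well large language models follow instructions. Its instructions can be checked heuristically, and it is one of the core benchmarks of the Open LLM Leaderboard. The data are primarily in English, clearly structured, and suited to evaluating chat or instruction fine-tuned language models.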