Multi-IF
收藏魔搭社区2026-05-15 更新2025-05-24 收录
下载链接:
https://modelscope.cn/datasets/facebook/Multi-IF
下载链接
链接失效反馈官方服务:
资源简介:
### Dataset Summary
We introduce Multi-IF, a new benchmark designed to assess LLMs' proficiency in following multi-turn and multilingual instructions. Multi-IF, which utilizes a hybrid framework combining LLM and human annotators, expands upon the IFEval by incorporating multi-turn sequences and translating the English prompts into another 7 languages, resulting in a dataset of 4501 multilingual conversations, where each has three turns. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. All the models tested showed a higher rate of failure in executing instructions correctly with each additional turn. For example, o1-preview drops from 0.877 at the first turn to 0.707 at the third turn in terms of average accuracy over all languages. Moreover, languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models’ multilingual capabilities.
### Evaluation Script
https://github.com/facebookresearch/Multi-IF
### Data Fields
* `turns`: Placehold for saving the history conversation in evaluation.
* `responses`: Placehold for saving the latest response in evaluation.
* `turn_1_prompt`: The user prompt at the first turn, which is the input for LLM generation.
* `turn_1_instruction_id_list`: The instructions of the user prompt at the first turn, which is needed in the evaluation script.
* `turn_1_kwargs`: The arguments of the first turn instructions, which is needed in the evaluation script.
* `turn_2_prompt`: The user prompt at the second turn, which is the input for LLM generation.
* `turn_2_instruction_id_list`: The instructions of the user prompt at the second turn, which is needed in the evaluation script.
* `turn_2_kwargs`: The arguments of the second turn instructions, which is needed in the evaluation script.
* `turn_3_prompt`: The user prompt at the third turn, which is the input for LLM generation.
* `turn_3_instruction_id_list`: The instructions of the user prompt at the third turn, which is needed in the evaluation script.
* `turn_3_kwargs`: The arguments of the third turn instructions, which is needed in the evaluation script.
* `key`: The key of each conversation
* `turn_index`: Placehold for saving the current turn index in evaluation.
* `language`: The language of each conversation
### Data Splits
* test: 4,501 examples
### 数据集概述
我们提出了Multi-IF基准测试,该基准旨在评估大语言模型(LLM)遵循多轮多语言指令的能力。Multi-IF采用大语言模型与人类标注者结合的混合框架,在IFEval基准的基础上进行扩展:不仅新增了多轮对话序列,还将英文提示词翻译为另外7种语言,最终构建了包含4501条三轮多语言对话的数据集。我们在Multi-IF上对14个当前顶尖的大语言模型开展了评估,结果表明该基准的任务难度显著高于现有同类基准。所有参与测试的模型均呈现出随对话轮次增加,正确执行指令的失败率持续升高的趋势。例如,o1-preview在所有语言上的平均准确率从第一轮的0.877下降至第三轮的0.707。此外,使用非拉丁字母的语言(印地语、俄语及中文)普遍表现出更高的错误率,这暗示当前大语言模型在多语言能力方面存在潜在局限。
### 评估脚本
https://github.com/facebookresearch/Multi-IF
### 数据字段
* `turns`:用于存储评估过程中的对话历史(占位符)。
* `responses`:用于存储评估过程中的最新回复(占位符)。
* `turn_1_prompt`:第一轮对话的用户提示词,作为大语言模型生成的输入。
* `turn_1_instruction_id_list`:第一轮用户提示词对应的指令ID列表,评估脚本需使用该字段。
* `turn_1_kwargs`:第一轮指令的参数,评估脚本需使用该字段。
* `turn_2_prompt`:第二轮对话的用户提示词,作为大语言模型生成的输入。
* `turn_2_instruction_id_list`:第二轮用户提示词对应的指令ID列表,评估脚本需使用该字段。
* `turn_2_kwargs`:第二轮指令的参数,评估脚本需使用该字段。
* `turn_3_prompt`:第三轮对话的用户提示词,作为大语言模型生成的输入。
* `turn_3_instruction_id_list`:第三轮用户提示词对应的指令ID列表,评估脚本需使用该字段。
* `turn_3_kwargs`:第三轮指令的参数,评估脚本需使用该字段。
* `key`:每条对话的唯一标识键。
* `turn_index`:用于存储评估过程中的当前对话轮次索引(占位符)。
* `language`:每条对话使用的语言。
### 数据划分
* 测试集:共4501条样本。
提供机构:
maas
创建时间:
2025-05-20



