Multi-IF

Name: Multi-IF
Creator: maas
Published: 2026-05-15 15:21:40
License: 暂无描述

魔搭社区2026-05-15 更新2025-05-24 收录

下载链接：

https://modelscope.cn/datasets/facebook/Multi-IF

下载链接

链接失效反馈

官方服务：

资源简介：

### Dataset Summary We introduce Multi-IF, a new benchmark designed to assess LLMs' proficiency in following multi-turn and multilingual instructions. Multi-IF, which utilizes a hybrid framework combining LLM and human annotators, expands upon the IFEval by incorporating multi-turn sequences and translating the English prompts into another 7 languages, resulting in a dataset of 4501 multilingual conversations, where each has three turns. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. All the models tested showed a higher rate of failure in executing instructions correctly with each additional turn. For example, o1-preview drops from 0.877 at the first turn to 0.707 at the third turn in terms of average accuracy over all languages. Moreover, languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models’ multilingual capabilities. ### Evaluation Script https://github.com/facebookresearch/Multi-IF ### Data Fields * `turns`: Placehold for saving the history conversation in evaluation. * `responses`: Placehold for saving the latest response in evaluation. * `turn_1_prompt`: The user prompt at the first turn, which is the input for LLM generation. * `turn_1_instruction_id_list`: The instructions of the user prompt at the first turn, which is needed in the evaluation script. * `turn_1_kwargs`: The arguments of the first turn instructions, which is needed in the evaluation script. * `turn_2_prompt`: The user prompt at the second turn, which is the input for LLM generation. * `turn_2_instruction_id_list`: The instructions of the user prompt at the second turn, which is needed in the evaluation script. * `turn_2_kwargs`: The arguments of the second turn instructions, which is needed in the evaluation script. * `turn_3_prompt`: The user prompt at the third turn, which is the input for LLM generation. * `turn_3_instruction_id_list`: The instructions of the user prompt at the third turn, which is needed in the evaluation script. * `turn_3_kwargs`: The arguments of the third turn instructions, which is needed in the evaluation script. * `key`: The key of each conversation * `turn_index`: Placehold for saving the current turn index in evaluation. * `language`: The language of each conversation ### Data Splits * test: 4,501 examples

### 数据集概述我们提出了Multi-IF基准测试，该基准旨在评估大语言模型（LLM）遵循多轮多语言指令的能力。Multi-IF采用大语言模型与人类标注者结合的混合框架，在IFEval基准的基础上进行扩展：不仅新增了多轮对话序列，还将英文提示词翻译为另外7种语言，最终构建了包含4501条三轮多语言对话的数据集。我们在Multi-IF上对14个当前顶尖的大语言模型开展了评估，结果表明该基准的任务难度显著高于现有同类基准。所有参与测试的模型均呈现出随对话轮次增加，正确执行指令的失败率持续升高的趋势。例如，o1-preview在所有语言上的平均准确率从第一轮的0.877下降至第三轮的0.707。此外，使用非拉丁字母的语言（印地语、俄语及中文）普遍表现出更高的错误率，这暗示当前大语言模型在多语言能力方面存在潜在局限。 ### 评估脚本 https://github.com/facebookresearch/Multi-IF ### 数据字段 * `turns`：用于存储评估过程中的对话历史（占位符）。 * `responses`：用于存储评估过程中的最新回复（占位符）。 * `turn_1_prompt`：第一轮对话的用户提示词，作为大语言模型生成的输入。 * `turn_1_instruction_id_list`：第一轮用户提示词对应的指令ID列表，评估脚本需使用该字段。 * `turn_1_kwargs`：第一轮指令的参数，评估脚本需使用该字段。 * `turn_2_prompt`：第二轮对话的用户提示词，作为大语言模型生成的输入。 * `turn_2_instruction_id_list`：第二轮用户提示词对应的指令ID列表，评估脚本需使用该字段。 * `turn_2_kwargs`：第二轮指令的参数，评估脚本需使用该字段。 * `turn_3_prompt`：第三轮对话的用户提示词，作为大语言模型生成的输入。 * `turn_3_instruction_id_list`：第三轮用户提示词对应的指令ID列表，评估脚本需使用该字段。 * `turn_3_kwargs`：第三轮指令的参数，评估脚本需使用该字段。 * `key`：每条对话的唯一标识键。 * `turn_index`：用于存储评估过程中的当前对话轮次索引（占位符）。 * `language`：每条对话使用的语言。 ### 数据划分 * 测试集：共4501条样本。

提供机构：

maas

创建时间：

2025-05-20

5,000+

优质数据集

54 个

任务类型

进入经典数据集