EIFBENCH
Source: arXiv | Updated: 2025-06-10 | Indexed: 2025-11-28
Download link:
https://github.com/Hope-Rita/EIFBench
Description:
EIFBENCH is an extremely complex instruction-following benchmark dataset developed by Tongyi Lab at Alibaba Group to give large language models (LLMs) a more realistic and robust evaluation framework. It uses multi-task scenarios, so a model's performance on different task types is evaluated simultaneously, and it combines many kinds of constraints to simulate complex operating environments. EIFBENCH consists of 1,000 instances, averaging 74.01 constraints and 8.24 instructions per instance. The task types covered are classification, information extraction, text generation, dialogue systems, reasoning and logic, language style, evaluation and verification, and programming-related tasks. The dataset was built through task and constraint classification, multi-scenario data collection, task and constraint expansion, quality control, and response generation and evaluation. EIFBENCH is intended to address the shortcomings of existing benchmarks in simulating complex real-world scenarios and to provide a more comprehensive and accurate evaluation tool for LLMs.
Provider:
Alibaba Group
Created:
2025-06-10



