
Inverse_IFEval

ModelScope community. Updated 2026-01-06; indexed 2025-09-06.
Download link:
https://modelscope.cn/datasets/m-a-p/Inverse_IFEval
# Overview

Inverse IFEval is a novel benchmark designed to evaluate large language models' (LLMs) ability to follow counterintuitive instructions that deliberately deviate from conventional training paradigms. The dataset challenges models to override their ingrained training conventions and faithfully execute instructions that conflict with standard cognitive patterns or annotation norms.

![IFEval vs Inverse IFEval (1)_01.png](https://cdn-uploads.huggingface.co/production/uploads/66a9a55d7cda19fabeedbb89/YLnU2ppuahKxkDlz9etwR.png)

# Key Features

- **Counter-Cognitive Evaluation**: Measures models' ability to suppress training-induced biases and follow unconventional instructions
- **Eight Instruction Types**: Systematically designed categories that invert standard training paradigms
- **Multilingual Support**: Balanced Chinese and English versions (506 samples each)
- **Diverse Domains**: Covers 23 knowledge domains including computer science, mathematics, law, and biology
- **High-Quality Construction**: Multi-stage human-in-the-loop pipeline with expert verification

# Dataset Composition

The dataset contains 1,012 high-quality questions distributed across eight instruction types:

| Instruction Type | Samples | Avg Q Length | Avg A Length |
|------------------|---------|--------------|--------------|
| Question Correction | 90 | 164.3 | 135.9 |
| Intentional Textual Flaws | 86 | 254.0 | 306.7 |
| Code without Comments | 198 | 555.5 | 1517.5 |
| Counter-Conventional Formatting | 82 | 22.7 | 195.8 |
| Deliberately Incorrect Answers | 186 | 343.2 | 296.3 |
| Instructional Induction | 154 | 545.5 | 156.2 |
| Mid-turn Instruction Modification | 108 | 2472.7 | 196.9 |
| Counterfactual Answering | 108 | 647.2 | 183.0 |

![category_en2_01.png](https://cdn-uploads.huggingface.co/production/uploads/66a9a55d7cda19fabeedbb89/AaxGas-v1GSIOvfS-dysA.png)

# Construction Methodology

The dataset was built through a rigorous five-stage process:

1. **Observation & Reversal**: Analyze and invert standard SFT paradigms
2. **Seed Data Construction**: Expert-crafted seed questions
3. **Large-Scale Generation**: LLM-assisted question generation
4. **Automatic Filtering**: Quality control mechanisms
5. **Human Verification**: Expert review and calibration

![crop_data_construct_01.png](https://cdn-uploads.huggingface.co/production/uploads/66a9a55d7cda19fabeedbb89/8mJcXqFp8fbZy5oGqUidL.png)

# Evaluation Protocol

- Uses "LLM-as-a-Judge" paradigm with 98% accuracy
- Adaptive judge model matrix for different instruction types
- Optimized judging templates and system prompts
- Scores range 0-100 based on instruction-following fidelity

## "LLM-as-a-Judge" pipeline

Our automated evaluation framework, as illustrated in the figure, follows a structured "LLM-as-a-Judge" approach. The pipeline consists of the following steps:

![whiteboard_exported_image.png](https://cdn-uploads.huggingface.co/production/uploads/66a9a55d7cda19fabeedbb89/MQlZNBbtSiN2PnVXqgu4j.png)

1. Generation: An input prompt from the evaluation dataset is provided to the Model Under Evaluation (the target model being tested). The model generates a response based on this prompt.
2. Input Assembly for Judge: The generated response is then prepared for assessment. We construct a comprehensive input for the Judge Model by populating a Judge Prompt Template. This template integrates the model's response with a ground-truth Reference Answer and Scoring Criteria that outline the key points for a correct answer.
3. Evaluation: The assembled prompt, along with a high-level Judge System Prompt (which instructs the judge on its task, persona, and output format), is fed into the Judge Model.
4. Output: The Judge Model processes the inputs and delivers its verdict, which includes a numerical score and a textual Evaluation Rationale justifying the given score.
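
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: the template text, the system prompt, the `Score:` / `Rationale:` output format, the sample field names (`question`, `reference_answer`, `scoring_criteria`), and the `generate` / `judge` callables are all hypothetical stand-ins for whatever models and templates are really used.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical judge template; the real templates vary by instruction type.
JUDGE_PROMPT_TEMPLATE = (
    "Model response:\n{response}\n\n"
    "Reference answer:\n{reference}\n\n"
    "Scoring criteria:\n{criteria}\n"
)

# Hypothetical system prompt fixing the judge's task and output format.
JUDGE_SYSTEM_PROMPT = (
    "You are a strict grader. Reply with 'Score: <0-100>' on the first line "
    "and 'Rationale: <text>' after it."
)


@dataclass
class JudgeVerdict:
    score: int
    rationale: str


def assemble_judge_input(response: str, reference: str, criteria: str) -> str:
    """Step 2: populate the judge template with response, reference, criteria."""
    return JUDGE_PROMPT_TEMPLATE.format(
        response=response, reference=reference, criteria=criteria
    )


def parse_verdict(raw: str) -> JudgeVerdict:
    """Step 4: extract a 0-100 score and the rationale from the judge output."""
    score_match = re.search(r"Score:\s*(\d+)", raw)
    if score_match is None:
        raise ValueError("judge output has no 'Score:' field")
    score = int(score_match.group(1))
    if not 0 <= score <= 100:
        raise ValueError(f"score {score} is outside the 0-100 range")
    rationale_match = re.search(r"Rationale:\s*(.*)", raw, flags=re.DOTALL)
    rationale = rationale_match.group(1).strip() if rationale_match else ""
    return JudgeVerdict(score=score, rationale=rationale)


def evaluate_sample(
    sample: dict,
    generate: Callable[[str], str],
    judge: Callable[[str, str], str],
) -> JudgeVerdict:
    """Run one sample through generation, input assembly, judging, and parsing."""
    response = generate(sample["question"])  # step 1: model under evaluation
    judge_input = assemble_judge_input(      # step 2: fill the judge template
        response, sample["reference_answer"], sample["scoring_criteria"]
    )
    raw = judge(JUDGE_SYSTEM_PROMPT, judge_input)  # step 3: call the judge model
    return parse_verdict(raw)                      # step 4: score and rationale
```

Passing `generate` and `judge` as plain callables keeps the sketch independent of any particular model API, so either side can be swapped for a stub when testing the parsing logic.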
## Judge Model Matrix

Here is the optimal judge model and judging template structure corresponding to various types of instructions:

![20250915-140514.jpg](https://cdn-uploads.huggingface.co/production/uploads/66a9a55d7cda19fabeedbb89/dq4QqAW62E_Zawqq81qvk.jpeg)

# Key Findings

- Top-performing model (o3-high) achieves 75.66 overall score
- Models show significant performance drops (~30%) on counterintuitive vs conventional instructions
- Thinking mechanisms improve performance by ~15% on average
- Current alignment methods struggle with cognitive inertia

# Intended Use

- Evaluating instruction-following robustness
- Testing model flexibility beyond training distributions
- Identifying limitations in current alignment methods
- Developing more adaptable LLMs

# Citation

When using this dataset, please cite the original work:

```
@misc{zhang2025inverseifevalllmsunlearn,
  title={Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?},
  author={Qinyan Zhang and Xinping Lei and Ruijie Miao and Yu Fu and Haojie Fan and Le Chang and Jiafan Hou and Dingling Zhang and Zhongfei Hou and Ziqiang Yang and Changxin Pu and Fei Hu and Jingkai Liu and Mengyun Liu and Yang Liu and Xiang Gao and Jiaheng Liu and Tong Yang and Zaiyuan Wang and Ge Zhang and Wenhao Huang},
  year={2025},
  eprint={2509.04292},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.04292},
}
```
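
The overall and per-type scores mentioned under Key Findings imply some aggregation of per-sample judge scores. This card does not spell out the aggregation rule, so the sketch below assumes the simplest one: a plain mean of the 0-100 scores, computed per instruction type and over all samples. The record layout (`instruction_type`, `score`) is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(records):
    """Return (overall_mean, {instruction_type: mean}) for 0-100 judge scores.

    `records` must be a non-empty list of dicts with the hypothetical keys
    'instruction_type' and 'score'. Plain averaging is an assumption here,
    not a rule confirmed by the dataset card.
    """
    by_type = defaultdict(list)
    for record in records:
        by_type[record["instruction_type"]].append(record["score"])
    per_type = {name: mean(scores) for name, scores in by_type.items()}
    overall = mean(record["score"] for record in records)
    return overall, per_type
```

Because the eight instruction types have different sample counts (82 to 198), this sample-weighted overall mean will generally differ from the unweighted mean of the eight per-type means; which of the two a reported number uses should be checked against the paper.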

Provider: maas
Created: 2025-09-03