KikiNLP/EvolIF
Source: Hugging Face | Updated 2026-04-21 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/KikiNLP/EvolIF
---
license: cc-by-nc-4.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- llm
- benchmark
- multi-turn
- dialogue
- instruction-following
size_categories:
- 1K<n<10K
---
<div align= "center">
<h1 align="center"><i> [ACL2026] One Battle After Another:</i><br> Probing LLMs' Limits of Multi-Turn Instruction Following with an Evolving Benchmark</h1>
</div>
<div align= "center">
<p>
<a href="https://arxiv.org/abs/2511.03508v2">📖 Arxiv</a> |
<a href="https://github.com/JiaQiSJTU/EvolIF">🛠️ Code</a> |
<a href="https://huggingface.co/datasets/KikiNLP/EvolIF">🤗 EvolIF Dataset</a>
</p>
</div>
# Introduction
Evaluating LLMs’ instruction-following ability in multi-topic dialogues is essential yet challenging. Existing benchmarks are limited to a fixed number of turns, which leaves them susceptible to saturation and unable to account for users’ interactive experience. In this work, we propose a novel framework featuring a three-layer tracking mechanism and a query synthesis agent to mimic sequential user behaviors. Grounded in Flow Theory, we introduce process-centric metrics and terminate a conversational evaluation only once user patience is exhausted. Leveraging this framework, we present EvolIF, an evolving benchmark covering 12 constraint groups. Our analysis reveals deficiencies in failure recovery and fine-grained instruction following, with performance stratification becoming evident as conversational depth increases. GPT-5 demonstrates the most sustained resilience, maintaining a 66.40% stability score and outperforming Gemini-3-Pro by 5.59%, while other models lag behind.
# Data
EvolIF is the general-domain realization of the framework. Each file in `dialog_v0.1` is in JSON Lines (`.jsonl`) format, with one JSON object per line. Each line corresponds to one conversational turn in an evolving, multi-topic dialogue.
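A minimal sketch of reading one such file with Python's standard library. The helper names `iter_turns` and `load_dialogue` are hypothetical and not part of the EvolIF repository. Whether a file holds a single dialogue or several is not stated here, so the sketch simply returns records in file order.

```python
import json
from typing import Iterator


def iter_turns(path: str) -> Iterator[dict]:
    """Yield one turn-level record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


def load_dialogue(path: str) -> list:
    """Read every record of one file into a list, preserving file order."""
    return list(iter_turns(path))
```

Calling `load_dialogue` on the path of any `.jsonl` file under `dialog_v0.1` would return a list of dictionaries, one per turn. Streaming with `iter_turns` avoids holding a large file in memory.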
## Record Format
Each line in the released files contains a single turn-level record with the following top-level fields:
- `turn`: Turn index within the dialogue.
- `active_topic`: Integer identifier of the topic active at the current turn.
- `user_query`: Raw user utterance produced by the query-synthesis pipeline.
- `user_query_verified`: Verified user utterance.
- `instructions`: List of structured constraints that should be satisfied by the model response at this turn. Each element is an object with the following fields:
  - `id`: Identifier of the constraint family, such as `format`, `length`, or `forbidden`.
  - `args`: JSON-serializable parameters for the constraint instance, such as modes and thresholds.
  - `description`: Natural-language description of the constraint.
- `style`: Structured style/persona bundle associated with the session, including:
  - `uuid`: Stable identifier for the associated persona/style configuration used in the session.
  - `persona`: Short persona description.
  - `styles`: List of stylistic descriptors, such as tone or register cues, associated with the persona.
- `instruction_success`: Boolean flag indicating whether the turn's instruction stack passed the construction and verification pipeline.
- `topic_success`: Boolean flag indicating whether topic-level requirements were satisfied during construction and verification.
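The field list above can be turned into a lightweight sanity check. The sketch below only tests that the documented keys are present and counts constraint families by `id`. The helper names and the example record are hypothetical: its values, the `max_words` argument, and the turn numbering are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

TOP_LEVEL_FIELDS = (
    "turn", "active_topic", "user_query", "user_query_verified",
    "instructions", "style", "instruction_success", "topic_success",
)
INSTRUCTION_FIELDS = ("id", "args", "description")
STYLE_FIELDS = ("uuid", "persona", "styles")


def missing_fields(record: dict) -> list:
    """Return dotted names of documented fields absent from one record."""
    missing = [name for name in TOP_LEVEL_FIELDS if name not in record]
    for index, instruction in enumerate(record.get("instructions", [])):
        missing += [f"instructions[{index}].{name}"
                    for name in INSTRUCTION_FIELDS if name not in instruction]
    style = record.get("style", {})
    missing += [f"style.{name}" for name in STYLE_FIELDS if name not in style]
    return missing


def constraint_counts(records) -> Counter:
    """Count how often each constraint family `id` appears across records."""
    return Counter(
        instruction["id"]
        for record in records
        for instruction in record.get("instructions", [])
    )


# Made-up record for illustration only; values are not from the dataset.
example = {
    "turn": 1,
    "active_topic": 0,
    "user_query": "Suggest a weekend hiking plan.",
    "user_query_verified": "Suggest a weekend hiking plan.",
    "instructions": [
        {"id": "length", "args": {"max_words": 50},
         "description": "Use at most 50 words."},
    ],
    "style": {"uuid": "0000", "persona": "A busy student.", "styles": ["casual"]},
    "instruction_success": True,
    "topic_success": True,
}

print(missing_fields(example))       # empty list when every documented key is present
print(constraint_counts([example]))  # per-family counts of constraint ids
```

The check is deliberately limited to key presence, because the exact value types and the schema of `args` vary by constraint family and are defined in the code repository rather than in this card.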
# Usage
The full benchmark pipeline, including state evolution, dialogue synthesis, constraint registration, and end-to-end evaluation/scoring, is implemented in the accompanying code repository: [EvolIF](https://github.com/JiaQiSJTU/EvolIF). See `Readme.md` there for setup and usage details.
# Citation
```
@misc{jia2025battleanotherprobingllms,
title={One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework},
author={Qi Jia and Kaiwei Zhang and Xiujie Song and Ye Shen and Xiangyang Zhu and Guangtao Zhai},
year={2025},
eprint={2511.03508v2},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.03508v2},
}
```
Provider:
KikiNLP



