rifts

ModelScope community. Updated 2025-12-05; indexed 2025-07-26.
Download link:
https://modelscope.cn/datasets/microsoft/rifts
Description:
## Navigating Rifts in Human-LLM Grounding: Study and Benchmark

This is the dataset repository for the paper **Navigating Rifts in Human-LLM Grounding: Study and Benchmark** by [Omar Shaikh](https://oshaikh.com/), [Hussein Mozannar](https://husseinmozannar.github.io/), [Gagan Bansal](https://gagb.github.io/), [Adam Fourney](https://www.adamfourney.com/), and [Eric Horvitz](https://erichorvitz.com/).

Feel free to reach out to [Omar Shaikh](https://oshaikh.com/) with any questions!

[[Paper]](https://arxiv.org/abs/2311.09144)

**If you're here for the source code, it's hosted on Github here!**

[[Github]](https://github.com/microsoft/rifts/tree/main)

### *Abstract*

Language models excel at following instructions but often struggle with the collaborative aspects of conversation that humans naturally employ. This limitation in grounding (the process by which conversation participants establish mutual understanding) can lead to outcomes ranging from frustrated users to serious consequences in high-stakes scenarios. To systematically study grounding challenges in human-LLM interactions, we analyze logs from three human-assistant datasets: WildChat, MultiWOZ, and Bing Chat. We develop a taxonomy of grounding acts and build models to annotate and forecast grounding behavior. Our findings reveal significant differences in human-human and human-LLM grounding: LLMs were three times less likely to initiate clarification and sixteen times less likely to provide follow-up requests than humans. Additionally, early grounding failures predicted later interaction breakdowns. Building on these insights, we introduce RIFTS: a benchmark derived from publicly available LLM interaction data containing situations where LLMs fail to initiate grounding. We note that current frontier models perform poorly on RIFTS, highlighting the need to reconsider how we train and prompt LLMs for human interaction. To this end, we develop a preliminary intervention that mitigates grounding failures.

### *Dataset Structure*

This dataset contains examples with the following columns:

- **instruction:** A prompt or instruction.
- **split:** Indicates the data split (e.g., train).
- **label:** The associated grounding label (e.g., none).
- **logits:** A dictionary of logits values for different grounding acts from our pretrained forecaster.

### Example rows:

- `convert rust String to clap::builder::Str`
- `add this code to this code: @dp.callback_query...`
- `give me an argumentative essay outline for poo...`
- `spring security根据不同的角色访问不同的页面的代码是什么`

### *How do I cite this work?*

Feel free to use the following BibTeX entry.

**BibTeX:**

```tex
@misc{shaikh2025navigatingriftshumanllmgrounding,
  title={Navigating Rifts in Human-LLM Grounding: Study and Benchmark},
  author={Omar Shaikh and Hussein Mozannar and Gagan Bansal and Adam Fourney and Eric Horvitz},
  year={2025},
  eprint={2503.13975},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.13975},
}
```
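As a rough illustration of the four columns described under Dataset Structure, the sketch below builds two in-memory rows and turns each row's `logits` dictionary into probabilities with a softmax, then reports the most likely grounding act. The grounding-act key names (`none`, `clarification`, `follow_up`), the logit values, and the label on the second row are assumptions invented for illustration; only `none` appears as an example label in the description above, and the real keys should be read from the dataset itself.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert a dict of raw logits into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {act: math.exp(value - peak) for act, value in logits.items()}
    total = sum(exps.values())
    return {act: e / total for act, e in exps.items()}


def predicted_act(row: dict) -> str:
    """Return the grounding act with the highest forecaster probability."""
    probs = softmax(row["logits"])
    return max(probs, key=probs.get)


# Hypothetical rows that mirror the documented columns. The key names and
# numbers inside "logits" are invented placeholders, not real dataset values.
rows = [
    {
        "instruction": "convert rust String to clap::builder::Str",
        "split": "train",
        "label": "none",
        "logits": {"none": 2.0, "clarification": 0.5, "follow_up": -1.0},
    },
    {
        "instruction": "give me an argumentative essay outline for poo...",
        "split": "train",
        "label": "clarification",  # assumed label, for illustration only
        "logits": {"none": -0.3, "clarification": 1.2, "follow_up": 0.1},
    },
]

for row in rows:
    act = predicted_act(row)
    agrees = act == row["label"]
    print(f"{row['instruction'][:40]!r}: predicted={act}, matches label={agrees}")
```

Working from a plain dictionary keeps the example independent of any particular loader; the same two functions should apply unchanged to rows read from the actual dataset, provided `logits` arrives as a mapping from act name to float.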

Provider:
maas
Created:
2025-07-22