HalluMix
ModelScope community. Updated 2025-08-08; indexed 2025-08-02.
Download link:
https://modelscope.cn/datasets/quotientai/HalluMix
# Introducing HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Detecting Hallucinations in Real-World Scenarios
✉️ **Contact:** {deanna, mike, freddie, julia}@quotientai.co \
📜 **Paper:** [_HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection_, Emery et al. (2025)](https://arxiv.org/abs/2505.00506)
As large language models (LLMs) are increasingly adopted in critical industries, ensuring their outputs are factually grounded has emerged as a major concern. One prominent issue is "hallucination," where models generate content unsupported by or contrary to the provided evidence. Existing hallucination detection benchmarks are often limited, synthetic, or narrowly focused on specific tasks like question-answering. Recognizing this gap, we developed [`HalluMix`](https://arxiv.org/abs/2505.00506): a task-agnostic, multi-domain benchmark designed to evaluate hallucination detection in realistic, diverse contexts.
## Why HalluMix?
Traditional benchmarks fall short because they rarely capture the complexity of real-world scenarios, where multi-sentence outputs must be evaluated against multi-document contexts. `HalluMix` addresses this limitation by including examples from various domains (healthcare, law, science, and news) and multiple tasks (summarization, question answering, natural language inference). Each example in `HalluMix` contains:
- **Documents:** Context represented as a list of shuffled text chunks (e.g., tokenized sentences or paragraph blocks), mixed with randomly selected chunks from unrelated documents. This mimics real-world Retrieval-Augmented Generation (RAG) scenarios.
- **Answer:** The hypothesis to be evaluated, such as a summary sentence, answer, or claim.
- **Hallucination Label:** A binary indicator marking whether the response contains a hallucination.
- **Source Identifier:** A label for the original dataset for provenance tracking.
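As a rough illustration of the four fields listed above, a single record could be modeled as follows. This is only a sketch: the field names, the 0/1 label encoding, and the `source_identifier` value are assumptions made for the example, and the document text is abridged from Table 2 in the appendix.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HalluMixExample:
    """One benchmark record. Field names are assumed for illustration."""

    documents: List[str]       # shuffled context chunks, possibly with distractors
    answer: str                # hypothesis to check against the documents
    hallucination_label: int   # assumed encoding: 1 = hallucinated, 0 = faithful
    source_identifier: str     # name of the originating dataset


example = HalluMixExample(
    documents=[
        "God Hates Us All is the eighth studio album by American thrash metal band Slayer.",
        "Final Fantasy is a Japanese science fantasy anthology media franchise "
        "created by Hironobu Sakaguchi.",
    ],
    answer="Final Fantasy was created by Hironobu Sakaguchi",
    hallucination_label=0,
    source_identifier="example-source",  # placeholder, not a real sub-source name
)
```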
To closely simulate retrieval noise encountered in practical applications, `HalluMix` introduces distractors into the context of faithful examples, increasing evaluation complexity without compromising data validity.
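One way to picture the distractor step is the sketch below, which mixes the relevant chunks with chunks sampled from unrelated documents and then shuffles the result. The function name `build_context`, its parameters, and the fixed seed are hypothetical; the paper's actual sampling procedure may differ.

```python
import random
from typing import List, Sequence


def build_context(
    relevant_chunks: Sequence[str],
    distractor_pool: Sequence[str],
    num_distractors: int,
    seed: int = 0,
) -> List[str]:
    """Mix relevant chunks with sampled distractors, then shuffle the result."""
    rng = random.Random(seed)
    # Never request more distractors than the pool holds.
    k = min(num_distractors, len(distractor_pool))
    distractors = rng.sample(list(distractor_pool), k)
    context = list(relevant_chunks) + distractors
    rng.shuffle(context)
    return context


context = build_context(
    relevant_chunks=["The first game in the series was released in 1987."],
    distractor_pool=[
        "boy pushing wagon with two pumpkins in it",
        "God Hates Us All is the eighth studio album by Slayer.",
        "Egyptian society began to grow rapidly shortly after 3600 BC.",
    ],
    num_distractors=2,
)
```

Because the distractors are unrelated to the answer, a faithful answer stays faithful after mixing, which is why the label does not need to change.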
## Building HalluMix
`HalluMix` integrates high-quality human-curated datasets through careful transformations:
- **Natural Language Inference (NLI)** datasets ([sentence-transformers/all-nli](https://huggingface.co/datasets/sentence-transformers/all-nli), [stanfordnlp/snli](https://huggingface.co/datasets/stanfordnlp/snli), [snli-hard](https://huggingface.co/datasets/au123/snli-hard), [GLUE: mnli, rte, wnli](https://huggingface.co/datasets/nyu-mll/glue)) were adapted by mapping "entailment" labels to faithful and "neutral" or "contradiction" labels to hallucinated.
- **Summarization** datasets ([sentence-transformers/altlex](https://huggingface.co/datasets/sentence-transformers/altlex), [CNN/DailyMail](https://huggingface.co/datasets/abisee/cnn_dailymail), [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum), [XSum](https://huggingface.co/datasets/EdinburghNLP/xsum), [arXiv summarization](https://huggingface.co/datasets/ccdv/arxiv-summarization), [GovReport summarization](https://huggingface.co/datasets/ccdv/govreport-summarization), [PubMed summarization](https://huggingface.co/datasets/ccdv/pubmed-summarization)) were transformed by mismatching summaries with unrelated documents to generate hallucinated instances.
- **Question Answering (QA)** datasets ([SQuAD-v2](https://huggingface.co/datasets/rajpurkar/squad_v2), [DROP](https://huggingface.co/datasets/ucinlp/drop), [Databricks-Dolly-15K](https://huggingface.co/datasets/databricks/databricks-dolly-15k), [PubMedQA](https://huggingface.co/datasets/qiaojin/PubMedQA), [NarrativeQA](https://huggingface.co/datasets/deepmind/narrativeqa)) were transformed by introducing context-answer mismatches, adding LLM-generated plausible but incorrect answers, and converting single-word answers into declarative sentences to ensure realism.
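
The NLI and summarization transformations described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the 0/1 label encoding, the function names, and the rotation-based pairing are all hypothetical, and the sketch assumes that summaries of different documents are in fact unrelated.

```python
import random
from typing import Dict, List, Tuple

FAITHFUL, HALLUCINATED = 0, 1  # assumed label encoding


def nli_to_hallucination_label(nli_label: str) -> int:
    """Map an NLI label to a binary hallucination label."""
    label = nli_label.strip().lower()
    if label == "entailment":
        return FAITHFUL
    if label in {"neutral", "contradiction"}:
        return HALLUCINATED
    raise ValueError(f"unknown NLI label: {nli_label!r}")


def mismatch_summaries(pairs: List[Tuple[str, str]], seed: int = 0) -> List[Dict]:
    """Pair every document with the summary of a different document.

    Rotating the summaries by a non-zero offset guarantees that no
    document keeps its own summary.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two (document, summary) pairs")
    rng = random.Random(seed)
    offset = rng.randrange(1, len(pairs))
    examples = []
    for i, (document, _own_summary) in enumerate(pairs):
        _other_doc, other_summary = pairs[(i + offset) % len(pairs)]
        examples.append(
            {
                "documents": [document],
                "answer": other_summary,
                "hallucination_label": HALLUCINATED,
            }
        )
    return examples


pairs = [("doc A", "summary A"), ("doc B", "summary B"), ("doc C", "summary C")]
mismatched = mismatch_summaries(pairs)
```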

This rigorous methodology resulted in a balanced, diverse dataset of 6,500 examples across multiple tasks and domains, enabling broad and robust evaluation.
## Evaluating Detection Systems with HalluMix
Using `HalluMix`, we evaluated seven leading hallucination detection systems, both open- and closed-source, revealing significant insights:
- **Quotient Detections** achieved the best overall performance (Accuracy: 0.82, F1 score: 0.84), showing balanced precision and recall.
- [**Azure Groundedness**](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/quickstart-groundedness?tabs=curl&pivots=programming-language-foundry-portal) demonstrated high precision but lower recall, whereas [**Ragas Faithfulness**](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/) had high recall at the expense of precision.

- System performance varied notably with content length and task type. Models fine-tuned on long contexts (e.g., [Patronus Lynx 8B](https://huggingface.co/PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct)) excelled in summarization tasks but faltered on shorter NLI or QA tasks. Conversely, sentence-based detectors (Quotient Detections and [Bespoke-MiniCheck-7B](https://huggingface.co/bespokelabs/Bespoke-MiniCheck-7B)) performed exceptionally on short contexts but struggled with long-form content.
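
For readers less familiar with the metrics quoted above, the sketch below computes accuracy, precision, recall, and F1 for binary hallucination labels. It assumes that "hallucinated" is the positive class and is encoded as 1; the toy labels are invented for the example and are not benchmark results.

```python
from typing import Dict, Sequence


def detection_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Binary metrics with 1 = hallucinated (positive class), 0 = faithful."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true) if len(y_true) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, not benchmark data.
metrics = detection_metrics(y_true=[1, 1, 1, 0, 0, 0], y_pred=[1, 1, 0, 1, 0, 0])
```

A high-precision, low-recall detector rarely flags faithful answers but misses some hallucinations; a high-recall, low-precision detector does the opposite.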

## Key Findings and Implications
Our analysis highlighted several critical takeaways:
- **Sub-source Overfitting:** Some detection systems appear overly tuned to specific datasets, indicating limited generalizability.
- **Content-Length Challenges:** Effective hallucination detection heavily depends on handling context length and preserving inter-sentence coherence.
- **Architectural Trade-offs:** Continuous-context methods offer strong performance on longer texts, whereas sentence-level methods excel at precise short-context detection but lose context in longer documents.
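One simple way to look for the sub-source overfitting mentioned above is to break accuracy down by source identifier and compare the per-source numbers. The sketch below assumes each example carries a source label alongside the true and predicted labels; the source names and values are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Sequence


def accuracy_by_source(
    sources: Sequence[str], y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[str, float]:
    """Compute accuracy separately for each source dataset."""
    if not (len(sources) == len(y_true) == len(y_pred)):
        raise ValueError("all inputs must have the same length")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for source, truth, pred in zip(sources, y_true, y_pred):
        total[source] += 1
        correct[source] += int(truth == pred)
    return {source: correct[source] / total[source] for source in total}


# Hypothetical sources and labels, for illustration only.
per_source = accuracy_by_source(
    sources=["source_a", "source_a", "source_b", "source_b"],
    y_true=[1, 0, 1, 0],
    y_pred=[1, 0, 0, 1],
)
```

A detector that scores far higher on one source than on the others may have been tuned to that source rather than to the task.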
## Toward Robust, Real-World Detection
Future research must focus on combining the strengths of both approaches—perhaps through hierarchical or sliding-window contexts—to ensure reliable detection across various input formats and lengths. By openly releasing `HalluMix`, we hope to encourage further innovation in creating robust hallucination detection tools, critical for deploying trustworthy LLM applications.
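As one possible reading of the sliding-window idea, the sketch below groups response sentences into overlapping windows and flags the response if any window is flagged. Everything here is an assumption made for illustration: the `detector` callable stands in for whatever hallucination detector is being wrapped, and the window size, stride, and "any window" aggregation are arbitrary choices.

```python
from typing import Callable, List, Sequence

# Hypothetical detector signature: (text, documents) -> True if hallucinated.
Detector = Callable[[str, Sequence[str]], bool]


def sliding_windows(sentences: Sequence[str], window_size: int, stride: int) -> List[List[str]]:
    """Split sentences into overlapping windows; the last window may be shorter."""
    if window_size < 1 or stride < 1:
        raise ValueError("window_size and stride must be positive")
    if len(sentences) <= window_size:
        return [list(sentences)]
    windows = []
    start = 0
    while True:
        windows.append(list(sentences[start:start + window_size]))
        if start + window_size >= len(sentences):
            break
        start += stride
    return windows


def detect_with_windows(
    response_sentences: Sequence[str],
    documents: Sequence[str],
    detector: Detector,
    window_size: int = 3,
    stride: int = 1,
) -> bool:
    """Flag the response if the detector flags any window of sentences."""
    return any(
        detector(" ".join(window), documents)
        for window in sliding_windows(response_sentences, window_size, stride)
    )


def toy_detector(text: str, documents: Sequence[str]) -> bool:
    """Stand-in detector: flags text containing the word 'unsupported'."""
    return "unsupported" in text
```

Keeping the stride at or below the window size means every sentence appears in at least one window together with some of its neighbors, which preserves part of the inter-sentence context that purely sentence-level checks lose.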
With `HalluMix`, we're taking an essential step toward addressing one of AI's most pressing challenges—ensuring factual correctness and trustworthiness in practical deployments.
## Citation
If you find HalluMix useful, please consider citing our paper:
```
@article{emery2025hallumix,
title={HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection},
author={Deanna Emery and Michael Goitia and Freddie Vargus and Iulia Neagu},
year={2025},
journal={arXiv preprint arXiv:2505.00506},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.00506},
}
```
## Appendix
**Table 1: Example of a hallucinated response in HalluMix**
| | |
|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Documents** | • Due to the Steelers’ loss to the Ravens the previous day, the Bengals entered the game as the AFC North champions. The Bengals rushed out to a 14-0 lead in the first half on a McCarron touchdown pass and a Mohamed Sanu rush, but Denver cut the deficit to 11 points as Brandon McManus nailed a short 23-yard field goal with just 18 seconds remaining before halftime. In the second half, momentum shifted mightily after a missed field goal by Mike Nugent in the third. Emmanuel Sanders hauled in an 8-yard pass from Brock Osweiler to cut the deficit to 14-10, and Denver claimed the lead for the first time in the game on a 39-yard touchdown run by C.J. Anderson with 11:17 remaining in the 4th Quarter. The Bengals marched down the field to tie the game on Mike Nugent’s season-long 52-yard field goal, making the score 17-17 at the end of regulation. The tired Bengals failed to put any points on the board in the extra period, allowing a 37-yard McManus field goal to make the score 20-17 Denver. A botched snap on the ensuing Bengals drive was recovered by the Broncos, ending the game and Cincinnati’s hopes for a first-round bye in the playoffs. With the loss, the Bengals fell to 11-4 on the season. The loss was also the 10th straight in Denver for the Bengals, dating back to 1975. |
| **Response** | The first field goal was by the Ravens. |
| **Label** | Hallucinated |

**Table 2: Example of a faithful response in HalluMix**
| | |
|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Documents** | • Final Fantasy is a Japanese science fantasy anthology media franchise created by Hironobu Sakaguchi and developed and owned by Square Enix (formerly Square).• Peter Wright, a law supervisor for the DNR, told WLUC-TV that the officer was just doing his job. He said the officer believed it was a feral pig, since it had no identifying marks to distinguish him as a pet. ’I want to make it very clear that it’s never ever, ever the department’s position that we want to shoot people’s pets,’ said Wright. ’If he had any inkling it was a pet, he absolutely wouldn’t have shot it.’ Upsetting: The family are now trying to get Caesar’s body in order to bury him, but have been told they can only take possession of his ashes . Brandy Savelle and Tony Gervasi are now trying to get Caesar’s body back. However they have been told they can only take possession of ashes. Ms Savelle is demanding that some sort of recourse comes out of the situation. ’If it was that big of a mistake then we would like to see better training,’ she said. ’Let’s learn to identify not just pigs, but all pets.’• God Hates Us All is the eighth studio album by American thrash metal band Slayer .• that’s right that’s exactly right so but a lot of more women are starting their own businesses i’ve noticed than• The franchise centers on a series of fantasy and science fantasy role-playing video games. The first game in the series was released in 1987, with 15 numbered main entries having been released to date.• Shortly after 3600 BC Egyptian society began to grow and advance rapidly toward refined civilization .• boy pushing wagon with two pumpkins in it |
| **Response** | Final Fantasy was created by Hironobu Sakaguchi |
| **Label** | Faithful |
Provider: maas
Created: 2025-07-28



