HaluBench
Source: ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-05-24.
Download link:
https://modelscope.cn/datasets/PatronusAI/HaluBench
Resource overview:
# Dataset Card for HaluBench
## Dataset Details
HaluBench is a hallucination evaluation benchmark of 15k samples that consists of Context-Question-Answer triplets annotated for whether the examples contain
hallucinations. Compared to prior datasets, HaluBench is the first open-source benchmark containing hallucination tasks sourced from
real-world domains that include finance and medicine.
We sourced examples from several existing QA datasets to build the hallucination evaluation benchmark. We constructed tuples of (question, context,
answer, label), where label is a binary score that denotes whether the answer contains a hallucination.
The examples are sourced from and constructed using existing datasets such as FinanceBench, PubmedQA, CovidQA, HaluEval, DROP and RAGTruth.
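
As a rough illustration of the tuple structure described above, one record might be modelled as in the sketch below. The class name, field names, boolean label encoding, and sample values are all assumptions made for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HaluRecord:
    """One (question, context, answer, label) tuple; all names are hypothetical."""
    question: str
    context: str
    answer: str
    hallucinated: bool  # assumed encoding: True if the answer contains a hallucination


def hallucination_rate(records):
    """Return the fraction of records whose answer is labelled as hallucinated."""
    if not records:
        raise ValueError("records must not be empty")
    return sum(r.hallucinated for r in records) / len(records)


# Made-up examples, not taken from HaluBench.
records = [
    HaluRecord("What colour is the sky?", "The sky is blue.", "Blue", False),
    HaluRecord("What colour is the sky?", "The sky is blue.", "Green", True),
]
print(hallucination_rate(records))
```

Keeping the label as a plain boolean makes class-balance checks like this one a single pass over the records.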
- **Curated by:** Patronus AI
- **Language(s) (NLP):** English
## Use
HaluBench can be used to evaluate hallucination detection models. [PatronusAI/Llama-3-Patronus-Lynx-70B-Instruct](https://huggingface.co/PatronusAI/Llama-3-Patronus-Lynx-70B-Instruct) outperforms GPT-4o, Claude-Sonnet and other open-source models on HaluBench.
[PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct](https://huggingface.co/PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct) is an 8B variant with only a ~3% gap to GPT-4o.
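
A minimal sketch of how a detector could be scored against binary hallucination labels is shown below. It assumes plain accuracy as the metric (the card does not name one), and the `accuracy` and `naive_detector` functions, the tuple order, and the sample data are hypothetical stand-ins rather than any official evaluation code.

```python
def accuracy(detector, examples):
    """Share of examples where the detector's verdict matches the gold label.

    `detector` is any callable (question, context, answer) -> bool that
    returns True when it judges the answer to be hallucinated.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(
        detector(q, c, a) == label for q, c, a, label in examples
    )
    return correct / len(examples)


def naive_detector(question, context, answer):
    """Toy baseline: flag the answer if it does not appear verbatim in the context."""
    return answer.lower() not in context.lower()


# Made-up examples in (question, context, answer, label) order;
# label True means the answer is hallucinated (assumed encoding).
examples = [
    ("Who wrote the report?", "The report was written by Ada.", "Ada", False),
    ("Who wrote the report?", "The report was written by Ada.", "Bob", True),
    ("When was it filed?", "It was filed in 2020.", "in 2021", True),
    ("Where was it filed?", "It was filed in Paris, France.", "France's capital", False),
]
print(accuracy(naive_detector, examples))
```

The last example is a faithful paraphrase that the substring baseline wrongly flags, which is the kind of case a learned detector is meant to handle better.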
## Dataset Card Contact
[@sunitha-ravi](https://huggingface.co/sunitha-ravi)
Provider:
maas
Created:
2025-05-20
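Dataset introduction
Background overview
HaluBench is a hallucination evaluation benchmark dataset of 15k samples, made up of Context-Question-Answer triplets annotated for whether they contain hallucinations. It is drawn from several existing QA datasets, covers real-world domains such as finance and medicine, and is the first open-source benchmark containing hallucination tasks from multiple domains.
Summary collected and generated by 遇见数据集.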



