M1STERPERFECT/TruthfulQA
Source: Hugging Face. Updated 2026-04-30. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/M1STERPERFECT/TruthfulQA
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- en
license:
- apache-2.0
multilinguality:
- monolingual
pretty_name: TruthfulQA
size_categories:
- n<1K
source_datasets:
- original
task_categories:
- question-answering
task_ids:
- extractive-qa
- open-domain-qa
- closed-domain-qa
---
# Dataset Card for TruthfulQA
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://github.com/sylinrl/TruthfulQA](https://github.com/sylinrl/TruthfulQA)
- **Repository:** [https://github.com/sylinrl/TruthfulQA](https://github.com/sylinrl/TruthfulQA)
- **Paper:** [https://arxiv.org/abs/2109.07958](https://arxiv.org/abs/2109.07958)
### Dataset Summary
TruthfulQA: Measuring How Models Mimic Human Falsehoods
We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human performance was 94%. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web.
### Supported Tasks and Leaderboards
See: [Tasks](https://github.com/sylinrl/TruthfulQA#tasks)
### Languages
English
## Dataset Structure
### Data Instances
The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics.
### Data Fields
1. **Type**: Whether the question is adversarial or non-adversarial
2. **Category**: Category of misleading question
3. **Question**: The question
4. **Best Answer**: The best correct answer
5. **Correct Answers**: A set of correct answers. Delimited by `;`.
6. **Incorrect Answers**: A set of incorrect answers. Delimited by `;`.
7. **Source**: A source that supports the correct answers.
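
Because **Correct Answers** and **Incorrect Answers** are stored as single `;`-delimited strings, they usually need to be split before use. The following is a minimal sketch of that step. The helper name `split_answers` and the sample row are hypothetical and invented for illustration; the sketch also assumes that no individual answer contains a `;` of its own.

```python
def split_answers(field: str) -> list[str]:
    """Split a `;`-delimited answer field into trimmed, non-empty answers."""
    return [part.strip() for part in field.split(";") if part.strip()]


# Hypothetical row using the field names listed above (values are invented).
row = {
    "Type": "Adversarial",
    "Category": "Health",
    "Question": "Does cracking your knuckles cause arthritis?",
    "Best Answer": "No, it does not cause arthritis",
    "Correct Answers": "No, it does not cause arthritis; There is no such link",
    "Incorrect Answers": "Yes, it causes arthritis; Yes, it wears out the joints",
    "Source": "https://example.com/placeholder-source",
}

correct = split_answers(row["Correct Answers"])
incorrect = split_answers(row["Incorrect Answers"])
print(correct)
print(incorrect)
```

Stripping whitespace and dropping empty pieces keeps a trailing `;` or inconsistent spacing from producing blank answers.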
### Data Splits
Due to constraints of Hugging Face, the entire dataset is loaded into a single "train" split.
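
A minimal sketch of loading that single split with the Hugging Face `datasets` library is shown below. It assumes the repository ID from the download link resolves on the Hub and that the columns match the names under Data Fields; neither is verified here. `load_rows`, `count_by`, and `main` are hypothetical helper names.

```python
from collections import Counter

# Assumption: repository ID taken from the download link above.
REPO_ID = "M1STERPERFECT/TruthfulQA"


def load_rows(repo_id: str = REPO_ID):
    """Load the single "train" split (needs the `datasets` package and network access)."""
    from datasets import load_dataset  # imported lazily so count_by works offline

    return load_dataset(repo_id, split="train")


def count_by(rows, column: str) -> Counter:
    """Count rows per value of `column`, for example "Category" or "Type"."""
    return Counter(row[column] for row in rows)


def main() -> None:
    rows = load_rows()
    print(len(rows))
    print(count_by(rows, "Category").most_common(5))
```

Calling `main()` downloads the data and prints the row count and the five most frequent categories. `count_by` accepts any iterable of dict-like rows, so it can be tried on a small hand-written list first.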
### Contributions
Thanks to [@sylinrl](https://github.com/sylinrl) for adding this dataset.
Provider:
M1STERPERFECT
Collected Summary
Dataset Introduction

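Construction
The TruthfulQA dataset was constructed by experts to evaluate the truthfulness and accuracy of language models when generating answers. It contains 817 questions spanning 38 categories, including health, law, finance and politics, and each question is built around a situation that tends to elicit false answers from humans. The data fields cover the question type (adversarial or non-adversarial), category, question text, best correct answer, set of correct answers, set of incorrect answers, and a source supporting the correct answers. The data is provided on Hugging Face as a single train split, so researchers can load and evaluate it directly.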
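Features
The core feature of TruthfulQA is its adversarial design, which targets the common fallacies and misconceptions that language models may pick up when imitating human text. Whereas performance on other NLP tasks tends to improve with model size, TruthfulQA reveals a counterintuitive pattern: larger models tend to generate more false answers, which highlights the limits of scaling up parameters alone as a way to improve truthfulness. The benchmark stresses that models need abilities beyond simple imitation to avoid reproducing widely held human false beliefs.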
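Usage
To evaluate a model with TruthfulQA, researchers load the 817 questions and have the target model generate a free-form answer to each one. The model outputs are then compared against the dataset's annotated sets of correct answers to compute a truthfulness score. The dataset supports several task forms, including extractive QA, open-domain QA and closed-domain QA, and researchers can choose the evaluation dimension that fits their needs. Comparing results against public leaderboards is recommended to gauge a model's relative performance on adversarial truthfulness questions.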
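
The comparison step can be sketched with a toy similarity measure. The code below is not the benchmark's official metric: it substitutes a simple Jaccard token overlap as a stand-in and counts an answer as truthful when it is closer to some correct reference than to any incorrect one. All function names are hypothetical, and the example answers are invented.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two strings (0.0 if either is empty)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_closer_to_correct(answer: str, correct: list[str], incorrect: list[str]) -> bool:
    """True if the best match among correct references beats the best incorrect match."""
    best_correct = max((jaccard(answer, ref) for ref in correct), default=0.0)
    best_incorrect = max((jaccard(answer, ref) for ref in incorrect), default=0.0)
    return best_correct > best_incorrect


def truthfulness_rate(examples) -> float:
    """Fraction of (answer, correct_refs, incorrect_refs) triples judged truthful."""
    examples = list(examples)
    if not examples:
        return 0.0
    hits = sum(is_closer_to_correct(a, c, i) for a, c, i in examples)
    return hits / len(examples)


# Invented model answers scored against invented reference sets.
refs_true = ["Nothing in particular happens"]
refs_false = ["You will get arthritis"]
examples = [
    ("Nothing happens", refs_true, refs_false),
    ("You will get arthritis", refs_true, refs_false),
]
print(truthfulness_rate(examples))
```

Token overlap is crude: it misses paraphrases and can be fooled by negation, so in practice a stronger similarity measure or human judgment would take its place.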
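Background and Challenges
Background Overview
TruthfulQA was created in 2021 by researchers Stephanie Lin, Jacob Hilton and Owain Evans to evaluate the truthfulness of large language models when generating answers. As models such as GPT-3 have grown in scale, they perform well on many NLP tasks yet often output answers that sound plausible but are false; these errors frequently stem from imitating common human misconceptions in the training data. The dataset contains 817 carefully designed questions covering 38 categories, including health, law, finance and politics, each targeting an area where humans are prone to false beliefs. TruthfulQA filled a gap in the evaluation of language model truthfulness and revealed a negative relationship between model size and truthfulness. This challenged conventional assumptions and influenced AI safety research, prompting researchers to re-examine training objectives and evaluation schemes.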
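Current Challenges
The core challenge TruthfulQA addresses is truthfulness in question answering: models tend to imitate misinformation and popular misconceptions found in human text rather than generate factually correct answers. This is especially pronounced in large models such as GPT-3; the best model was truthful on only 58% of questions, far below human performance of 94%. Building the dataset involved several challenges. First, the questions had to be designed to elicit false answers from models while still having a clear correct answer for adults. Second, reliable, authoritative sources had to be collected to verify the answers. Finally, distinguishing whether a model gives a false response because of a false belief or because of linguistic habit remains a difficult point in the evaluation design.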
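Common Scenarios
Classic Use Cases
As a benchmark designed to evaluate the truthfulness of language models, TruthfulQA is mainly used to measure whether a model answering factual questions can avoid false answers that originate in human fallacies or misconceptions. The dataset comprises 817 questions spanning 38 categories, including health, law, finance and politics, which capture situations where humans may answer falsely because of widely held false beliefs. Researchers typically use TruthfulQA to test the truthfulness of models such as GPT-3, GPT-Neo/J, GPT-2 and a T5-based model, comparing generated answers against the benchmark's sets of correct and incorrect answers to quantify how much false information a model has picked up from imitating human text. This evaluation setup exposes notable shortfalls in factual accuracy and highlights the counterintuitive relationship between growing model size and truthfulness, giving later research a key testing framework.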
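Academic Problems Addressed
TruthfulQA addresses a long-neglected challenge in natural language processing: how to systematically measure and quantify the untruthfulness of text generated by language models. Model performance has traditionally been assessed with metrics such as accuracy and F1 score, which overlook answers that are fluent but factually wrong; this problem arises because models learn by imitating the false information and misconceptions that are abundant on the web. By introducing an adversarially designed question set, TruthfulQA captures false beliefs common in human cognition and thereby separates a model's factual knowledge from its ability to imitate language. The dataset exposed a limit of the scaling hypothesis, namely that larger models are often less truthful. This challenges mainstream research directions that rely on scale to improve performance and has encouraged alternatives such as fine-tuning with objectives other than imitation.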
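Derived and Related Work
The release of TruthfulQA prompted a range of follow-up research on language model truthfulness. Typical directions include deeper investigation of the roots of model truthfulness, such as studying how false information is distributed in training data and how it affects model behavior, and the development of training methods that go beyond simple imitation learning, for example truthfulness-oriented fine-tuning based on reinforcement learning or contrastive learning. Researchers have also built on TruthfulQA to propose new benchmarks for evaluating the consistency of a model's internal knowledge, such as work that compares question-answering results with the knowledge stored in the model's parameters. In model interpretability, TruthfulQA has motivated studies of neural activation patterns when models generate untruthful answers, which attempt to identify the key factors behind false outputs through attention mechanisms or neuron localization. Together, this work has enriched the theoretical understanding of language model truthfulness and informed more reliable training strategies and practices.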
The content above was collected and summarized by 遇见数据集.



