tdiggelm/climate_fever
Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.
Download link:
https://hf-mirror.com/datasets/tdiggelm/climate_fever
Resource description:
---
annotations_creators:
- crowdsourced
- expert-generated
language_creators:
- found
language:
- en
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- extended|wikipedia
- original
task_categories:
- text-classification
- text-retrieval
task_ids:
- text-scoring
- fact-checking
- fact-checking-retrieval
- semantic-similarity-scoring
- multi-input-text-classification
paperswithcode_id: climate-fever
pretty_name: ClimateFever
dataset_info:
  features:
  - name: claim_id
    dtype: string
  - name: claim
    dtype: string
  - name: claim_label
    dtype:
      class_label:
        names:
          '0': SUPPORTS
          '1': REFUTES
          '2': NOT_ENOUGH_INFO
          '3': DISPUTED
  - name: evidences
    list:
    - name: evidence_id
      dtype: string
    - name: evidence_label
      dtype:
        class_label:
          names:
            '0': SUPPORTS
            '1': REFUTES
            '2': NOT_ENOUGH_INFO
    - name: article
      dtype: string
    - name: evidence
      dtype: string
    - name: entropy
      dtype: float32
    - name: votes
      list: string
  splits:
  - name: test
    num_bytes: 2429240
    num_examples: 1535
  download_size: 868947
  dataset_size: 2429240
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
---
# Dataset Card for ClimateFever
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [CLIMATE-FEVER homepage](http://climatefever.ai)
- **Repository:** [CLIMATE-FEVER repository](https://github.com/tdiggelm/climate-fever-dataset)
- **Paper:** [CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims](https://arxiv.org/abs/2012.00614)
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Thomas Diggelmann](mailto:thomasdi@student.ethz.ch)
### Dataset Summary
A dataset that adopts the FEVER methodology and consists of 1,535 real-world claims about climate change collected from the internet. Each claim is accompanied by five manually annotated evidence sentences retrieved from the English Wikipedia that support, refute, or do not give enough information to validate the claim, for a total of 7,675 claim-evidence pairs. The dataset features challenging claims that relate to multiple facets, as well as disputed claims for which both supporting and refuting evidence is present.
### Supported Tasks and Leaderboards
[Needs More Information]
### Languages
The text in the dataset is in English, as found in real-world claims about climate change on the internet. The associated BCP-47 code is `en`.
## Dataset Structure
### Data Instances
```
{
  "claim_id": "0",
  "claim": "Global warming is driving polar bears toward extinction",
  "claim_label": 0, # "SUPPORTS"
  "evidences": [
    {
      "evidence_id": "Extinction risk from global warming:170",
      "evidence_label": 2, # "NOT_ENOUGH_INFO"
      "article": "Extinction risk from global warming",
      "evidence": "\"Recent Research Shows Human Activity Driving Earth Towards Global Extinction Event\".",
      "entropy": 0.6931471805599453,
      "votes": [
        "SUPPORTS",
        "NOT_ENOUGH_INFO",
        null,
        null,
        null
      ]
    },
    {
      "evidence_id": "Global warming:14",
      "evidence_label": 0, # "SUPPORTS"
      "article": "Global warming",
      "evidence": "Environmental impacts include the extinction or relocation of many species as their ecosystems change, most immediately the environments of coral reefs, mountains, and the Arctic.",
      "entropy": 0.0,
      "votes": [
        "SUPPORTS",
        "SUPPORTS",
        null,
        null,
        null
      ]
    },
    {
      "evidence_id": "Global warming:178",
      "evidence_label": 2, # "NOT_ENOUGH_INFO"
      "article": "Global warming",
      "evidence": "Rising temperatures push bees to their physiological limits, and could cause the extinction of bee populations.",
      "entropy": 0.6931471805599453,
      "votes": [
        "SUPPORTS",
        "NOT_ENOUGH_INFO",
        null,
        null,
        null
      ]
    },
    {
      "evidence_id": "Habitat destruction:61",
      "evidence_label": 0, # "SUPPORTS"
      "article": "Habitat destruction",
      "evidence": "Rising global temperatures, caused by the greenhouse effect, contribute to habitat destruction, endangering various species, such as the polar bear.",
      "entropy": 0.0,
      "votes": [
        "SUPPORTS",
        "SUPPORTS",
        null,
        null,
        null
      ]
    },
    {
      "evidence_id": "Polar bear:1328",
      "evidence_label": 2, # "NOT_ENOUGH_INFO"
      "article": "Polar bear",
      "evidence": "\"Bear hunting caught in global warming debate\".",
      "entropy": 0.6931471805599453,
      "votes": [
        "SUPPORTS",
        "NOT_ENOUGH_INFO",
        null,
        null,
        null
      ]
    }
  ]
}
```
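The integer labels in the instance above can be mapped back to their names with a small lookup. The following is a minimal sketch that operates on a plain dictionary shaped like the example; the helper name `decode_instance` and the trimmed `instance` are illustrative assumptions, not part of the dataset's tooling.

```python
# Label names in the order given by the class_label metadata of this card.
CLAIM_LABELS = ["SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO", "DISPUTED"]
EVIDENCE_LABELS = ["SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO"]


def decode_instance(instance):
    """Return a copy of an instance with integer labels replaced by names (hypothetical helper)."""
    decoded = dict(instance)
    decoded["claim_label"] = CLAIM_LABELS[instance["claim_label"]]
    decoded["evidences"] = [
        {**ev, "evidence_label": EVIDENCE_LABELS[ev["evidence_label"]]}
        for ev in instance["evidences"]
    ]
    return decoded


# Trimmed version of the instance shown above (two of the five evidences).
instance = {
    "claim_id": "0",
    "claim": "Global warming is driving polar bears toward extinction",
    "claim_label": 0,
    "evidences": [
        {"evidence_id": "Extinction risk from global warming:170", "evidence_label": 2},
        {"evidence_id": "Global warming:14", "evidence_label": 0},
    ],
}

decoded = decode_instance(instance)
print(decoded["claim_label"])
print([ev["evidence_label"] for ev in decoded["evidences"]])
```

The helper builds new dictionaries rather than mutating its input, so the original integer-coded instance stays usable for model training.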
### Data Fields
- `claim_id`: a `string` feature, unique claim identifier.
- `claim`: a `string` feature, claim text.
- `claim_label`: an `int` feature, overall label assigned to the claim (based on evidence majority vote). The labels correspond to 0: "supports", 1: "refutes", 2: "not enough info" and 3: "disputed".
- `evidences`: a list of evidences with fields:
  - `evidence_id`: a `string` feature, unique evidence identifier.
  - `evidence_label`: an `int` feature, micro-verdict label. The labels correspond to 0: "supports", 1: "refutes" and 2: "not enough info".
  - `article`: a `string` feature, title of the source article (Wikipedia page).
  - `evidence`: a `string` feature, evidence sentence.
  - `entropy`: a `float32` feature, entropy reflecting the uncertainty of `evidence_label`.
  - `votes`: a `list` of `string` features, corresponding to the individual votes.
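The `entropy` values in the example instance are consistent with the Shannon entropy, in nats, of the non-null votes: two votes split across two labels give ln 2 ≈ 0.693, and two agreeing votes give 0. The card does not state the formula, so the sketch below is an assumption that reproduces those two values; `vote_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Shannon entropy (natural log) of the non-null votes; an assumed formula, not confirmed by the card."""
    cast = [v for v in votes if v is not None]
    if not cast:
        return 0.0
    counts = Counter(cast)
    total = len(cast)
    return sum(-(n / total) * math.log(n / total) for n in counts.values())


# Vote lists copied from the example instance.
split_votes = ["SUPPORTS", "NOT_ENOUGH_INFO", None, None, None]
agreeing_votes = ["SUPPORTS", "SUPPORTS", None, None, None]

print(vote_entropy(split_votes))
print(vote_entropy(agreeing_votes))
```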
### Data Splits
This benchmark dataset currently has a single data split, `test`, which contains 1,535 claims or 7,675 claim-evidence pairs.
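The pair count follows from each claim carrying five evidence sentences: 1,535 × 5 = 7,675. A minimal sketch of flattening claims into claim-evidence pairs is shown below; `iter_pairs` is a hypothetical helper, and the toy `rows` only stand in for the real split.

```python
def iter_pairs(rows):
    """Yield one (claim, evidence sentence, evidence_label) tuple per evidence (hypothetical helper)."""
    for row in rows:
        for ev in row["evidences"]:
            yield row["claim"], ev["evidence"], ev["evidence_label"]


# Toy rows standing in for the real split: each claim carries five evidences.
rows = [
    {
        "claim": f"claim {i}",
        "evidences": [{"evidence": f"sentence {j}", "evidence_label": 2} for j in range(5)],
    }
    for i in range(3)
]

pairs = list(iter_pairs(rows))
print(len(pairs))  # three toy claims times five evidences each
print(1535 * 5)    # the same arithmetic applied to the full split
```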
## Dataset Creation
### Curation Rationale
[Needs More Information]
### Source Data
#### Initial Data Collection and Normalization
[Needs More Information]
#### Who are the source language producers?
[Needs More Information]
### Annotations
#### Annotation process
[Needs More Information]
#### Who are the annotators?
[Needs More Information]
### Personal and Sensitive Information
[Needs More Information]
## Considerations for Using the Data
### Social Impact of Dataset
[Needs More Information]
### Discussion of Biases
[Needs More Information]
### Other Known Limitations
[Needs More Information]
## Additional Information
### Dataset Curators
[Needs More Information]
### Licensing Information
[Needs More Information]
### Citation Information
```bibtex
@misc{diggelmann2020climatefever,
title={CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims},
author={Thomas Diggelmann and Jordan Boyd-Graber and Jannis Bulian and Massimiliano Ciaramita and Markus Leippold},
year={2020},
eprint={2012.00614},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Contributions
Thanks to [@tdiggelm](https://github.com/tdiggelm) for adding this dataset.
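Provider:
tdiggelm
Summary of Original Information
Dataset Overview
Name: ClimateFever
Language: English (en)
License: unknown
Multilinguality: monolingual
Size: 1K<n<10K
Source datasets: extended from Wikipedia, original
Task categories: text classification, text retrieval
Task IDs: text-scoring, fact-checking, fact-checking-retrieval, semantic-similarity-scoring, multi-input-text-classification
Papers with Code ID: climate-fever
Dataset size:
- Download size: 868947 bytes
- Dataset size: 2429240 bytes
Dataset Structure
Data Instances
Each instance contains the following fields:
- claim_id: string, unique claim identifier
- claim: string, claim text
- claim_label: integer, claim label (0: SUPPORTS, 1: REFUTES, 2: NOT_ENOUGH_INFO, 3: DISPUTED)
- evidences: list, where each evidence contains:
  - evidence_id: string, unique evidence identifier
  - evidence_label: integer, evidence label (0: SUPPORTS, 1: REFUTES, 2: NOT_ENOUGH_INFO)
  - article: string, title of the source article
  - evidence: string, evidence sentence
  - entropy: float, entropy reflecting the uncertainty of evidence_label
  - votes: list of strings, the individual votes
Data Splits
test: 1,535 claims or 7,675 claim-evidence pairs
Dataset Creation
Annotations
- Creators: crowdsourced, expert-generated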
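Collected Summary
Dataset Introduction

Construction
In climate science, building a dataset means balancing authenticity and complexity. ClimateFever adopts the FEVER methodology and collects 1,535 real-world claims about climate change from the internet. For each claim, five relevant evidence sentences were retrieved from the English Wikipedia and manually annotated, giving 7,675 claim-evidence pairs in total. Annotation combined crowdsourcing with expert input to make the evidence labels more reliable. The labels cover supports, refutes, and not enough information, which gives climate claim verification a structured data foundation.
Features
The dataset's value for natural language processing lies in the diversity of its claims and the complexity of its evidence. The claims touch on multiple facets of climate change and include disputed cases, in which the same claim has both supporting and refuting evidence. Each of the five evidence sentences attached to a claim carries an entropy value that reflects the uncertainty in the annotation, which adds to the dataset's rigor. This multi-label, multi-evidence design provides a challenging testbed for automatic verification models that must handle real-world ambiguity and contradiction.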
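The disputed case described above can be made concrete with a small aggregation rule. The sketch below assumes the following rule: a claim is DISPUTED when its evidences include both SUPPORTS and REFUTES, takes whichever of the two is present otherwise, and falls back to NOT_ENOUGH_INFO. The dataset card only says that the claim label is based on an evidence majority vote, so this rule is an assumption; it does agree with the example instance, where two SUPPORTS and three NOT_ENOUGH_INFO evidences yield a SUPPORTS claim. The function name `aggregate_claim_label` is hypothetical.

```python
def aggregate_claim_label(evidence_labels):
    """Derive a claim-level label from evidence-level label names (assumed rule, see the text above)."""
    has_support = "SUPPORTS" in evidence_labels
    has_refute = "REFUTES" in evidence_labels
    if has_support and has_refute:
        return "DISPUTED"
    if has_support:
        return "SUPPORTS"
    if has_refute:
        return "REFUTES"
    return "NOT_ENOUGH_INFO"


# Evidence labels of the example instance from the dataset card.
example_labels = [
    "NOT_ENOUGH_INFO",
    "SUPPORTS",
    "NOT_ENOUGH_INFO",
    "SUPPORTS",
    "NOT_ENOUGH_INFO",
]
print(aggregate_claim_label(example_labels))
print(aggregate_claim_label(["SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO"]))
```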
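Usage
As a benchmark dataset, ClimateFever is mainly used for tasks such as fact-checking and text retrieval. Researchers can load its test split directly and use the claim text, evidence sentences, and corresponding labels to train or evaluate models for claim verification and evidence retrieval. The votes and entropy attached to each evidence can be used to analyze how models decide under uncertainty. The dataset typically serves research on multi-input text classification or semantic similarity scoring, and supports the development and evaluation of algorithms for climate information verification.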
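A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and that the repository id `tdiggelm/climate_fever` from the page title resolves on the Hub. The helper names `load_test_split` and `claim_label_distribution` are hypothetical, and the toy rows exist only so the example runs offline.

```python
from collections import Counter

# Claim label names in the order given by the dataset card.
CLAIM_LABELS = ["SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO", "DISPUTED"]


def load_test_split():
    """Load the single `test` split from the Hub (needs network access and `pip install datasets`)."""
    from datasets import load_dataset  # imported lazily so the helper below works without it
    return load_dataset("tdiggelm/climate_fever", split="test")


def claim_label_distribution(rows):
    """Count claims per label name; `rows` is any iterable of dicts with an integer `claim_label`."""
    return Counter(CLAIM_LABELS[row["claim_label"]] for row in rows)


# Toy rows so the example runs offline; pass load_test_split() instead for the real data.
toy_rows = [{"claim_label": 0}, {"claim_label": 3}, {"claim_label": 0}]
distribution = claim_label_distribution(toy_rows)
print(distribution)
```

Because the dataset ships only a `test` split, any training use requires the reader to define their own partition of these rows.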
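Background and Challenges
Background Overview
As climate change became a growing focus of global scientific and public attention, Thomas Diggelmann, Jordan Boyd-Graber, and other researchers built the ClimateFever dataset in 2020. The dataset responds to the pressing need to verify climate science information: it collects real climate claims from the internet and, following the FEVER methodology, pairs each claim with annotated evidence from Wikipedia, creating a benchmark resource dedicated to fact-checking climate claims. Its core research question is how natural language processing can automatically assess the scientific accuracy of climate-related claims and so counter the spread of misinformation. The work has had notable cross-disciplinary influence on computational social science and climate communication.
Current Challenges
The climate claim verification task that ClimateFever targets is challenging in itself. Climate claims often involve complex, multi-faceted scientific evidence and uncertainty, so models need strong scientific reasoning and the ability to integrate information across documents. The dataset also contains many claims with insufficient or disputed evidence, which requires systems to recognize the limits of the available information and to handle ambiguity. On the construction side, the challenges come from the complexity of data collection and annotation: selecting representative and diverse real-world climate claims from the vast amount of material on the internet, and relying on experts and crowdworkers for fine-grained evidence retrieval and labeling. That process is costly, has to deal with inconsistencies caused by annotator subjectivity, and inherits the coverage gaps and timeliness bias of using Wikipedia as the sole evidence source.
Common Use Cases
Typical Use Case
In climate change information verification, ClimateFever serves as a key benchmark for automated fact-checking systems. By collecting real claims about climate change from the internet and pairing them with evidence sentences manually retrieved from Wikipedia, the dataset establishes multi-label links between claims and evidence. This structure lets researchers train and evaluate a model's reasoning in complex contexts, especially for fine-grained judgments such as supports, refutes, or not enough information. The disputed cases further test how robustly a model handles contradictory evidence.
Derived Work
A body of research has grown around ClimateFever. It focuses mainly on improving joint model architectures for evidence retrieval and claim verification, for example by introducing graph neural networks to model relations among evidence sentences, or by using pretrained language models for cross-sentence reasoning. Some work further explores how to model the dataset's uncertainty labels (such as "not enough info" and "disputed"), advancing research on the robustness and interpretability of fact-checking systems in open-domain, multi-evidence settings.
Recent Research on the Dataset
Latest Research Directions
In climate change information verification, ClimateFever is steering current research toward multi-evidence fusion and the automated analysis of disputed claims. By pairing real-world climate claims with Wikipedia evidence, the dataset gives natural language processing models a testbed for complex semantic reasoning. Current work concentrates on neural architectures that can handle conflicting evidence and insufficient information, in order to make fact-checking systems more robust on environmental science topics. As climate misinformation receives growing attention worldwide, the dataset has become a key benchmark for evaluating cross-document reasoning and uncertainty quantification, and it supports efforts to build a trustworthy climate information ecosystem.
The content above was collected and summarized by 遇见数据集.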



