slhenty/climate-fever-nli-stsb
Source: Hugging Face. Updated 2023-03-24; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/slhenty/climate-fever-nli-stsb
---
license: unknown
viewer: false
---
**==========================================**
**_IN PROGRESS - NOT READY FOR LOADING OR USE_**
**==========================================**
---
# Dataset Card for climate-fever-nli-stsb
## Dataset Description
- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
This is the CLIMATE-FEVER dataset, modified to supply NLI-style (**cf-nli**) or STSb-style (**cf-stsb**) features that SentenceBERT training scripts can use as drop-in replacements for the AllNLI and/or STSb datasets.
There are two NLI-style datasets: one derived only from SUPPORTS and REFUTES evidence (**cf-nli**), and one that also derives data from NOT_ENOUGH_INFO evidence based on the annotator votes (**cf-nli-nei**).
The feature style is specified as a named configuration when loading the dataset: `cf-nli`, `cf-nli-nei`, or `cf-stsb`. See the usage notes below for `load_dataset` examples.
### Usage
Load the **cf-nli** dataset
```python
# If `datasets` is not already in your environment, install it first:
#   pip install datasets
from datasets import load_dataset

# Load all splits...
dd = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-nli')
# ... or a specific split (only 'train' is available)
ds_train = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-nli', split='train')

# ds_train can now be injected into SentenceBERT training scripts at the point
# where individual sentence pairs are aggregated into
# {'claim': {'entailment': set(), 'contradiction': set(), 'neutral': set()}} dicts
# for further processing into training samples
```
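
The aggregation step mentioned in the comments above can be sketched as follows. This is a minimal illustration rather than the SentenceBERT script itself: the `rows` list is hypothetical stand-in data, the string labels are an assumption based on the annotation table in this card, and `aggregate_pairs` and `build_triplets` are illustrative helper names.

```python
from collections import defaultdict

# Hypothetical rows in the cf-nli layout (sentence1 = claim, sentence2 = evidence).
# String labels are an assumption; the real dataset may encode them differently.
rows = [
    {"sentence1": "Claim A", "sentence2": "Evidence that supports A", "label": "entailment"},
    {"sentence1": "Claim A", "sentence2": "Evidence that refutes A", "label": "contradiction"},
    {"sentence1": "Claim B", "sentence2": "Evidence that supports B", "label": "entailment"},
]


def aggregate_pairs(rows):
    """Group evidence sentences by claim and by NLI label."""
    grouped = defaultdict(
        lambda: {"entailment": set(), "contradiction": set(), "neutral": set()}
    )
    for row in rows:
        grouped[row["sentence1"]][row["label"]].add(row["sentence2"])
    return dict(grouped)


def build_triplets(grouped):
    """Emit (anchor, positive, negative) triplets for claims that have both
    entailment and contradiction evidence."""
    triplets = []
    for claim, by_label in grouped.items():
        for positive in sorted(by_label["entailment"]):
            for negative in sorted(by_label["contradiction"]):
                triplets.append((claim, positive, negative))
    return triplets


grouped = aggregate_pairs(rows)
triplets = build_triplets(grouped)
```

Claims without contradiction evidence (such as "Claim B" here) produce no triplets, which matches the idea that only claims with both kinds of evidence yield contrastive training samples.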
Load the **cf-nli-nei** dataset
```python
# If `datasets` is not already in your environment, install it first:
#   pip install datasets
from datasets import load_dataset

# Load all splits...
dd = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-nli-nei')
# ... or a specific split (only 'train' is available)
ds_train = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-nli-nei', split='train')

# ds_train can now be injected into SentenceBERT training scripts at the point
# where individual sentence pairs are aggregated into
# {'claim': {'entailment': set(), 'contradiction': set(), 'neutral': set()}} dicts
# for further processing into training samples
```
Load the **cf-stsb** dataset
```python
# If `datasets` is not already in your environment, install it first:
#   pip install datasets
from datasets import load_dataset

# Load all splits...
dd = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-stsb')
# ... or a specific split ('train', 'dev', and 'test' are available)
ds_dev = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-stsb', split='dev')

# ds_dev (or test) can now be injected into SentenceBERT training scripts at the point
# where individual sentence pairs are aggregated into
# a list of dev (or test) samples
```
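
As a rough sketch of the last step, dev or test rows could be split into parallel lists of sentences and scores, which is the general shape that similarity evaluators tend to expect. The `rows` below are hypothetical stand-ins for cf-stsb records, the assumption that scores lie in [-1, 1] comes from the score table later in this card, and `to_parallel_lists` is an illustrative helper name.

```python
# Hypothetical rows in the cf-stsb layout; scores are assumed to lie in [-1, 1].
rows = [
    {"sentence1": "Claim A", "sentence2": "Strongly supporting evidence", "score": 1.0},
    {"sentence1": "Claim A", "sentence2": "Inconclusive evidence", "score": 0.0},
    {"sentence1": "Claim B", "sentence2": "Strongly refuting evidence", "score": -1.0},
]


def to_parallel_lists(rows):
    """Split rows into (sentences1, sentences2, scores), validating the score range."""
    sentences1, sentences2, scores = [], [], []
    for row in rows:
        score = float(row["score"])
        if not -1.0 <= score <= 1.0:
            raise ValueError(f"score out of cosine range: {score}")
        sentences1.append(row["sentence1"])
        sentences2.append(row["sentence2"])
        scores.append(score)
    return sentences1, sentences2, scores


sentences1, sentences2, scores = to_parallel_lists(rows)
```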
<!--
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
[More Information Needed]
-->
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
[More Information Needed]
### Data Splits
[More Information Needed]
## Dataset Creation
### Curation Rationale
SentenceBERT models are designed for domain adaptation and/or fine-tuning using labeled data from the downstream task domain. Because SentenceBERT is a bi-encoder, its primary objective is real-valued similarity scoring. Training typically uses NLI-style features as input and STSb-style features for evaluation, both during training and afterwards to measure _intrinsic_ STSb performance. Classification tasks typically add a classifier network that accepts SentenceBERT encodings as input and is trained on class-labeled datasets.
To fine-tune a SentenceBERT model for the climate change domain, a labeled climate change dataset would therefore be ideal. Like the authors of the CLIMATE-FEVER dataset, we know of no other _labeled_ datasets specific to climate change. And while CLIMATE-FEVER is suitably labeled for classification tasks, it is not ready for similarity tuning in the style of SentenceBERT.
This modified CLIMATE-FEVER dataset attempts to fill that gap by deriving the NLI-style features typically used to pre-train and fine-tune a SentenceBERT model, along with the STSb-style features used to evaluate it.
### Source Data
#### Initial Data Collection and Normalization
see CLIMATE-FEVER
#### Who are the source language producers?
see CLIMATE-FEVER
<!--
### Annotations
-->
### Annotation process
#### **cf-nli**
For each Claim that has both SUPPORTS evidence and REFUTES evidence, create labeled pairs in the style of an NLI dataset:
| split | dataset | sentence1 | sentence2 | label |
|---|---|---|---|---|
| {'train', 'test'} | 'climate-fever' | claim | evidence | evidence_label SUPPORTS -> 'entailment', REFUTES -> 'contradiction' |
> Note that, by definition, only claims classified as DISPUTED include both SUPPORTS and REFUTES evidence, so this dataset is limited to a small subset of CLIMATE-FEVER.
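
The pairing rule above can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the dataset: the record layout (`claim`, `evidences`, `evidence`, `evidence_label`) follows the CLIMATE-FEVER field names but uses string labels for readability, the example records are invented, and `claim_to_nli_pairs` is a hypothetical helper.

```python
# Mapping from evidence labels to NLI labels, as given in the table above.
LABEL_MAP = {"SUPPORTS": "entailment", "REFUTES": "contradiction"}


def claim_to_nli_pairs(record):
    """Return NLI-style pairs for a claim that has both SUPPORTS and REFUTES
    evidence; return an empty list for every other claim."""
    evidences = record["evidences"]
    labels = {ev["evidence_label"] for ev in evidences}
    if not {"SUPPORTS", "REFUTES"} <= labels:
        return []
    return [
        {
            "sentence1": record["claim"],
            "sentence2": ev["evidence"],
            "label": LABEL_MAP[ev["evidence_label"]],
        }
        for ev in evidences
        if ev["evidence_label"] in LABEL_MAP  # NOT_ENOUGH_INFO evidence is skipped
    ]


# Invented example records (string labels are an assumption).
disputed = {
    "claim": "Claim A",
    "evidences": [
        {"evidence": "Evidence 1", "evidence_label": "SUPPORTS"},
        {"evidence": "Evidence 2", "evidence_label": "REFUTES"},
        {"evidence": "Evidence 3", "evidence_label": "NOT_ENOUGH_INFO"},
    ],
}
supported_only = {
    "claim": "Claim B",
    "evidences": [{"evidence": "Evidence 4", "evidence_label": "SUPPORTS"}],
}

pairs = claim_to_nli_pairs(disputed)
no_pairs = claim_to_nli_pairs(supported_only)
```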
#### **cf-nli-nei**
This dataset uses the list of annotator 'votes' to cast NOT_ENOUGH_INFO (NEI) evidence to SUPPORTS or REFUTES evidence. As a result, Claims in the SUPPORTS, REFUTES, and NEI classes can be used to generate additional sentence pairs.
| votes | effective evidence_label |
|---|---|
| SUPPORTS > REFUTES | _SUPPORTS_ |
| SUPPORTS < REFUTES | _REFUTES_ |
In addition to all the claims in **cf-nli**, the dataset includes any Claims that have
* **_at least one_** SUPPORTS or REFUTES evidence, AND
* NEI evidence that can be cast to effective _SUPPORTS_ or _REFUTES_.
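
The vote-casting rule in the table above can be sketched like this. It is a minimal illustration: the assumption that `votes` is a list of label strings with `None` for missing votes follows the CLIMATE-FEVER layout as we understand it, and `effective_label` is a hypothetical helper name.

```python
from collections import Counter


def effective_label(evidence_label, votes):
    """Cast NOT_ENOUGH_INFO evidence to SUPPORTS or REFUTES by majority of votes.

    Non-NEI labels are returned unchanged. A tie between SUPPORTS and REFUTES
    votes returns None, meaning the evidence cannot be cast.
    """
    if evidence_label != "NOT_ENOUGH_INFO":
        return evidence_label
    counts = Counter(v for v in votes if v is not None)  # ignore missing votes
    if counts["SUPPORTS"] > counts["REFUTES"]:
        return "SUPPORTS"
    if counts["SUPPORTS"] < counts["REFUTES"]:
        return "REFUTES"
    return None


cast_up = effective_label(
    "NOT_ENOUGH_INFO", ["SUPPORTS", "NOT_ENOUGH_INFO", "NOT_ENOUGH_INFO", None]
)
cast_down = effective_label(
    "NOT_ENOUGH_INFO", ["REFUTES", "REFUTES", "SUPPORTS", "NOT_ENOUGH_INFO"]
)
uncastable = effective_label("NOT_ENOUGH_INFO", ["NOT_ENOUGH_INFO", "NOT_ENOUGH_INFO"])
```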
#### **cf-stsb**
For each Claim <-> Evidence pair, create labeled pairs in the style of the STSb dataset:
| split | dataset | score | sentence1 | sentence2 |
|---|---|---|---|---|
| {'train', 'dev', 'test'} | 'climate-fever' | cos_sim score | claim | evidence |
This dataset uses 'evidence_label', vote 'entropy', and the list of annotator 'votes' to derive a similarity score for each claim <-> evidence pairing. Similarity score conversion:
> `mean(entropy)` refers to the average entropy within the defined group of evidence
| evidence_label | votes | similarity score |
|---|---|---|
| SUPPORTS | SUPPORTS > 0, REFUTES == 0, NOT_ENOUGH_INFO (NEI) == 0 | 1 |
| | SUPPORTS > 0, REFUTES == 0 | mean(entropy) |
| | SUPPORTS > 0, REFUTES > 0 | 1 - mean(entropy) |
| NEI | SUPPORTS > REFUTES | (1 - mean(entropy)) / 2 |
| | SUPPORTS == REFUTES | 0 |
| | SUPPORTS < REFUTES | -(1 - mean(entropy)) / 2 |
| REFUTES | SUPPORTS > 0, REFUTES > 0 | -(1 - mean(entropy)) |
| | SUPPORTS == 0, REFUTES > 0 | -mean(entropy) |
| | SUPPORTS == 0, REFUTES > 0, NEI == 0 | -1 |
The above derivation roughly maps the strength of evidence annotation (REFUTES..NEI..SUPPORTS) to cosine similarity (-1..0..1).
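
One possible reading of the conversion table is sketched below. Several points are assumptions: the most specific row is checked first wherever rows overlap, `mean_entropy` is passed in precomputed because the card does not spell out exactly which group of evidence it is averaged over, `votes` is taken to be a list of label strings with `None` for missing votes, and `similarity_score` is a hypothetical helper name.

```python
from collections import Counter


def similarity_score(evidence_label, votes, mean_entropy):
    """Map an evidence label, its votes, and a group mean entropy to a score in [-1, 1]."""
    counts = Counter(v for v in votes if v is not None)
    s, r, n = counts["SUPPORTS"], counts["REFUTES"], counts["NOT_ENOUGH_INFO"]

    if evidence_label == "SUPPORTS":
        if r == 0 and n == 0:
            return 1.0                      # unanimous support
        if r == 0:
            return mean_entropy             # support mixed with NEI votes
        return 1.0 - mean_entropy           # support mixed with REFUTES votes
    if evidence_label == "REFUTES":
        if s == 0 and n == 0:
            return -1.0                     # unanimous refutation
        if s == 0:
            return -mean_entropy            # refutation mixed with NEI votes
        return -(1.0 - mean_entropy)        # refutation mixed with SUPPORTS votes
    if evidence_label == "NOT_ENOUGH_INFO":
        if s > r:
            return (1.0 - mean_entropy) / 2
        if s < r:
            return -(1.0 - mean_entropy) / 2
        return 0.0                          # votes are balanced
    raise ValueError(f"unknown evidence_label: {evidence_label!r}")


example = similarity_score("NOT_ENOUGH_INFO", ["SUPPORTS", "SUPPORTS", "REFUTES"], 0.5)
```

The sketch does not verify that a SUPPORTS evidence actually has at least one SUPPORTS vote (or likewise for REFUTES); the table takes that for granted.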
<!--
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
[More Information Needed]
### Contributions
[More Information Needed]
-->
Provider:
slhenty
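Summary of Original Information
Dataset Overview
Dataset Name
- climate-fever-nli-stsb
Dataset Description
- Purpose: This dataset is a modification of the CLIMATE-FEVER dataset. It provides NLI-style features (cf-nli) or STSb-style features (cf-stsb) so that SentenceBERT training scripts can use them as drop-in replacements for the AllNLI and/or STSb datasets.
- Features:
  - cf-nli: derived only from SUPPORTS and REFUTES evidence.
  - cf-nli-nei: also derived from NOT_ENOUGH_INFO evidence, based on the annotator votes.
  - cf-stsb: STSb-style features used to evaluate model performance.
Dataset Structure
- Loading: use the `load_dataset` function and specify the feature style (cf-nli, cf-nli-nei, or cf-stsb) as the configuration name.
- Data splits:
  - cf-nli and cf-nli-nei: `train` split only.
  - cf-stsb: `train`, `dev`, and `test` splits.
Dataset Creation
- Purpose: To support fine-tuning SentenceBERT models in the climate change domain, this dataset attempts to fill the gap left by existing datasets for similarity tuning.
- Source data: Based on the CLIMATE-FEVER dataset, modified by deriving NLI-style features.
- Annotation process:
  - cf-nli: For each Claim that has both SUPPORTS and REFUTES evidence, create NLI-style labeled pairs.
  - cf-nli-nei: Use the annotator votes to cast NOT_ENOUGH_INFO evidence to effective SUPPORTS or REFUTES evidence.
  - cf-stsb: For each Claim-Evidence pair, create STSb-style labeled pairs, using the evidence label, the vote entropy, and the annotator votes to derive a similarity score.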
Usage Examples
All examples assume `from datasets import load_dataset`.
- Load the cf-nli dataset: `dd = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-nli')`
- Load the cf-nli-nei dataset: `dd = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-nli-nei')`
- Load the cf-stsb dataset: `ds_dev = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-stsb', split='dev')`



