slhenty/climate-fever-nli-stsb

Name: slhenty/climate-fever-nli-stsb
Creator: slhenty
Published: 2023-03-24 21:08:44
License: no description available

Hugging Face · Updated 2023-03-24 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/slhenty/climate-fever-nli-stsb

Resource description:

---
license: unknown
viewer: false
---

**==========================================**

**_IN PROGRESS - NOT READY FOR LOADING OR USE_**

**==========================================**

---

# Dataset Card for climate-fever-nli-stsb

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

The CLIMATE-FEVER dataset, modified to supply NLI-style (**cf-nli**) features or STSb-style (**cf-stsb**) features that SentenceBERT training scripts can use as drop-in replacements for the AllNLI and/or STSb datasets.

There are two **cf-nli** datasets: one derived only from SUPPORTS and REFUTES evidence (**cf-nli**), and one that also derives data from NOT_ENOUGH_INFO evidence based on the annotator votes (**cf-nli-nei**).

The feature style is specified as a named configuration when loading the dataset: cf-nli, cf-nli-nei, or cf-stsb. See the usage notes below for `load_dataset` examples.

### Usage

Load the **cf-nli** dataset

```python
# If `datasets` is not already in your environment (notebook syntax):
# !pip install datasets
from datasets import load_dataset

# all splits...
dd = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-nli')

# ... or a specific split (only 'train' is available)
ds_train = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-nli', split='train')

## ds_train can now be injected into SentenceBERT training scripts at the point
## where individual sentence pairs are aggregated into
## {'claim': {'entailment': set(), 'contradiction': set(), 'neutral': set()}} dicts
## for further processing into training samples
```

Load the **cf-nli-nei** dataset

```python
# If `datasets` is not already in your environment (notebook syntax):
# !pip install datasets
from datasets import load_dataset

# all splits...
dd = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-nli-nei')

# ... or a specific split (only 'train' is available)
ds_train = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-nli-nei', split='train')

## ds_train can now be injected into SentenceBERT training scripts at the point
## where individual sentence pairs are aggregated into
## {'claim': {'entailment': set(), 'contradiction': set(), 'neutral': set()}} dicts
## for further processing into training samples
```

Load the **cf-stsb** dataset

```python
# If `datasets` is not already in your environment (notebook syntax):
# !pip install datasets
from datasets import load_dataset

# all splits...
dd = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-stsb')

# ... or a specific split ('train', 'dev', 'test' available)
ds_dev = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-stsb', split='dev')

## ds_dev (or test) can now be injected into SentenceBERT training scripts at the point
## where individual sentence pairs are aggregated into
## a list of dev (or test) samples
```

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

SentenceBERT models are designed for domain adaptation and/or fine-tuning using labeled data from the downstream task domain. Because SentenceBERT is a bi-encoder, its primary objective is real-valued similarity scoring. Training typically uses NLI-style features as input and STSb-style features for evaluation, both during training and afterwards to measure _intrinsic_ STSb performance. Classification tasks typically use a classifier network that accepts SentenceBERT encodings as input and is trained on class-labeled datasets.

So, to fine-tune a SentenceBERT model in the climate-change domain, a labeled climate-change dataset would be ideal. Much like the authors of the CLIMATE-FEVER dataset, we know of no other _labeled_ datasets specific to climate change.
And while CLIMATE-FEVER is suitably labeled for classification tasks, it is not ready for similarity tuning in the style of SentenceBERT.

This modified CLIMATE-FEVER dataset attempts to fill that gap by deriving the NLI-style features typically used in pre-training and fine-tuning a SentenceBERT model. SentenceBERT also uses STSb-style features to evaluate model performance, both during training and after training, to gauge _intrinsic_ model performance on STSb.

### Source Data

#### Initial Data Collection and Normalization

See CLIMATE-FEVER.

#### Who are the source language producers?

See CLIMATE-FEVER.

### Annotation process

#### **cf-nli**

For each Claim that has both SUPPORTS evidence and REFUTES evidence, create labeled pairs in the style of an NLI dataset:

| split | dataset | sentence1 | sentence2 | label |
|---|---|---|---|---|
| {'train', 'test'} | 'climate-fever' | claim | evidence | evidence_label SUPPORTS -> 'entailment', REFUTES -> 'contradiction' |

> Note that by definition, only claims classified as DISPUTED include both SUPPORTS and REFUTES evidence, so this dataset is limited to a small subset of CLIMATE-FEVER.

#### **cf-nli-nei**

This dataset uses the list of annotator 'votes' to cast NOT_ENOUGH_INFO (NEI) evidence as SUPPORTS or REFUTES evidence. By doing so, Claims in the SUPPORTS, REFUTES, and NEI classes can be used to generate additional sentence pairs.

| votes | effective evidence_label |
|---|---|
| SUPPORTS > REFUTES | _SUPPORTS_ |
| SUPPORTS < REFUTES | _REFUTES_ |

In addition to all the claims in **cf-nli**, the dataset includes any Claims that have

* **_at least one_** SUPPORTS or REFUTES evidence, AND
* NEI evidences that can be cast to effective _SUPPORTS_ or _REFUTES_.
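The two NLI derivations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used to build the dataset: the evidence dict keys (`evidence`, `evidence_label`, `votes`), the representation of votes as a list of label strings, and the helper names `effective_label`, `keep_claim`, and `nli_pairs` are all hypothetical.

```python
from collections import Counter

# Label mapping taken from the cf-nli table.
NLI_LABEL = {"SUPPORTS": "entailment", "REFUTES": "contradiction"}


def effective_label(evidence_label, votes):
    """Cast NEI evidence to SUPPORTS/REFUTES by majority vote (cf-nli-nei table).

    Returns None when the votes are tied, i.e. the evidence cannot be cast.
    Assumes `votes` is a list of label strings (hypothetical format).
    """
    if evidence_label in NLI_LABEL:
        return evidence_label
    counts = Counter(votes)
    if counts["SUPPORTS"] > counts["REFUTES"]:
        return "SUPPORTS"
    if counts["SUPPORTS"] < counts["REFUTES"]:
        return "REFUTES"
    return None


def keep_claim(evidences, use_nei=False):
    """Claim-level inclusion rule for cf-nli (default) or cf-nli-nei."""
    labels = {ev["evidence_label"] for ev in evidences}
    if {"SUPPORTS", "REFUTES"} <= labels:
        return True  # cf-nli: both SUPPORTS and REFUTES evidence present
    if not use_nei:
        return False
    has_direct = bool(labels & {"SUPPORTS", "REFUTES"})
    has_castable = any(
        ev["evidence_label"] == "NOT_ENOUGH_INFO"
        and effective_label(ev["evidence_label"], ev["votes"]) is not None
        for ev in evidences
    )
    return has_direct and has_castable


def nli_pairs(claim, evidences, use_nei=False):
    """Build (sentence1, sentence2, label) rows for one claim."""
    rows = []
    for ev in evidences:
        label = ev["evidence_label"]
        if use_nei:
            label = effective_label(label, ev["votes"])
        if label in NLI_LABEL:
            rows.append((claim, ev["evidence"], NLI_LABEL[label]))
    return rows
```

With `use_nei=False` only evidence already labeled SUPPORTS or REFUTES produces a pair; with `use_nei=True` NEI evidence with a vote majority is included as well, and tied votes are skipped.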
#### **cf-stsb**

For each Claim <-> Evidence pair, create labeled pairs in the style of the STSb dataset:

| split | dataset | score | sentence1 | sentence2 |
|---|---|---|---|---|
| {'train', 'dev', 'test'} | 'climate-fever' | cos_sim score | claim | evidence |

This dataset uses 'evidence_label', vote 'entropy', and the list of annotator 'votes' to derive a similarity score for each claim <-> evidence pairing.

Similarity score conversion:

> `mean(entropy)` refers to the average entropy within the defined group of evidence

| evidence_label | votes | similarity score |
|---|---|---|
| SUPPORTS | SUPPORTS > 0, REFUTES == 0, NOT_ENOUGH_INFO (NEI) == 0 | 1 |
| | SUPPORTS > 0, REFUTES == 0 | mean(entropy) |
| | SUPPORTS > 0, REFUTES > 0 | 1 - mean(entropy) |
| NEI | SUPPORTS > REFUTES | (1 - mean(entropy)) / 2 |
| | SUPPORTS == REFUTES | 0 |
| | SUPPORTS < REFUTES | -(1 - mean(entropy)) / 2 |
| REFUTES | SUPPORTS > 0, REFUTES > 0 | -(1 - mean(entropy)) |
| | SUPPORTS == 0, REFUTES > 0 | -mean(entropy) |
| | SUPPORTS == 0, REFUTES > 0, NEI == 0 | -1 |

The above derivation roughly maps the strength of evidence annotation (REFUTES..NEI..SUPPORTS) to cosine similarity (-1..0..1).
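As a rough illustration, the conversion table can be written as a single function. This is a sketch under assumptions, not the dataset's actual build code: the function name and signature are hypothetical, `mean_entropy` is assumed to be precomputed for the relevant group of evidence and to lie in [0, 1], and within each evidence_label group the row with the `NEI == 0` condition is assumed to take precedence over the otherwise identical row without it.

```python
def similarity_score(evidence_label, n_supports, n_refutes, n_nei, mean_entropy):
    """Map an evidence label and its vote counts to a score in [-1, 1].

    Mirrors the conversion table row by row; `mean_entropy` is assumed to be
    the precomputed average entropy for the evidence group (hypothetical input).
    """
    if evidence_label == "SUPPORTS":
        if n_refutes > 0:
            return 1.0 - mean_entropy   # mixed SUPPORTS and REFUTES votes
        if n_nei == 0:
            return 1.0                  # unanimous SUPPORTS
        return mean_entropy             # SUPPORTS with some NEI votes
    if evidence_label == "REFUTES":
        if n_supports > 0:
            return -(1.0 - mean_entropy)
        if n_nei == 0:
            return -1.0                 # unanimous REFUTES
        return -mean_entropy            # REFUTES with some NEI votes
    # NOT_ENOUGH_INFO: half-strength score in the direction of the majority
    if n_supports > n_refutes:
        return (1.0 - mean_entropy) / 2
    if n_supports < n_refutes:
        return -(1.0 - mean_entropy) / 2
    return 0.0
```

For example, under these assumptions a SUPPORTS evidence with votes of 2 SUPPORTS and 1 REFUTES and a group mean entropy of 0.25 would score 0.75, while the mirror-image REFUTES evidence would score -0.75.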

Provider:

slhenty

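Summary of original information

Dataset overview

Dataset name

climate-fever-nli-stsb

Dataset description

Purpose: This dataset is a modification of the CLIMATE-FEVER dataset that supplies NLI-style features (cf-nli) or STSb-style features (cf-stsb), so that SentenceBERT training scripts can use them as drop-in replacements for the AllNLI and/or STSb datasets.
Features:
- cf-nli: derived only from SUPPORTS and REFUTES evidence.
- cf-nli-nei: also derives data from NOT_ENOUGH_INFO evidence, based on annotator votes.
- cf-stsb: STSb-style features for evaluating model performance.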

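Dataset structure

Loading: use the load_dataset function and pass the feature style (cf-nli, cf-nli-nei, or cf-stsb) as the configuration name.
Data splits:
- cf-nli and cf-nli-nei: train split only.
- cf-stsb: train, dev, and test splits.

Dataset creation

Purpose: to support fine-tuning SentenceBERT models in the climate-change domain, this dataset attempts to fill the gap left by existing datasets for similarity tuning.
Source data: based on the CLIMATE-FEVER dataset, modified by deriving NLI-style features.
Annotation process:
- cf-nli: for each Claim that has both SUPPORTS and REFUTES evidence, create NLI-style labeled pairs.
- cf-nli-nei: use the annotators' votes to cast NOT_ENOUGH_INFO evidence to effective SUPPORTS or REFUTES evidence.
- cf-stsb: for each Claim-Evidence pair, create STSb-style labeled pairs, using the evidence label, vote entropy, and annotator votes to derive a similarity score.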

Usage examples

Load the cf-nli dataset: `from datasets import load_dataset; dd = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-nli')`
Load the cf-nli-nei dataset: `from datasets import load_dataset; dd = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-nli-nei')`
Load the cf-stsb dataset: `from datasets import load_dataset; ds_dev = load_dataset('slhenty/climate-fever-nli-stsb', 'cf-stsb', split='dev')`
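
The dataset card says the loaded NLI rows are meant to be aggregated per claim into `{'claim': {'entailment': set(), 'contradiction': set(), 'neutral': set()}}` dicts. A minimal sketch of that grouping step follows. It assumes each row is a dict with `sentence1`, `sentence2`, and a string-valued `label`, as in the cf-nli table; the function name `group_by_claim` is hypothetical, and the actual label encoding in the published data may differ.

```python
from collections import defaultdict


def group_by_claim(rows):
    """Group NLI-style rows into {claim: {label: set(evidence sentences)}}.

    Assumes string labels 'entailment', 'contradiction', or 'neutral'
    (an assumption; integer class ids would need mapping first).
    """
    grouped = defaultdict(
        lambda: {"entailment": set(), "contradiction": set(), "neutral": set()}
    )
    for row in rows:
        grouped[row["sentence1"]][row["label"]].add(row["sentence2"])
    return dict(grouped)


# Illustrative in-memory rows; a loaded `ds_train` could be passed the same way.
example_rows = [
    {"sentence1": "claim A", "sentence2": "evidence 1", "label": "entailment"},
    {"sentence1": "claim A", "sentence2": "evidence 2", "label": "contradiction"},
]
grouped = group_by_claim(example_rows)
```

Sets are used so that duplicate evidence sentences for the same claim collapse into one entry, matching the structure named in the card's usage comments.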
