pietrolesci/nli_fever

Name: pietrolesci/nli_fever
Creator: pietrolesci
Published: 2022-04-25 09:03:28
License: No description available

Source: Hugging Face | Updated: 2022-04-25 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/pietrolesci/nli_fever

Resource description:

## Overview

The original dataset can be found [here](https://www.dropbox.com/s/hylbuaovqwo2zav/nli_fever.zip?dl=0), and the GitHub repo is [here](https://github.com/easonnie/combine-FEVER-NSMN/blob/master/other_resources/nli_fever.md). The dataset was proposed in [Combining fact extraction and verification with neural semantic matching networks](https://dl.acm.org/doi/abs/10.1609/aaai.v33i01.33016859) and was created as a modification of FEVER.

In the original FEVER setting, the input is a claim from Wikipedia and the expected output is a label. This differs from the standard NLI formalization, which is a *pair-of-sequence to label* problem. To let NLI-related research take advantage of the FEVER dataset, the authors pair the claims in FEVER with the textual evidence, producing a *pair-of-sequence to label* formatted dataset.

## Dataset curation

The label mapping follows the paper:

```python
mapping = {
    "SUPPORTS": 0,  # entailment
    "NOT ENOUGH INFO": 1,  # neutral
    "REFUTES": 2,  # contradiction
}
```

The "verifiable" column is encoded as follows:

```python
mapping = {"NOT VERIFIABLE": 0, "VERIFIABLE": 1}
```

Finally, a consistency check against the labels reported in the original FEVER dataset is performed.

NOTE: no label is available for the "test" split.

NOTE: there are 3 instances in common between the `dev` and `train` splits.

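When inspecting examples it is often convenient to go from the integer ids back to readable names. The following is a minimal sketch that inverts the two mappings given above; the helper `decode_label` and the constant names are hypothetical and not part of the dataset or its generation script.

```python
# Forward mappings, copied from the dataset card.
FEVER_TO_ID = {"SUPPORTS": 0, "NOT ENOUGH INFO": 1, "REFUTES": 2}
VERIFIABLE_TO_ID = {"NOT VERIFIABLE": 0, "VERIFIABLE": 1}

# NLI names in the order given by the comments in the label mapping.
ID_TO_NLI = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Inverse mappings (hypothetical helpers for illustration).
ID_TO_FEVER = {v: k for k, v in FEVER_TO_ID.items()}
ID_TO_VERIFIABLE = {v: k for k, v in VERIFIABLE_TO_ID.items()}


def decode_label(label_id):
    """Return (nli_name, fever_label) for an integer label id.

    Raises KeyError for any id outside 0..2 instead of guessing.
    """
    return ID_TO_NLI[label_id], ID_TO_FEVER[label_id]


print(decode_label(0))      # ('entailment', 'SUPPORTS')
print(ID_TO_VERIFIABLE[1])  # VERIFIABLE
```

Dictionaries are used rather than list indexing so that an unexpected id fails loudly instead of silently wrapping around.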
## Code to generate the dataset

```python
import pandas as pd
from datasets import Dataset, ClassLabel, load_dataset, Value, Features, DatasetDict
import json


# download data from https://www.dropbox.com/s/hylbuaovqwo2zav/nli_fever.zip?dl=0
paths = {
    "train": "<some_path>/nli_fever/train_fitems.jsonl",
    "validation": "<some_path>/nli_fever/dev_fitems.jsonl",
    "test": "<some_path>/nli_fever/test_fitems.jsonl",
}


# parsing code from https://github.com/facebookresearch/anli/blob/main/src/utils/common.py
registered_jsonabl_classes = {}


def register_class(cls):
    global registered_jsonabl_classes
    if cls not in registered_jsonabl_classes:
        registered_jsonabl_classes.update({cls.__name__: cls})


def unserialize_JsonableObject(d):
    global registered_jsonabl_classes
    classname = d.pop("_jcls_", None)
    if classname:
        cls = registered_jsonabl_classes[classname]
        obj = cls.__new__(cls)  # Make instance without calling __init__
        for key, value in d.items():
            setattr(obj, key, value)
        return obj
    else:
        return d


def load_jsonl(filename, debug_num=None):
    d_list = []
    with open(filename, encoding="utf-8", mode="r") as in_f:
        print("Load Jsonl:", filename)
        for line in in_f:
            item = json.loads(line.strip(), object_hook=unserialize_JsonableObject)
            d_list.append(item)
            if debug_num is not None and 0 < debug_num == len(d_list):
                break

    return d_list


def get_original_fever() -> pd.DataFrame:
    """Get original fever datasets."""

    fever_v1 = load_dataset("fever", "v1.0")
    fever_v2 = load_dataset("fever", "v2.0")

    columns = ["id", "label"]
    splits = ["paper_test", "paper_dev", "labelled_dev", "train"]

    list_dfs = [fever_v1[split].to_pandas()[columns] for split in splits]
    list_dfs.append(fever_v2["validation"].to_pandas()[columns])

    dfs = pd.concat(list_dfs, ignore_index=False)
    dfs = dfs.drop_duplicates()
    dfs = dfs.rename(columns={"label": "fever_gold_label"})

    return dfs


def load_and_process(path: str, fever_df: pd.DataFrame) -> pd.DataFrame:
    """Load data split and merge with fever."""

    df = pd.DataFrame(load_jsonl(path))
    df = df.rename(columns={"query": "premise", "context": "hypothesis"})

    # adjust dtype
    df["cid"] = df["cid"].astype(int)

    # merge with original fever to get labels
    df = pd.merge(df, fever_df, left_on="cid", right_on="id", how="inner").drop_duplicates()

    return df


def encode_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Encode labels using the mapping used in SNLI and MultiNLI"""
    mapping = {
        "SUPPORTS": 0,  # entailment
        "NOT ENOUGH INFO": 1,  # neutral
        "REFUTES": 2,  # contradiction
    }
    df["label"] = df["fever_gold_label"].map(mapping)

    # verifiable
    df["verifiable"] = df["verifiable"].map({"NOT VERIFIABLE": 0, "VERIFIABLE": 1})

    return df


if __name__ == "__main__":

    fever_df = get_original_fever()

    dataset_splits = {}
    for split, path in paths.items():

        # from json to dataframe and merge with fever
        df = load_and_process(path, fever_df)

        if not len(df) > 0:
            print(f"Split `{split}` has no matches")
            continue

        if split == "train":
            # train must have same labels
            assert sum(df["fever_gold_label"] != df["label"]) == 0

        # encode labels using the default mapping used by other nli datasets
        # i.e, entailment: 0, neutral: 1, contradiction: 2
        df = df.drop(columns=["label"])
        df = encode_labels(df)

        # cast to dataset
        features = Features(
            {
                "cid": Value(dtype="int64", id=None),
                "fid": Value(dtype="string", id=None),
                "id": Value(dtype="int32", id=None),
                "premise": Value(dtype="string", id=None),
                "hypothesis": Value(dtype="string", id=None),
                "verifiable": Value(dtype="int64", id=None),
                "fever_gold_label": Value(dtype="string", id=None),
                "label": ClassLabel(num_classes=3, names=["entailment", "neutral", "contradiction"]),
            }
        )
        if "test" in path:
            # no features for test set
            df["label"] = -1
            df["verifiable"] = -1
            df["fever_gold_label"] = "not available"
        dataset = Dataset.from_pandas(df, features=features)
        dataset_splits[split] = dataset

    nli_fever = DatasetDict(dataset_splits)
    nli_fever.push_to_hub("pietrolesci/nli_fever", token="<your token>")

    # check overlap between splits
    from itertools import combinations

    for i, j in combinations(dataset_splits.keys(), 2):
        print(
            f"{i} - {j}: ",
            pd.merge(
                dataset_splits[i].to_pandas(),
                dataset_splits[j].to_pandas(),
                on=["premise", "hypothesis", "label"],
                how="inner",
            ).shape[0],
        )
    #> train - dev: 3
    #> train - test: 0
    #> dev - test: 0
```
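The overlap check above only counts shared instances; it does not remove them. If a leakage-free training set is wanted, one option is to drop from `train` every row whose `(premise, hypothesis, label)` triple also occurs in `dev`. The sketch below does this on plain lists of dictionaries. The helper names and the toy rows are hypothetical, and dropping from `train` rather than from `dev` is an assumption made here, not something the dataset card prescribes.

```python
def pair_key(row):
    """Key used to compare rows, matching the columns of the overlap check."""
    return (row["premise"], row["hypothesis"], row["label"])


def drop_overlap(train_rows, dev_rows):
    """Return the train rows whose key does not occur in dev."""
    dev_keys = {pair_key(row) for row in dev_rows}
    return [row for row in train_rows if pair_key(row) not in dev_keys]


# Toy rows, invented for illustration only.
train = [
    {"premise": "claim A", "hypothesis": "evidence A", "label": 0},
    {"premise": "claim B", "hypothesis": "evidence B", "label": 2},
    {"premise": "claim C", "hypothesis": "evidence C", "label": 1},
]
dev = [
    {"premise": "claim B", "hypothesis": "evidence B", "label": 2},
    {"premise": "claim D", "hypothesis": "evidence D", "label": 0},
]

clean_train = drop_overlap(train, dev)
print(len(train) - len(clean_train))  # 1 row removed
```

Rows loaded with the `datasets` library behave like dictionaries when iterated, so the same helper should apply to them, although for large splits a `Dataset.filter` call would avoid materialising a Python list.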

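Provider:

pietrolesci

Summary of original information

Dataset overview

Source and modification: the dataset is a modification of FEVER that converts it into the standard NLI (natural language inference) format, i.e. a pair-of-sequence to label problem.
Original data download: the original dataset is available at https://www.dropbox.com/s/hylbuaovqwo2zav/nli_fever.zip?dl=0.
Related paper: the dataset was proposed in the paper "Combining fact extraction and verification with neural semantic matching networks".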

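Dataset curation

Label mapping: the labels are mapped as follows:

```python
mapping = {
    "SUPPORTS": 0,  # entailment
    "NOT ENOUGH INFO": 1,  # neutral
    "REFUTES": 2,  # contradiction
}
```

Verifiability encoding: the "verifiable" column is encoded as follows:

```python
mapping = {"NOT VERIFIABLE": 0, "VERIFIABLE": 1}
```

Splits: the dataset is divided into `train`, `validation`, and `test`.
Notes:
- No labels are provided for the `test` split.
- The `train` and `dev` (validation) splits share 3 instances.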
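Because the test split carries no labels, evaluation code should skip unlabeled rows rather than score against a placeholder. The generation script on this page writes `-1` into `label` for test rows; the sketch below assumes the published copy keeps that sentinel. The function name `accuracy` and the constant `UNLABELED` are hypothetical.

```python
UNLABELED = -1  # sentinel the generation script assigns to test rows


def accuracy(predictions, rows):
    """Accuracy over labeled rows only; None when no row carries a label."""
    scored = [
        (pred, row["label"])
        for pred, row in zip(predictions, rows)
        if row["label"] != UNLABELED
    ]
    if not scored:
        return None
    return sum(pred == gold for pred, gold in scored) / len(scored)


# Two labeled rows and one unlabeled row; only the first two are scored.
rows = [{"label": 0}, {"label": 2}, {"label": UNLABELED}]
print(accuracy([0, 1, 2], rows))  # 0.5
```

Returning `None` for a fully unlabeled split keeps a missing score distinguishable from a genuine accuracy of zero.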

Dataset generation code

Data loading and processing: a Python script loads the data from JSONL files, then applies the necessary preprocessing and label encoding.
Dataset structure: the dataset has the following features:

```python
Features(
    {
        "cid": Value(dtype="int64", id=None),
        "fid": Value(dtype="string", id=None),
        "id": Value(dtype="int32", id=None),
        "premise": Value(dtype="string", id=None),
        "hypothesis": Value(dtype="string", id=None),
        "verifiable": Value(dtype="int64", id=None),
        "fever_gold_label": Value(dtype="string", id=None),
        "label": ClassLabel(num_classes=3, names=["entailment", "neutral", "contradiction"]),
    }
)
```

Dataset upload: the dataset has been uploaded to the Hugging Face Hub as pietrolesci/nli_fever.
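The feature list above can be turned into a lightweight sanity check for individual rows, for example when rebuilding the dataset locally. The following is a sketch under stated assumptions: the mapping from `int64`/`int32`/`string` to Python `int`/`str` is a simplification, `-1` is accepted as a label because the generation script writes it for test rows, and `validate_row` and the example row are hypothetical.

```python
# Python types expected for each column, read off the Features declaration.
EXPECTED_TYPES = {
    "cid": int,
    "fid": str,
    "id": int,
    "premise": str,
    "hypothesis": str,
    "verifiable": int,
    "fever_gold_label": str,
    "label": int,
}
# 0/1/2 are the ClassLabel ids; -1 is the sentinel written for test rows.
ALLOWED_LABELS = {-1, 0, 1, 2}


def validate_row(row):
    """Return a list of problems; an empty list means the row looks valid."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in row:
            problems.append(f"missing field: {name}")
            continue
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    label = row.get("label")
    if isinstance(label, int) and not isinstance(label, bool) and label not in ALLOWED_LABELS:
        problems.append(f"label out of range: {label}")
    return problems


# Hypothetical row; every value is invented for illustration.
example = {
    "cid": 1,
    "fid": "example-fid",
    "id": 1,
    "premise": "An example claim.",
    "hypothesis": "An example piece of evidence.",
    "verifiable": 1,
    "fever_gold_label": "SUPPORTS",
    "label": 0,
}
print(validate_row(example))  # []
```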

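Collected summary

Dataset introduction

Construction

nli_fever was created to make fact-checking data usable for natural language inference research. It restructures the FEVER dataset, converting the original single-sequence labeling task into a sequence-pair labeling task. During construction, Wikipedia claims are paired with the corresponding textual evidence to form premise-hypothesis pairs. A label mapping assigns the original SUPPORTS, NOT ENOUGH INFO, and REFUTES labels to entailment, neutral, and contradiction respectively, and a verifiability annotation is included. The data is generated by an automated script that checks consistency with the original FEVER labels, yielding a corpus with training, validation, and test sets.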

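Characteristics

The dataset's most notable characteristic is that it combines two properties: it preserves the claim-evidence correspondence of fact checking while following the standard formal structure of natural language inference. Each entry contains identifiers, premise text, hypothesis text, and several annotation fields, among which the verifiability annotation offers an additional angle for analysis. Overlap between the splits is low: only 3 instances are shared between the training and validation sets. The three-class label scheme is compatible with mainstream NLI datasets, which eases model transfer and comparison.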

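Usage

Researchers can load the preprocessed version directly from the Hugging Face platform, or rebuild it from the original data using the open-source code. Typical applications include training natural language inference models, developing fact-checking systems, and building multi-task learning frameworks. Note that the test set has no labels, so evaluation on the training and validation splits is recommended. For further study, the verifiability annotation can be used to explore how evidence reliability affects inference, and comparison with the original FEVER data can be used to study the effect of recasting the task. The standardized format is compatible with existing NLI evaluation frameworks.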
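As a starting point for the loading route described above, the sketch below fetches one split with the `datasets` library and counts rows per label. It assumes the `datasets` package is installed and the Hub is reachable. The split name is passed in by the caller because this page refers to the validation split as both `validation` and `dev`, so the exact name should be checked on the dataset page. The helper names are hypothetical.

```python
from collections import Counter

# 0..2 follow the label mapping on this page; -1 marks unlabeled test rows
# according to the generation script.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction", -1: "unlabeled"}


def label_distribution(rows):
    """Count rows per label name for any iterable of dict-like rows."""
    return Counter(LABEL_NAMES[row["label"]] for row in rows)


def load_split(split="train"):
    """Fetch one split from the Hub (needs `pip install datasets` and network)."""
    # Imported lazily so label_distribution works without the dependency.
    from datasets import load_dataset

    return load_dataset("pietrolesci/nli_fever", split=split)


# Offline demonstration on toy rows; replace with load_split("train") to
# inspect the real data.
toy = [{"label": 0}, {"label": 0}, {"label": 2}, {"label": -1}]
print(label_distribution(toy))
```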

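Background and challenges

Background overview

In natural language inference, dataset construction is key to advancing models' understanding of semantic relations between texts. nli_fever was built by adapting the FEVER dataset, with the aim of recasting fact checking as a standard natural language inference task. It pairs claims from Wikipedia with related textual evidence to form premise-hypothesis pairs labeled as entailment, neutral, or contradiction, giving NLI research a large-scale data resource. Its creation brought fact checking and semantic matching closer together and provides a basis for later model design and evaluation.

Current challenges

The dataset targets natural language inference in a fact-checking setting: a model must judge the logical relation between a claim and its evidence, which places high demands on semantic understanding and reasoning. The construction challenges lie mainly in data conversion and label consistency. The original FEVER format differs from standard NLI, so claims and evidence must be re-paired and labels remapped. The converted data must also remain consistent with the original labels, and overlap between the training and validation sets must be checked. Each of these steps requires careful engineering and validation.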

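Common scenarios

Classic use cases

nli_fever pairs FEVER claims with textual evidence and converts them into standard premise-hypothesis pairs, supporting sequence-pair to label inference tasks. Its classic use is training and evaluating natural language inference models, in particular three-way classification into entailment, neutral, and contradiction. Because the data comes from Wikipedia and covers a wide range of factual claims, models must reason against real-world knowledge, which makes the task both more demanding and more practical.

Practical applications

In practice, nli_fever provides a training and testing basis for automated fact-checking systems. Models trained on it can be used to analyze claims in news articles, social media content, or other text and to judge whether they are consistent with known evidence. This can help counter misinformation, assist content moderation, and support information verification in areas such as education and news publishing. The structured premise-hypothesis format also makes it easier to integrate into broader information retrieval and question answering systems to improve their factual accuracy.

Derived and related work

A number of research directions relate to nli_fever. Its source paper, "Combining fact extraction and verification with neural semantic matching networks", proposed a neural semantic matching network framework that combines fact extraction and verification. The dataset has also encouraged exploration of model architectures for hybrid tasks (such as joint inference and retrieval) and research into the robustness of models that reason over open-domain evidence. Related work further includes using the dataset for adversarial example generation, studying how models handle "not enough info" cases, and applying multi-task learning to natural language inference and fact checking.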

The content above was collected and summarized by 遇见数据集.
