pietrolesci/fracas

Name: pietrolesci/fracas
Creator: pietrolesci
Published: 2022-04-25 08:40:07
License: No description available

Hugging Face; updated 2022-04-25, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/pietrolesci/fracas

Description:

## Overview

The original dataset is available [here](https://github.com/felipessalvatore/NLI_datasets). The original description is reproduced below for convenience.

```bibtex
@MISC{Fracas96,
  author = {{The Fracas Consortium} and Robin Cooper and Dick Crouch and Jan Van Eijck and Chris Fox and Josef Van Genabith and Jan Jaspars and Hans Kamp and David Milward and Manfred Pinkal and Massimo Poesio and Steve Pulman and Ted Briscoe and Holger Maier and Karsten Konrad},
  title = {Using the Framework},
  year = {1996}
}
```

Adapted from [https://nlp.stanford.edu/~wcmac/downloads/fracas.xml](https://nlp.stanford.edu/~wcmac/downloads/fracas.xml). We took `P1, ..., Pn` as premise and `H` as hypothesis. Labels have been mapped as follows: `{"yes": "entailment", "no": "contradiction", "undef": "neutral", "unknown": "neutral"}`. We randomly split the data 80/20 into train/dev.

## Dataset curation

One hypothesis in the dev set and three hypotheses in the train set are empty and have been filled in with the empty string `""`.

Labels are encoded with a custom NLI mapping:

```
{"entailment": 0, "neutral": 1, "contradiction": 2}
```

## Code to create the dataset

```python
import pandas as pd
from datasets import Features, Value, ClassLabel, Dataset, DatasetDict, load_dataset
from pathlib import Path

# load datasets
path = Path("<path to folder>/nli_datasets")
datasets = {}
for dataset_path in path.iterdir():
    datasets[dataset_path.name] = {}
    for name in dataset_path.iterdir():
        df = pd.read_csv(name)
        datasets[dataset_path.name][name.name.split(".")[0]] = df

ds = {}
for name, df_ in datasets["fracas"].items():
    df = df_.copy()
    assert df["label"].isna().sum() == 0

    # fill-in empty hypothesis
    df = df.fillna("")

    # encode labels
    df["label"] = df["label"].map({"entailment": 0, "neutral": 1, "contradiction": 2})

    # cast to dataset
    features = Features({
        "premise": Value(dtype="string", id=None),
        "hypothesis": Value(dtype="string", id=None),
        "label": ClassLabel(num_classes=3, names=["entailment", "neutral", "contradiction"]),
    })
    ds[name] = Dataset.from_pandas(df, features=features)

dataset = DatasetDict(ds)
dataset.push_to_hub("fracas", token="<token>")

# check overlap between splits
from itertools import combinations
for i, j in combinations(ds.keys(), 2):
    print(
        f"{i} - {j}: ",
        pd.merge(
            ds[i].to_pandas(),
            ds[j].to_pandas(),
            on=["label", "premise", "hypothesis"],
            how="inner",
        ).shape[0],
    )
#> train - dev: 0
```

Provided by:

pietrolesci

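Summary of original information

Dataset overview

Source: The original dataset is available at https://github.com/felipessalvatore/NLI_datasets.
Data processing: `P1, ..., Pn` are used as the premise and `H` as the hypothesis. Labels are mapped as follows: `{"yes": "entailment", "no": "contradiction", "undef": "neutral", "unknown": "neutral"}`. The data is randomly split 80/20 into a training set and a dev set.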
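
The mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the curator's code: the helper name `to_nli_example` is hypothetical, and joining multiple premises with a single space is an assumption, since the card does not say how `P1, ..., Pn` are combined.

```python
# Mapping from FraCaS answers to NLI labels, as stated in the dataset card.
FRACAS_TO_NLI = {
    "yes": "entailment",
    "no": "contradiction",
    "undef": "neutral",
    "unknown": "neutral",
}


def to_nli_example(premises, hypothesis, answer):
    """Build one NLI record from a FraCaS-style problem (hypothetical helper)."""
    return {
        # Assumption: premises are concatenated with a single space.
        "premise": " ".join(p.strip() for p in premises),
        "hypothesis": hypothesis.strip(),
        "label": FRACAS_TO_NLI[answer],
    }


# Illustrative input, not taken from the dataset.
example = to_nli_example(
    ["A dog barked loudly.", "The dog was in the garden."],
    "A dog barked.",
    "yes",
)
print(example["label"])  # entailment
```

Both `undef` and `unknown` collapse into `neutral`, so the original four-way answer cannot be recovered from the NLI label alone.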

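Dataset curation

Empty hypotheses: One hypothesis in the dev set and three in the training set are empty and have been filled in with the empty string `""`.
Label encoding: Labels are encoded with a custom NLI mapping, `{"entailment": 0, "neutral": 1, "contradiction": 2}`.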
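
The two curation steps can be illustrated without pandas. The sketch below is an assumption-laden stand-in for the card's `fillna("")` and `map(...)` calls: it represents a missing hypothesis as `None` (pandas would use `NaN`), and the names `curate`, `LABEL2ID`, and `ID2LABEL` are hypothetical.

```python
# Label encoding stated in the dataset card, plus its inverse for decoding.
LABEL2ID = {"entailment": 0, "neutral": 1, "contradiction": 2}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}


def curate(record):
    """Return a copy with a missing hypothesis filled in and the label encoded."""
    hypothesis = record.get("hypothesis")
    return {
        "premise": record["premise"],
        # Assumption: a missing hypothesis is represented as None here.
        "hypothesis": "" if hypothesis is None else hypothesis,
        "label": LABEL2ID[record["label"]],
    }


# Illustrative records, not taken from the dataset.
rows = [
    {"premise": "A dog barked loudly.", "hypothesis": "A dog barked.", "label": "entailment"},
    {"premise": "No dog barked.", "hypothesis": None, "label": "contradiction"},
]
curated = [curate(r) for r in rows]
print(curated[1])  # {'premise': 'No dog barked.', 'hypothesis': '', 'label': 2}
```

Note that this ordering (`neutral` as 1, `contradiction` as 2) is what the card calls a custom mapping, so decoding with `ID2LABEL` is safer than assuming another dataset's label order.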

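Dataset creation code

Data loading: The data is loaded and processed with Python's pandas and datasets libraries.
Data processing: Empty hypotheses are filled in and labels are encoded.
Dataset structure: The dataset has three features: premise, hypothesis, and label. The label is a class label with three classes: entailment, neutral, contradiction.
Dataset upload: The dataset was uploaded to the Hub, and the training and dev sets were checked for overlap, confirming that they share no examples.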
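
The overlap check can also be expressed with plain sets, which may be easier to reason about than an inner merge. This is a sketch under assumptions: the `overlap` helper and the toy splits are hypothetical, and it counts distinct shared `(premise, hypothesis, label)` triples, whereas the card's `pd.merge(..., how="inner")` counts matching row pairs. Both give zero exactly when the splits share no example.

```python
from itertools import combinations

KEYS = ("premise", "hypothesis", "label")


def as_keys(rows):
    """Reduce a split to the set of its (premise, hypothesis, label) triples."""
    return {tuple(row[k] for k in KEYS) for row in rows}


def overlap(split_a, split_b):
    """Count distinct triples that appear in both splits."""
    return len(as_keys(split_a) & as_keys(split_b))


# Illustrative splits, not taken from the dataset.
splits = {
    "train": [{"premise": "A dog barked loudly.", "hypothesis": "A dog barked.", "label": 0}],
    "dev": [{"premise": "No dog barked.", "hypothesis": "A dog barked.", "label": 2}],
}
for a, b in combinations(splits, 2):
    print(f"{a} - {b}: {overlap(splits[a], splits[b])}")  # train - dev: 0
```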
