pietrolesci/fracas
收藏Hugging Face2022-04-25 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/pietrolesci/fracas
下载链接
链接失效反馈官方服务:
资源简介:
## Overview
Original dataset [here](https://github.com/felipessalvatore/NLI_datasets).
Below the original description reported for convenience.
```latex
@MISC{Fracas96,
author = {{The Fracas Consortium} and Robin Cooper and Dick Crouch and Jan Van Eijck and Chris Fox and Josef Van Genabith and Jan Jaspars and Hans Kamp and David Milward and Manfred Pinkal and Massimo Poesio and Steve Pulman and Ted Briscoe and Holger Maier and Karsten Konrad},
title = {Using the Framework},
year = {1996}
}
```
Adapted from [https://nlp.stanford.edu/~wcmac/downloads/fracas.xml](https://nlp.stanford.edu/~wcmac/downloads/fracas.xml). We took `P1, ..., Pn` as premise and H as hypothesis. Labels have been mapped as follows `{'yes': "entailment", 'no': 'contradiction', 'undef': "neutral", 'unknown': "neutral"}`. And we randomly split 80/20 for train/dev.
## Dataset curation
One hypothesis in the dev set and three hypotheses in the train set are empty and have been
filled in with the empty string `""`. Labels are encoded with custom NLI mapping, that is
```
{"entailment": 0, "neutral": 1, "contradiction": 2}
```
## Code to create the dataset
```python
import pandas as pd
from datasets import Features, Value, ClassLabel, Dataset, DatasetDict, load_dataset
from pathlib import Path
# load datasets
path = Path("<path to folder>/nli_datasets")
datasets = {}
for dataset_path in path.iterdir():
datasets[dataset_path.name] = {}
for name in dataset_path.iterdir():
df = pd.read_csv(name)
datasets[dataset_path.name][name.name.split(".")[0]] = df
ds = {}
for name, df_ in datasets["fracas"].items():
df = df_.copy()
assert df["label"].isna().sum() == 0
# fill-in empty hypothesis
df = df.fillna("")
# encode labels
df["label"] = df["label"].map({"entailment": 0, "neutral": 1, "contradiction": 2})
# cast to dataset
features = Features({
"premise": Value(dtype="string", id=None),
"hypothesis": Value(dtype="string", id=None),
"label": ClassLabel(num_classes=3, names=["entailment", "neutral", "contradiction"]),
})
ds[name] = Dataset.from_pandas(df, features=features)
dataset = DatasetDict(ds)
dataset.push_to_hub("fracas", token="<token>")
# check overlap between splits
from itertools import combinations
for i, j in combinations(ds.keys(), 2):
print(
f"{i} - {j}: ",
pd.merge(
ds[i].to_pandas(),
ds[j].to_pandas(),
on=["label", "premise", "hypothesis"],
how="inner",
).shape[0],
)
#> train - dev: 0
```
提供机构:
pietrolesci
原始信息汇总
数据集概述
- 来源: 原始数据集来自此处。
- 数据处理: 数据集中的
P1, ..., Pn被用作前提,H作为假设。标签映射如下:{yes: "entailment", no: contradiction, undef: "neutral", unknown: "neutral"}。数据集按80/20比例随机分为训练集和开发集。
数据集整理
- 空假设处理: 开发集中有一个假设和训练集中的三个假设为空,已被填充为空字符串
""。 - 标签编码: 使用自定义NLI映射进行标签编码,具体为
{"entailment": 0, "neutral": 1, "contradiction": 2}。
数据集创建代码
- 数据加载: 使用Python的
pandas和datasets库加载和处理数据。 - 数据处理: 填充空假设,并对标签进行编码。
- 数据集结构: 数据集包含前提、假设和标签三个特征,其中标签为类别标签,有三个类别:
entailment,neutral,contradiction。 - 数据集上传: 数据集被上传至Hub,并进行了训练集和开发集的交集检查,确保没有重叠。



