pietrolesci/gen_debiased_nli
Source: Hugging Face. Updated 2022-04-25; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/pietrolesci/gen_debiased_nli
Description:
## Overview
Original dataset available [here](https://github.com/jimmycode/gen-debiased-nli#training-with-our-datasets).
```bibtex
@inproceedings{gen-debiased-nli-2022,
    title = "Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets",
    author = "Wu, Yuxiang and
      Gardner, Matt and
      Stenetorp, Pontus and
      Dasigi, Pradeep",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
    month = may,
    year = "2022",
    publisher = "Association for Computational Linguistics",
}
```
## Dataset curation
No curation.
## Code to create the dataset
```python
import json
from itertools import combinations
from pathlib import Path

import pandas as pd
from datasets import ClassLabel, Dataset, DatasetDict, Features, Value

# load data: one DataFrame per JSONL file found under `path`
path = Path("./")
ds = {}
for i in path.rglob("*.jsonl"):
    print(i)
    # split name from the file stem: lowercase, hyphens replaced by underscores
    name = i.stem.lower().replace("-", "_")
    with i.open("r") as fl:
        df = pd.DataFrame([json.loads(line) for line in fl])
    ds[name] = df

# cast to dataset
features = Features(
    {
        "premise": Value(dtype="string"),
        "hypothesis": Value(dtype="string"),
        "label": ClassLabel(num_classes=3, names=["entailment", "neutral", "contradiction"]),
        "type": Value(dtype="string"),
    }
)
ds = DatasetDict({k: Dataset.from_pandas(v, features=features) for k, v in ds.items()})
ds.push_to_hub("pietrolesci/gen_debiased_nli", token="<token>")

# check overlap between splits: count rows matching on (premise, hypothesis, label)
for i, j in combinations(ds.keys(), 2):
    print(
        f"{i} - {j}: ",
        pd.merge(
            ds[i].to_pandas(),
            ds[j].to_pandas(),
            on=["premise", "hypothesis", "label"],
            how="inner",
        ).shape[0],
    )
#> mnli_seq_z - snli_z_aug: 0
#> mnli_seq_z - mnli_par_z: 477149
#> mnli_seq_z - snli_seq_z: 0
#> mnli_seq_z - mnli_z_aug: 333840
#> mnli_seq_z - snli_par_z: 0
#> snli_z_aug - mnli_par_z: 0
#> snli_z_aug - snli_seq_z: 506624
#> snli_z_aug - mnli_z_aug: 0
#> snli_z_aug - snli_par_z: 504910
#> mnli_par_z - snli_seq_z: 0
#> mnli_par_z - mnli_z_aug: 334960
#> mnli_par_z - snli_par_z: 0
#> snli_seq_z - mnli_z_aug: 0
#> snli_seq_z - snli_par_z: 583107
#> mnli_z_aug - snli_par_z: 0
```
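The loading step can be tried in isolation without `pandas` or `datasets`. The following is a minimal sketch using only the standard library; the file name `snli_z-aug.jsonl` and the record contents are hypothetical, invented for illustration, and the helper names `split_name` and `read_jsonl` are not part of the original script.

```python
import json
import tempfile
from pathlib import Path


def split_name(path: Path) -> str:
    """Derive a split name from a file name: lowercase stem, '-' replaced by '_'."""
    return path.stem.lower().replace("-", "_")


def read_jsonl(path: Path) -> list:
    """Parse one JSON object per non-empty line."""
    with path.open("r", encoding="utf-8") as fl:
        return [json.loads(line) for line in fl if line.strip()]


# Hypothetical file name and record, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "snli_z-aug.jsonl"
    record = {
        "premise": "A man is sleeping.",
        "hypothesis": "A person rests.",
        "label": 0,
        "type": "example",
    }
    sample.write_text(json.dumps(record) + "\n", encoding="utf-8")

    # Same traversal as the script: every *.jsonl file becomes one split.
    splits = {split_name(p): read_jsonl(p) for p in Path(tmp).rglob("*.jsonl")}

print(sorted(splits))
```

Each value in `splits` is a list of dictionaries, which is the shape `pd.DataFrame` accepts in the script above.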
Provider:
pietrolesci
Summary of original information
Dataset overview
- Title: Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets
- Authors: Wu, Yuxiang; Gardner, Matt; Stenetorp, Pontus; Dasigi, Pradeep
- Venue: 60th Annual Meeting of the Association for Computational Linguistics (2022)
- Publisher: Association for Computational Linguistics
Dataset structure
- Data format: JSONL
- Fields:
  - premise: string
  - hypothesis: string
  - label: class label with three classes: "entailment", "neutral", "contradiction"
  - type: string
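The schema above can be illustrated with a small standard-library sketch. It assumes records as they look after casting, where `label` is stored as an integer id; `ClassLabel` assigns ids by position in the `names` list, so "entailment" is 0, "neutral" is 1, and "contradiction" is 2. The example record and the `validate` helper are hypothetical and not part of the dataset or its tooling.

```python
# Label names in the order declared in the dataset features.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]
LABEL_TO_ID = {name: idx for idx, name in enumerate(LABEL_NAMES)}

# Expected fields and their Python types after casting.
FIELDS = {"premise": str, "hypothesis": str, "label": int, "type": str}


def validate(record: dict) -> bool:
    """Check that a record has exactly the four expected fields with valid values."""
    if set(record) != set(FIELDS):
        return False
    if not all(isinstance(record[key], typ) for key, typ in FIELDS.items()):
        return False
    return 0 <= record["label"] < len(LABEL_NAMES)


# Hypothetical record, invented for illustration.
example = {
    "premise": "A woman is playing a guitar on stage.",
    "hypothesis": "A woman is performing music.",
    "label": LABEL_TO_ID["entailment"],
    "type": "example",
}

print(validate(example), LABEL_NAMES[example["label"]])
```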
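Dataset creation code
- Loading: each JSONL file is parsed line by line and loaded into a pandas DataFrame.
- Conversion: each DataFrame is converted to a Hugging Face Dataset object with an explicit feature schema.
- Upload: the dataset is pushed to the Hugging Face Hub under the repository "pietrolesci/gen_debiased_nli".
- Checks: every pair of subsets is checked for overlapping examples.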
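One point about the overlap check is worth illustrating. An inner merge on (premise, hypothesis, label) returns one row per matching pair of rows, so if a subset contains duplicate examples, the reported count can exceed the number of distinct shared examples. Whether the actual subsets contain duplicates is not stated in the source. The sketch below reproduces both counts with the standard library on toy data; the rows and the `overlap` helper are invented for illustration.

```python
from collections import Counter
from itertools import combinations

KEY = ("premise", "hypothesis", "label")


def keys(rows):
    """Project each row onto the (premise, hypothesis, label) key."""
    return [tuple(row[k] for k in KEY) for row in rows]


def overlap(a, b):
    """Return (matching row pairs, distinct shared examples) for two subsets."""
    count_a, count_b = Counter(keys(a)), Counter(keys(b))
    shared = count_a.keys() & count_b.keys()
    # An inner merge yields count_a[k] * count_b[k] rows for each shared key k.
    pairs = sum(count_a[k] * count_b[k] for k in shared)
    return pairs, len(shared)


# Toy subsets, invented for illustration; "a" contains one duplicated row.
splits = {
    "a": [
        {"premise": "p1", "hypothesis": "h1", "label": 0},
        {"premise": "p1", "hypothesis": "h1", "label": 0},
        {"premise": "p2", "hypothesis": "h2", "label": 1},
    ],
    "b": [
        {"premise": "p1", "hypothesis": "h1", "label": 0},
        {"premise": "p3", "hypothesis": "h3", "label": 2},
    ],
    "c": [
        {"premise": "p4", "hypothesis": "h4", "label": 1},
    ],
}

for i, j in combinations(splits, 2):
    print(f"{i} - {j}:", overlap(splits[i], splits[j]))
```

Here "a" and "b" share one distinct example, but the merge-style count is 2 because that example appears twice in "a". Deduplicating each subset on the key columns before merging would make the two counts agree.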



