
pietrolesci/gen_debiased_nli

Hugging Face · Updated 2022-04-25 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/pietrolesci/gen_debiased_nli
Description:
## Overview

Original dataset available [here](https://github.com/jimmycode/gen-debiased-nli#training-with-our-datasets).

```latex
@inproceedings{gen-debiased-nli-2022,
    title = "Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets",
    author = "Wu, Yuxiang and Gardner, Matt and Stenetorp, Pontus and Dasigi, Pradeep",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
    month = may,
    year = "2022",
    publisher = "Association for Computational Linguistics",
}
```

## Dataset curation

No curation.

## Code to create the dataset

```python
import pandas as pd
from datasets import Dataset, ClassLabel, Value, Features, DatasetDict
import json
from pathlib import Path

# load data
path = Path("./")
ds = {}
for i in path.rglob("*.jsonl"):
    print(i)
    name = str(i).split(".")[0].lower().replace("-", "_")

    with i.open("r") as fl:
        df = pd.DataFrame([json.loads(line) for line in fl])
        ds[name] = df

# cast to dataset
features = Features(
    {
        "premise": Value(dtype="string"),
        "hypothesis": Value(dtype="string"),
        "label": ClassLabel(num_classes=3, names=["entailment", "neutral", "contradiction"]),
        "type": Value(dtype="string"),
    }
)
ds = DatasetDict({k: Dataset.from_pandas(v, features=features) for k, v in ds.items()})
ds.push_to_hub("pietrolesci/gen_debiased_nli", token="<token>")

# check overlap between splits
from itertools import combinations
for i, j in combinations(ds.keys(), 2):
    print(
        f"{i} - {j}: ",
        pd.merge(
            ds[i].to_pandas(),
            ds[j].to_pandas(),
            on=["premise", "hypothesis", "label"],
            how="inner",
        ).shape[0],
    )
#> mnli_seq_z - snli_z_aug: 0
#> mnli_seq_z - mnli_par_z: 477149
#> mnli_seq_z - snli_seq_z: 0
#> mnli_seq_z - mnli_z_aug: 333840
#> mnli_seq_z - snli_par_z: 0
#> snli_z_aug - mnli_par_z: 0
#> snli_z_aug - snli_seq_z: 506624
#> snli_z_aug - mnli_z_aug: 0
#> snli_z_aug - snli_par_z: 504910
#> mnli_par_z - snli_seq_z: 0
#> mnli_par_z - mnli_z_aug: 334960
#> mnli_par_z - snli_par_z: 0
#> snli_seq_z - mnli_z_aug: 0
#> snli_seq_z - snli_par_z: 583107
#> mnli_z_aug - snli_par_z: 0
```
Provider:
pietrolesci
Summary of original information

Dataset overview

  • Title: Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets
  • Authors: Wu, Yuxiang; Gardner, Matt; Stenetorp, Pontus; Dasigi, Pradeep
  • Venue: 60th Annual Meeting of the Association for Computational Linguistics (2022)
  • Publisher: Association for Computational Linguistics

Dataset structure

  • Data format: JSONL
  • Data fields:
    • premise: string
    • hypothesis: string
    • label: class label with three classes: "entailment", "neutral", "contradiction"
    • type: string
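
The field layout above can be illustrated with a small sketch that uses only the standard library. It assumes labels are stored as integer ids in the order of the `ClassLabel` names in the creation code (0 = entailment, 1 = neutral, 2 = contradiction); whether the raw JSONL files store integers or strings is not stated here. The helper `decode_record` and the example record are hypothetical and not taken from the dataset.

```python
import json

# Label names in the order passed to ClassLabel in the creation code.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]

# Assumed Python types for each field (label as an integer id is an assumption).
EXPECTED_FIELDS = {"premise": str, "hypothesis": str, "label": int, "type": str}


def decode_record(line: str) -> dict:
    """Parse one JSONL line, check the field types, and attach the label name."""
    record = json.loads(line)
    for field, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"field {field!r} is missing or not {expected_type.__name__}")
    if not 0 <= record["label"] < len(LABEL_NAMES):
        raise ValueError(f"label id {record['label']} is out of range")
    return {**record, "label_name": LABEL_NAMES[record["label"]]}


# Hypothetical record, made up for illustration only.
example_line = json.dumps(
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "Nobody is playing music.",
        "label": 2,
        "type": "example",
    }
)
decoded = decode_record(example_line)
print(decoded["label_name"])
```

When the dataset is loaded through the `datasets` library, the `ClassLabel` feature performs this id-to-name mapping itself, so a helper like this is only useful when reading the raw files directly.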

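Dataset creation code

  • Data loading: the JSONL files are read line by line and loaded into pandas DataFrames, one per file.
  • Data conversion: each DataFrame is converted into a Hugging Face Dataset object with an explicit feature schema.
  • Data upload: the dataset is pushed to the Hugging Face Hub under the repository name "pietrolesci/gen_debiased_nli".
  • Data check: the overlap between each pair of subsets is counted.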
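
The loading and overlap-check steps above can be sketched without pandas or `datasets`. This is a simplified variant, not the original code: it compares sets of `(premise, hypothesis, label)` triples, so it counts unique shared examples, whereas the inner merge in the original counts matching row pairs and can report a larger number when a subset contains duplicates. The helpers `load_splits` and `count_shared_examples` and the toy files are hypothetical.

```python
import json
import tempfile
from itertools import combinations
from pathlib import Path

KEY_FIELDS = ("premise", "hypothesis", "label")


def load_splits(root: Path) -> dict:
    """Read every *.jsonl file under root into a list of dicts, keyed by a normalized file stem."""
    splits = {}
    for path in sorted(root.rglob("*.jsonl")):
        name = path.stem.lower().replace("-", "_")
        with path.open("r", encoding="utf-8") as fl:
            splits[name] = [json.loads(line) for line in fl if line.strip()]
    return splits


def count_shared_examples(a: list, b: list) -> int:
    """Count unique (premise, hypothesis, label) triples that appear in both lists."""
    keys_a = {tuple(row[k] for k in KEY_FIELDS) for row in a}
    keys_b = {tuple(row[k] for k in KEY_FIELDS) for row in b}
    return len(keys_a & keys_b)


# Toy files created in a temporary directory; the contents are made up.
toy_files = {
    "toy-a.jsonl": [
        {"premise": "p1", "hypothesis": "h1", "label": 0, "type": "x"},
        {"premise": "p2", "hypothesis": "h2", "label": 1, "type": "x"},
    ],
    "toy-b.jsonl": [
        {"premise": "p1", "hypothesis": "h1", "label": 0, "type": "y"},
        {"premise": "p3", "hypothesis": "h3", "label": 2, "type": "y"},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for filename, rows in toy_files.items():
        with (root / filename).open("w", encoding="utf-8") as fl:
            fl.write("\n".join(json.dumps(row) for row in rows) + "\n")
    splits = load_splits(root)

overlaps = {
    (i, j): count_shared_examples(splits[i], splits[j])
    for i, j in combinations(splits, 2)
}
print(overlaps)
```

The `type` field is left out of the comparison key, matching the `on=["premise", "hypothesis", "label"]` argument of the original merge.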