pietrolesci/mpe

Source: Hugging Face. Updated: 2022-04-25. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/pietrolesci/mpe
Description:
## Overview

Original dataset [here](https://github.com/aylai/MultiPremiseEntailment).

## Dataset curation

Same data and splits as the original. The following columns have been added:

- `premise`: concatenation of `premise1`, `premise2`, `premise3`, and `premise4`
- `label`: encoded `gold_label` with the following mapping `{"entailment": 0, "neutral": 1, "contradiction": 2}`

## Code to create the dataset

```python
import pandas as pd
from datasets import Features, Value, ClassLabel, Dataset, DatasetDict
from pathlib import Path

# read data: one tab-separated file per split; the split name is taken from the file name
path = Path("<path to files>")
datasets = {}
for dataset_path in path.rglob("*.txt"):
    df = pd.read_csv(dataset_path, sep="\t")
    datasets[dataset_path.name.split("_")[1].split(".")[0]] = df

ds = {}
for name, df_ in datasets.items():
    df = df_.copy()

    # fix parsing error for dev split
    if name == "dev":
        # one row has the judgment count and the label merged into a single field;
        # the whitespace in this literal must match the raw file (it may be a tab)
        df.loc[df["contradiction_judgments"] == "3 contradiction", "contradiction_judgments"] = 3
        df.loc[df["gold_label"].isna(), "gold_label"] = "contradiction"

    # check no nan
    assert df.isna().sum().sum() == 0

    # fix dtypes
    for col in ("entailment_judgments", "neutral_judgments", "contradiction_judgments"):
        df[col] = df[col].astype(int)

    # fix premise column: keep the text after the "/" in each premise, then join the four
    for i in range(1, 4 + 1):
        df[f"premise{i}"] = df[f"premise{i}"].str.split("/", expand=True)[1]
    df["premise"] = df[[f"premise{i}" for i in range(1, 4 + 1)]].agg(" ".join, axis=1)

    # encode labels
    df["label"] = df["gold_label"].map({"entailment": 0, "neutral": 1, "contradiction": 2})

    # cast to dataset
    features = Features({
        "premise1": Value(dtype="string", id=None),
        "premise2": Value(dtype="string", id=None),
        "premise3": Value(dtype="string", id=None),
        "premise4": Value(dtype="string", id=None),
        "premise": Value(dtype="string", id=None),
        "hypothesis": Value(dtype="string", id=None),
        "entailment_judgments": Value(dtype="int32"),
        "neutral_judgments": Value(dtype="int32"),
        "contradiction_judgments": Value(dtype="int32"),
        "gold_label": Value(dtype="string"),
        "label": ClassLabel(num_classes=3, names=["entailment", "neutral", "contradiction"]),
    })
    ds[name] = Dataset.from_pandas(df, features=features)

# push to hub
ds = DatasetDict(ds)
ds.push_to_hub("mpe", token="<token>")

# check overlap between splits
from itertools import combinations

for i, j in combinations(ds.keys(), 2):
    print(
        f"{i} - {j}: ",
        pd.merge(
            ds[i].to_pandas(),
            ds[j].to_pandas(),
            on=["premise", "hypothesis", "label"],
            how="inner",
        ).shape[0],
    )
#> dev - test: 0
#> dev - train: 0
#> test - train: 0
```
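The overlap check at the end of the script counts the rows an inner join on `premise`, `hypothesis`, and `label` would produce for each pair of splits. The following is a dependency-free sketch of that same count, not the author's code: the helper `overlap_count` and the toy rows are hypothetical and exist only for illustration.

```python
from collections import Counter
from itertools import combinations

KEYS = ("premise", "hypothesis", "label")


def overlap_count(rows_a, rows_b, keys=KEYS):
    """Return the number of rows an inner join on `keys` would produce."""
    count_a = Counter(tuple(row[k] for k in keys) for row in rows_a)
    count_b = Counter(tuple(row[k] for k in keys) for row in rows_b)
    # Each matching key contributes (occurrences in a) * (occurrences in b) rows.
    return sum(n * count_b[key] for key, n in count_a.items() if key in count_b)


# Toy splits with made-up rows (not taken from the MPE data).
shared = {"premise": "A dog runs. A dog barks.", "hypothesis": "A dog is outside.", "label": 1}
splits = {
    "train": [shared],
    "dev": [{"premise": "Two men talk. A man laughs.", "hypothesis": "People are silent.", "label": 2}],
    "test": [dict(shared), {"premise": "A bird sings.", "hypothesis": "It is quiet.", "label": 2}],
}

overlaps = {
    (a, b): overlap_count(splits[a], splits[b])
    for a, b in combinations(splits, 2)
}
for (a, b), n in overlaps.items():
    print(f"{a} - {b}: {n}")
```

In this toy setup only the `train` and `test` splits share a row. For the real dataset the script above reports zero for every pair.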
Provided by:
pietrolesci
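Summary of original information

Dataset overview

Original dataset: https://github.com/aylai/MultiPremiseEntailment

Dataset curation

Same data and splits as the original dataset. The following columns have been added:

  • premise: concatenation of premise1, premise2, premise3, and premise4.
  • label: gold_label encoded with the mapping {"entailment": 0, "neutral": 1, "contradiction": 2}.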
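As a rough illustration of how the two added columns are derived, here is a minimal pure-Python sketch for a single row. It mirrors the steps in the creation script (keep the text after the "/" in each premise, join the four premises with spaces, map the label), but the function `add_columns` and the sample row, including its `id1/...` prefixes, are hypothetical and only stand in for the raw file format.

```python
LABEL2ID = {"entailment": 0, "neutral": 1, "contradiction": 2}


def add_columns(row):
    """Return a copy of `row` with cleaned premises plus `premise` and `label`."""
    out = dict(row)
    for i in range(1, 4 + 1):
        # Keep the second "/"-separated field, as the creation script does.
        out[f"premise{i}"] = row[f"premise{i}"].split("/")[1]
    out["premise"] = " ".join(out[f"premise{i}"] for i in range(1, 4 + 1))
    out["label"] = LABEL2ID[row["gold_label"]]
    return out


# Made-up example row; the prefix before "/" is an assumed identifier.
raw = {
    "premise1": "id1/A man rides a bike.",
    "premise2": "id2/He wears a helmet.",
    "premise3": "id3/The road is wet.",
    "premise4": "id4/Cars pass by.",
    "hypothesis": "A person is outdoors.",
    "gold_label": "entailment",
}

processed = add_columns(raw)
print(processed["premise"])
print(processed["label"])
```

Because only the second "/"-separated field is kept, a premise whose text itself contains a "/" would be cut at that point, both here and in the pandas version.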

