pietrolesci/recast_white
Source: Hugging Face. Updated 2022-04-22; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/pietrolesci/recast_white
Resource description:
## Overview
This dataset was introduced in "Inference is Everything: Recasting Semantic Resources into a Unified Evaluation Framework", Aaron Steven White, Pushpendre Rastogi, Kevin Duh, Benjamin Van Durme. IJCNLP, 2017. The original data is available [here](https://github.com/decompositional-semantics-initiative/DNC/raw/master/inference_is_everything.zip).
## Dataset curation
The following processing is applied:
- `hypothesis_grammatical` and `judgement_valid` columns are filled with `""` when empty
- all columns are stripped
- the `entailed` column is renamed `label`
- `label` column is encoded with the following mapping `{"not-entailed": 0, "entailed": 1}`
- columns `rating` and `good_word` are dropped from `fnplus` dataset
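
As a rough illustration of what these steps do to a single row, here is a minimal sketch in plain Python rather than pandas. The `curate` helper, the constant names, and the sample record are hypothetical and invented for this example; only the column names and the label mapping come from the list above.

```python
LABEL_MAP = {"not-entailed": 0, "entailed": 1}
OPTIONAL_COLUMNS = ("hypothesis_grammatical", "judgement_valid")
DROPPED_COLUMNS = ("rating", "good_word")  # only present in fnplus


def curate(record: dict) -> dict:
    """Apply the curation steps to one raw record (hypothetical helper)."""
    # Strip whitespace from every string value.
    row = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    # Fill the two optional columns with "" when they are absent or None.
    for col in OPTIONAL_COLUMNS:
        if row.get(col) is None:
            row[col] = ""
    # Rename `entailed` to `label` and encode it as an integer.
    row["label"] = LABEL_MAP[row.pop("entailed")]
    # Drop the fnplus-only columns if they are present.
    for col in DROPPED_COLUMNS:
        row.pop(col, None)
    return row


# Invented record, for illustration only.
raw = {
    "index": 3,
    "text": " A dog barked . ",
    "hypothesis": "Something barked .",
    "entailed": " entailed",
    "judgement_valid": None,
    "rating": "5",
}
curated = curate(raw)
print(curated)
```

The sketch strips before filling, whereas the pandas version fills first; for these steps the resulting row is the same either way.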
## Code to generate the dataset
```python
import pandas as pd
from datasets import Features, Value, ClassLabel, Dataset, DatasetDict

ds = {}
for name in ("fnplus", "sprl", "dpr"):

    # read data
    with open(f"<path to files>/{name}_data.txt", "r") as f:
        data = f.read()
    data = data.split("\n\n")
    data = [lines.split("\n") for lines in data]
    data = [dict([col.split(":", maxsplit=1) for col in line if len(col) > 0]) for line in data]
    df = pd.DataFrame(data)

    # fill empty hypothesis_grammatical and judgement_valid
    df["hypothesis_grammatical"] = df["hypothesis_grammatical"].fillna("")
    df["judgement_valid"] = df["judgement_valid"].fillna("")

    # fix dtype
    df["index"] = df["index"].astype(int)

    # strip
    for col in df.select_dtypes(object).columns:
        df[col] = df[col].str.strip()

    # rename columns
    df = df.rename(columns={"entailed": "label"})

    # encode labels
    df["label"] = df["label"].map({"not-entailed": 0, "entailed": 1})

    # cast to dataset
    features = Features({
        "provenance": Value(dtype="string", id=None),
        "index": Value(dtype="int64", id=None),
        "text": Value(dtype="string", id=None),
        "hypothesis": Value(dtype="string", id=None),
        "partof": Value(dtype="string", id=None),
        "hypothesis_grammatical": Value(dtype="string", id=None),
        "judgement_valid": Value(dtype="string", id=None),
        "label": ClassLabel(num_classes=2, names=["not-entailed", "entailed"]),
    })

    # select common columns
    df = df.loc[:, list(features.keys())]

    ds[name] = Dataset.from_pandas(df, features=features)

ds = DatasetDict(ds)
ds.push_to_hub("recast_white", token="<token>")
```
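
The parsing step in the script implies that each `*_data.txt` file holds records separated by blank lines, with one `key: value` pair per line. The sketch below uses only the standard library to illustrate that assumed layout on an invented two-record sample. `SAMPLE`, `parse_records`, and all field values are hypothetical and are not taken from the original data.

```python
# Hypothetical sample in the "key: value" layout implied by the parsing code;
# the field values are invented for illustration.
SAMPLE = """provenance: example-source
index: 0
text: The cat sat on the mat .
hypothesis: Something sat .
entailed: entailed
partof: train

provenance: example-source
index: 1
text: Time: it flies .
hypothesis: Nothing flies .
entailed: not-entailed
partof: dev
"""


def parse_records(raw: str) -> list:
    """Split blank-line-separated blocks into dicts of stripped key/value pairs."""
    records = []
    for block in raw.split("\n\n"):
        # Split each non-empty line on the first colon only.
        pairs = [line.split(":", maxsplit=1) for line in block.split("\n") if line]
        if pairs:  # skip empty blocks, e.g. from trailing blank lines
            records.append({key.strip(): value.strip() for key, value in pairs})
    return records


records = parse_records(SAMPLE)
print(len(records))
print(records[1]["text"])
```

Splitting on the first colon only (`maxsplit=1`) keeps colons inside a value intact, as in the second record's `text` field. A non-empty line with no colon at all would raise an error in this sketch, just as it would in the script.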
Provider:
pietrolesci
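Summary of original information
Dataset overview
This dataset was introduced in the paper "Inference is Everything: Recasting Semantic Resources into a Unified Evaluation Framework" by Aaron Steven White, Pushpendre Rastogi, Kevin Duh, and Benjamin Van Durme, published at IJCNLP 2017. The original data is available at https://github.com/decompositional-semantics-initiative/DNC/raw/master/inference_is_everything.zip.
Dataset processing
- Empty values in the `hypothesis_grammatical` and `judgement_valid` columns are filled with `""`.
- All columns are stripped.
- The `entailed` column is renamed `label`.
- The `label` column is encoded with the mapping `{"not-entailed": 0, "entailed": 1}`.
- The `rating` and `good_word` columns are dropped from the `fnplus` dataset.
Dataset generation code
The dataset is generated with the following steps:
- Read each data file, split it into records, convert the records to a list of dictionaries, and build a DataFrame from them.
- Fill empty values in the `hypothesis_grammatical` and `judgement_valid` columns.
- Cast the `index` column to integer.
- Strip all text columns.
- Rename the `entailed` column to `label`.
- Encode the `label` column as a binary class label.
- Define the feature schema of the dataset.
- Select the columns named in the feature schema.
- Convert the DataFrame to a `Dataset` and push it to the Hub.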



