
pietrolesci/recast_white

Source: Hugging Face. Updated 2022-04-22; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/pietrolesci/recast_white
Resource description:
## Overview

This dataset was introduced in "Inference is Everything: Recasting Semantic Resources into a Unified Evaluation Framework", Aaron Steven White, Pushpendre Rastogi, Kevin Duh, Benjamin Van Durme. IJCNLP, 2017. Original data available [here](https://github.com/decompositional-semantics-initiative/DNC/raw/master/inference_is_everything.zip).

## Dataset curation

The following processing is applied:

- `hypothesis_grammatical` and `judgement_valid` columns are filled with `""` when empty
- all columns are stripped
- the `entailed` column is renamed `label`
- `label` column is encoded with the following mapping `{"not-entailed": 0, "entailed": 1}`
- columns `rating` and `good_word` are dropped from `fnplus` dataset

## Code to generate the dataset

```python
import pandas as pd
from datasets import Features, Value, ClassLabel, Dataset, DatasetDict

ds = {}

for name in ("fnplus", "sprl", "dpr"):

    # read data: records are separated by blank lines, one "key: value" field per line
    with open(f"<path to files>/{name}_data.txt", "r") as f:
        data = f.read()
    data = data.split("\n\n")
    data = [lines.split("\n") for lines in data]
    data = [dict([col.split(":", maxsplit=1) for col in line if len(col) > 0]) for line in data]
    df = pd.DataFrame(data)

    # fill empty hypothesis_grammatical and judgement_valid
    df["hypothesis_grammatical"] = df["hypothesis_grammatical"].fillna("")
    df["judgement_valid"] = df["judgement_valid"].fillna("")

    # fix dtype
    df["index"] = df["index"].astype(int)

    # strip
    for col in df.select_dtypes(object).columns:
        df[col] = df[col].str.strip()

    # rename columns
    df = df.rename(columns={"entailed": "label"})

    # encode labels
    df["label"] = df["label"].map({"not-entailed": 0, "entailed": 1})

    # cast to dataset
    features = Features({
        "provenance": Value(dtype="string", id=None),
        "index": Value(dtype="int64", id=None),
        "text": Value(dtype="string", id=None),
        "hypothesis": Value(dtype="string", id=None),
        "partof": Value(dtype="string", id=None),
        "hypothesis_grammatical": Value(dtype="string", id=None),
        "judgement_valid": Value(dtype="string", id=None),
        "label": ClassLabel(num_classes=2, names=["not-entailed", "entailed"]),
    })

    # select common columns (this is what drops `rating` and `good_word` from fnplus)
    df = df.loc[:, list(features.keys())]

    ds[name] = Dataset.from_pandas(df, features=features)

ds = DatasetDict(ds)
ds.push_to_hub("recast_white", token="<token>")
```
Provider:
pietrolesci
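Summary of original information

Dataset overview

This dataset was introduced in the paper "Inference is Everything: Recasting Semantic Resources into a Unified Evaluation Framework" by Aaron Steven White, Pushpendre Rastogi, Kevin Duh, and Benjamin Van Durme, published at IJCNLP 2017. The original data is available at: https://github.com/decompositional-semantics-initiative/DNC/raw/master/inference_is_everything.zip

Dataset processing

  • Empty values in the hypothesis_grammatical and judgement_valid columns are filled with "".
  • All columns are stripped of leading and trailing whitespace.
  • The entailed column is renamed to label.
  • The label column is encoded with the mapping {"not-entailed": 0, "entailed": 1}.
  • The rating and good_word columns are dropped from the fnplus dataset.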
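
The column-level steps above can be sketched on plain Python dictionaries, without pandas. This is a minimal illustration, not the code used to build the dataset: the helper name `process_record` and the sample record are hypothetical, and only the column names and the label mapping come from the dataset card.

```python
# Mapping and column names taken from the dataset card.
LABEL_MAP = {"not-entailed": 0, "entailed": 1}
FILL_EMPTY = ("hypothesis_grammatical", "judgement_valid")
DROPPED = ("rating", "good_word")


def process_record(record):
    """Apply the listed processing steps to one raw record (a dict of strings).

    Hypothetical helper for illustration only.
    """
    # Strip surrounding whitespace from every value.
    out = {key: value.strip() for key, value in record.items()}
    # Fill missing hypothesis_grammatical / judgement_valid with "".
    for col in FILL_EMPTY:
        out.setdefault(col, "")
    # Rename `entailed` to `label` and encode it as 0/1.
    out["label"] = LABEL_MAP[out.pop("entailed")]
    # Drop the columns that only the fnplus data carries.
    for col in DROPPED:
        out.pop(col, None)
    return out


# Made-up record, not taken from the real data files.
raw = {
    "text": " A sample premise . ",
    "hypothesis": " A sample hypothesis . ",
    "entailed": " not-entailed ",
    "rating": " 2 ",
}
processed = process_record(raw)
print(processed)
```

Stripping before the label lookup matters here: the raw value `" not-entailed "` would not match a key in the mapping until its whitespace is removed.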

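Dataset generation code

The dataset is generated through the following steps:

  1. Read the data files, split the data into records, convert them to a list of dictionaries, and then to a DataFrame.
  2. Fill empty values in the hypothesis_grammatical and judgement_valid columns.
  3. Cast the index column to an integer type.
  4. Strip whitespace from all text columns.
  5. Rename the entailed column to label.
  6. Encode the label column as a binary class label.
  7. Define the feature schema of the dataset.
  8. Select the columns listed in the feature schema.
  9. Convert the DataFrame to a Dataset and upload it to the Hugging Face Hub.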
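
Step 1 is the least obvious part, so here is a small standard-library sketch of it. It assumes, as the generation code implies, that records are separated by blank lines and that each line holds one `key: value` field. The function name `parse_records` and the sample text are hypothetical and are not taken from the real data files.

```python
def parse_records(raw_text):
    """Split blank-line-separated blocks into dicts of "key: value" fields.

    Hypothetical helper; a line without a colon would raise ValueError.
    """
    records = []
    for block in raw_text.split("\n\n"):
        fields = {}
        for line in block.split("\n"):
            if not line:
                continue  # skip empty lines, e.g. a trailing newline
            # Split on the first colon only, so values may contain colons.
            key, value = line.split(":", maxsplit=1)
            fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


# Made-up sample in the assumed file layout.
sample = (
    "index: 0\n"
    "text: A sample premise .\n"
    "hypothesis: A sample hypothesis .\n"
    "entailed: entailed\n"
    "\n"
    "index: 1\n"
    "text: Another premise : with a colon .\n"
    "hypothesis: Another hypothesis .\n"
    "entailed: not-entailed\n"
)
records = parse_records(sample)
print(records)
```

Using `maxsplit=1` keeps any colon inside the sentence text as part of the value. All parsed values are still strings at this point, which is why the pipeline later casts `index` to an integer.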