five

JINIAC/kuci

收藏
Hugging Face2024-05-10 更新2024-06-12 收录
下载链接:
https://hf-mirror.com/datasets/JINIAC/kuci
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: cc-by-sa-4.0 dataset_info: features: - name: id dtype: int64 - name: context dtype: string - name: choice_a dtype: string - name: choice_b dtype: string - name: choice_c dtype: string - name: choice_d dtype: string - name: label dtype: string - name: agreement dtype: int64 - name: core_event_pair dtype: string - name: conversations list: - name: from dtype: string - name: value dtype: string splits: - name: train num_bytes: 22395366 num_examples: 24682 - name: validation num_bytes: 2711493 num_examples: 2968 - name: test num_bytes: 2776312 num_examples: 3047 download_size: 11821891 dataset_size: 27883171 configs: - config_name: default data_files: - split: train path: data/train-* - split: validation path: data/validation-* - split: test path: data/test-* --- 以下のデータセットのagreement=4(偶発的な関係があることに同意したクラウドワーカーの数が最大)について、conversations(chat_templateで読み込める形式のカラム)を追加して作成しました。 https://github.com/ku-nlp/KUCI ## Reference/Citation [1] (Omura et al., 2020) ``` @inproceedings{omura-etal-2020-method, title = "{A} {M}ethod for {B}uilding a {C}ommonsense {I}nference {D}ataset based on {B}asic {E}vents", author = "Omura, Kazumasa and Kawahara, Daisuke and Kurohashi, Sadao", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2020.emnlp-main.192/", doi = "10.18653/v1/2020.emnlp-main.192", pages = "2450--2460", } ``` [2] (Omura & Kurohashi, 2022) ``` @inproceedings{omura-kurohashi-2022-improving, title = "{I}mproving {C}ommonsense {C}ontingent {R}easoning by {P}seudo-data and its {A}pplication to the {R}elated {T}asks", author = "Omura, Kazumasa and Kurohashi, Sadao", booktitle = "Proceedings of the 29th International Conference on Computational Linguistics", month = oct, year = "2022", address = "Gyeongju, Republic of Korea", publisher = "International Committee on Computational Linguistics", url = "https://aclanthology.org/2022.coling-1.68/", pages = "812--823", } ``` [3] (Omura et al., 2023) ``` @article{omura-etal-2023-building, title = "{B}uilding a {C}ommonsense {I}nference {D}ataset based on {B}asic {E}vents and its {A}pplication", author = "Omura, Kazumasa and Kawahara, Daisuke and Kurohashi, Sadao", journal = "Journal of Natural Language Processing", volume = "30", number = "4", year = "2023", doi = "10.5715/jnlp.30.1206", pages = "1206-1239", note = "(in Japanese)", } ```
提供机构:
JINIAC
原始信息汇总

数据集概述

数据集特征

  • id: 整数类型 (int64)
  • context: 字符串类型 (string)
  • choice_a: 字符串类型 (string)
  • choice_b: 字符串类型 (string)
  • choice_c: 字符串类型 (string)
  • choice_d: 字符串类型 (string)
  • label: 字符串类型 (string)
  • agreement: 整数类型 (int64)
  • core_event_pair: 字符串类型 (string)
  • conversations: 列表类型,包含以下字段:
    • from: 字符串类型 (string)
    • value: 字符串类型 (string)

数据集分割

  • train: 24682个样本,占用22395366字节
  • validation: 2968个样本,占用2711493字节
  • test: 3047个样本,占用2776312字节

数据集大小

  • 下载大小: 11821891字节
  • 数据集总大小: 27883171字节

配置

  • config_name: default
  • data_files:
    • train: 路径为data/train-*
    • validation: 路径为data/validation-*
    • test: 路径为data/test-*

数据集版权

  • 许可证: CC-BY-SA-4.0
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作