five

yoonholee/combined-preference-dataset

收藏
Hugging Face2024-07-06 更新2024-06-29 收录
下载链接:
https://hf-mirror.com/datasets/yoonholee/combined-preference-dataset
下载链接
链接失效反馈
官方服务:
资源简介:
该数据集是一个组合偏好数据集,所有示例都经过二值化和标准化处理,适用于`tokenizer.apply_chat_template()`。数据集包含两个主要部分:rejected和chosen,每个部分都包含content和role两个字段。此外,数据集还包括source、source_sub和metadata字段,其中metadata包含多个子字段,如coherence、complexity、correctness等,用于评估数据的质量。数据集分为训练集和测试集,分别包含537265和59697个示例。数据集的总下载大小为1150549338字节,总数据集大小为2119150331.0字节。数据集的来源包括多个公开的偏好数据集,如UltraFeedback、CodeUltraFeedback、HelpSteer2等。

This dataset is a combined preference dataset, with all examples binarized and standardized for `tokenizer.apply_chat_template()`. The dataset includes two main parts: rejected and chosen, each containing content and role fields. Additionally, the dataset includes source, source_sub, and metadata fields, where metadata contains multiple subfields such as coherence, complexity, correctness, etc., for assessing data quality. The dataset is divided into training and test sets, containing 537265 and 59697 examples respectively. The total download size of the dataset is 1150549338 bytes, and the total dataset size is 2119150331.0 bytes. The dataset sources include multiple publicly available preference datasets such as UltraFeedback, CodeUltraFeedback, HelpSteer2, etc.
提供机构:
yoonholee
原始信息汇总

数据集概述

数据集信息

特征

  • rejected:
    • content: string
    • role: string
  • chosen:
    • content: string
    • role: string
  • source: string
  • source_sub: string
  • metadata:
    • coherence: sequence of string
    • complexity: sequence of string
    • correctness: sequence of string
    • helpfulness: sequence of string
    • honesty: sequence of string
    • instruction_following: sequence of string
    • length: sequence of int64
    • preference: string
    • truthfulness: sequence of string
    • verbosity: sequence of string

数据分割

  • train:
    • num_bytes: 1907232457.986798
    • num_examples: 537265
  • test:
    • num_bytes: 211917873.01320183
    • num_examples: 59697

数据集大小

  • download_size: 1150549338
  • dataset_size: 2119150331.0

配置

  • config_name: default
    • data_files:
      • split: train, path: data/train-*
      • split: test, path: data/test-*

数据集来源

  • openbmb/UltraFeedback
  • coseal/CodeUltraFeedback
  • nvidia/HelpSteer2
  • PKU-Alignment/PKU-SafeRLHF
  • argilla/Capybara-Preferences-Filtered
  • argilla/distilabel-intel-orca-dpo-pairs
  • argilla/distilabel-math-preference-dpo
  • stanfordnlp/SHP
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作