yoonholee/combined-preference-dataset

Name: yoonholee/combined-preference-dataset
Creator: yoonholee
Published: 2024-07-06 03:15:04
License: 暂无描述

Hugging Face2024-07-06 更新2024-06-29 收录

下载链接：

https://hf-mirror.com/datasets/yoonholee/combined-preference-dataset

下载链接

链接失效反馈

官方服务：

资源简介：

该数据集是一个组合偏好数据集，所有示例都经过二值化和标准化处理，适用于`tokenizer.apply_chat_template()`。数据集包含两个主要部分：rejected和chosen，每个部分都包含content和role两个字段。此外，数据集还包括source、source_sub和metadata字段，其中metadata包含多个子字段，如coherence、complexity、correctness等，用于评估数据的质量。数据集分为训练集和测试集，分别包含537265和59697个示例。数据集的总下载大小为1150549338字节，总数据集大小为2119150331.0字节。数据集的来源包括多个公开的偏好数据集，如UltraFeedback、CodeUltraFeedback、HelpSteer2等。

This dataset is a combined preference dataset, with all examples binarized and standardized for `tokenizer.apply_chat_template()`. The dataset includes two main parts: rejected and chosen, each containing content and role fields. Additionally, the dataset includes source, source_sub, and metadata fields, where metadata contains multiple subfields such as coherence, complexity, correctness, etc., for assessing data quality. The dataset is divided into training and test sets, containing 537265 and 59697 examples respectively. The total download size of the dataset is 1150549338 bytes, and the total dataset size is 2119150331.0 bytes. The dataset sources include multiple publicly available preference datasets such as UltraFeedback, CodeUltraFeedback, HelpSteer2, etc.

提供机构：

yoonholee

原始信息汇总

数据集概述

数据集信息

特征

rejected:
- content: string
- role: string
chosen:
- content: string
- role: string
source: string
source_sub: string
metadata:
- coherence: sequence of string
- complexity: sequence of string
- correctness: sequence of string
- helpfulness: sequence of string
- honesty: sequence of string
- instruction_following: sequence of string
- length: sequence of int64
- preference: string
- truthfulness: sequence of string
- verbosity: sequence of string

数据分割

train:
- num_bytes: 1907232457.986798
- num_examples: 537265
test:
- num_bytes: 211917873.01320183
- num_examples: 59697

数据集大小

download_size: 1150549338
dataset_size: 2119150331.0

配置

config_name: default
- data_files:
  - split: train, path: data/train-*
  - split: test, path: data/test-*

数据集来源

openbmb/UltraFeedback
coseal/CodeUltraFeedback
nvidia/HelpSteer2
PKU-Alignment/PKU-SafeRLHF
argilla/Capybara-Preferences-Filtered
argilla/distilabel-intel-orca-dpo-pairs
argilla/distilabel-math-preference-dpo
stanfordnlp/SHP

5,000+

优质数据集

54 个

任务类型

进入经典数据集