sumuks/helpsteer3-dpo-style

Name: sumuks/helpsteer3-dpo-style
Creator: sumuks
Published: 2026-03-26 02:40:05
License: 暂无描述

Hugging Face2026-03-26 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/sumuks/helpsteer3-dpo-style

下载链接

链接失效反馈

官方服务：

资源简介：

--- pretty_name: HelpSteer3 DPO Style language: - en - zh - ko - fr - es - ru - ja - de - it - pt - pl - id - nl - vi license: cc-by-4.0 task_categories: - text-generation tags: - dpo - preference-optimization - human-feedback - reward-modeling - helpsteer - nvidia size_categories: - 10K<n<100K configs: - config_name: default data_files: - split: train path: train-00000-of-00001.parquet - split: test path: test-00000-of-00001.parquet --- # Dataset Card for HelpSteer3 DPO Style ## Dataset Summary This dataset is derived from `nvidia/HelpSteer3` using the `preference` config. Each source row contains a chat-style `context`, two candidate responses, and a signed `overall_preference` score. This conversion keeps only strong preferences with `abs(overall_preference) >= 2` and maps them into DPO-style `chosen` and `rejected` rows. The original HelpSteer3 `validation` split is written as `test` here to match the train/test convention used elsewhere in this repo. ## Dataset Structure - Train source rows: 38459 - Test source rows: 2017 - Train DPO rows: 23959 - Test DPO rows: 1288 - Total DPO rows: 25247 - Dropped weak-preference rows: 15229 Each row contains these key fields: - `prompt`: Rendered conversation transcript from the source `context`. - `chosen`: Preferred assistant response chosen from `response1` or `response2`. - `rejected`: Less-preferred assistant response chosen from the other source response. - `difficulty`: `1 / abs(overall_preference)`, so `0.5` for `±2` and `0.333...` for `±3`. ## Construction Notes - Negative `overall_preference` values mean `response1` is preferred. - Positive `overall_preference` values mean `response2` is preferred. - Rows with scores `-1`, `0`, and `1` are dropped as too weak for this dataset. - Difficulty uses the absolute preference value because the sign only indicates which side won, not how hard the pair is. - No prompt-level regrouping is needed because HelpSteer3 already ships train and validation splits.

提供机构：

sumuks

5,000+

优质数据集

54 个

任务类型

进入经典数据集