sumuks/helpsteer3-dpo-style
Source: Hugging Face. Updated: 2026-03-26. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/sumuks/helpsteer3-dpo-style
Resource description:
---
pretty_name: HelpSteer3 DPO Style
language:
- en
- zh
- ko
- fr
- es
- ru
- ja
- de
- it
- pt
- pl
- id
- nl
- vi
license: cc-by-4.0
task_categories:
- text-generation
tags:
- dpo
- preference-optimization
- human-feedback
- reward-modeling
- helpsteer
- nvidia
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: train-00000-of-00001.parquet
  - split: test
    path: test-00000-of-00001.parquet
---
# Dataset Card for HelpSteer3 DPO Style
## Dataset Summary
This dataset is derived from `nvidia/HelpSteer3` using the `preference` config.
Each source row contains a chat-style `context`, two candidate responses, and a signed `overall_preference` score.
This conversion keeps only strong preferences with `abs(overall_preference) >= 2` and maps them into DPO-style `chosen` and `rejected` rows.
The original HelpSteer3 `validation` split is written as `test` here to match the train/test convention used elsewhere in this repo.
## Dataset Structure
- Train source rows: 38459
- Test source rows: 2017
- Train DPO rows: 23959
- Test DPO rows: 1288
- Total DPO rows: 25247
- Dropped weak-preference rows: 15229
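
The counts above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch; the variable names are illustrative and are not part of the dataset or of any conversion script.

```python
# Row counts copied from the list above.
train_source, test_source = 38459, 2017
train_dpo, test_dpo = 23959, 1288

total_source = train_source + test_source
total_dpo = train_dpo + test_dpo

# Every source row is either kept as a DPO row or dropped as a weak preference.
dropped = total_source - total_dpo
kept_fraction = total_dpo / total_source

print(f"total DPO rows: {total_dpo}")
print(f"dropped rows:   {dropped}")
print(f"kept fraction:  {kept_fraction:.3f}")
```

Under these figures, roughly 62% of the source rows carry a preference strong enough to be kept.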

Each row contains these key fields:
- `prompt`: Rendered conversation transcript from the source `context`.
- `chosen`: The preferred assistant response, taken from `response1` or `response2` according to the sign of `overall_preference`.
- `rejected`: The less-preferred assistant response, i.e. whichever of `response1` or `response2` was not chosen.
- `difficulty`: `1 / abs(overall_preference)`, so `0.5` for `±2` and `0.333...` for `±3`.
## Construction Notes
- Negative `overall_preference` values mean `response1` is preferred.
- Positive `overall_preference` values mean `response2` is preferred.
- Rows with scores `-1`, `0`, and `1` are dropped as too weak for this dataset.
- Difficulty uses the absolute preference value because the sign only indicates which side won, not how hard the pair is.
- No prompt-level regrouping is needed because HelpSteer3 already ships train and validation splits.
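
The rules above can be sketched as a single row-mapping function. This is a minimal illustration, not the script that produced the dataset. It assumes that `context` is a list of `{"role", "content"}` turns, and `render_prompt` is a hypothetical helper: the card does not specify how the transcript is actually rendered.

```python
from typing import Optional


def render_prompt(context: list) -> str:
    # Hypothetical rendering: one "role: content" line per turn.
    # The real transcript format used by the dataset may differ.
    return "\n".join(f"{turn['role']}: {turn['content']}" for turn in context)


def to_dpo_row(row: dict) -> Optional[dict]:
    """Map one source row to a DPO-style row, or None if the preference is weak."""
    score = row["overall_preference"]

    # Scores -1, 0, and 1 are dropped as too weak.
    if abs(score) < 2:
        return None

    # Negative scores prefer response1; positive scores prefer response2.
    if score < 0:
        chosen, rejected = row["response1"], row["response2"]
    else:
        chosen, rejected = row["response2"], row["response1"]

    return {
        "prompt": render_prompt(row["context"]),
        "chosen": chosen,
        "rejected": rejected,
        # The sign is ignored: 0.5 for a score of +/-2, about 0.333 for +/-3.
        "difficulty": 1 / abs(score),
    }
```

For example, a row with `overall_preference = -3` yields `response1` as `chosen` with a difficulty of about `0.333`, while a row with `overall_preference = 1` is dropped.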

Provider: sumuks



