QuRating-GPT3.5-Judgments
Source: ModelScope (魔搭社区). Updated 2025-12-05; indexed 2025-10-04.
Download link:
https://modelscope.cn/datasets/princeton-nlp/QuRating-GPT3.5-Judgments
Resource description:
*250K pairwise judgments across 4 criteria obtained by prompting GPT-3.5-turbo-0613.*
From the paper: [QuRating: Selecting High-Quality Data for Training Language Models](https://arxiv.org/abs/2402.09739)
**_Guidance on Responsible Use_**
In the paper, we document various types of bias that are present in the quality ratings/QuRater model (biases related to domains, topics, social roles, regions and languages - see Section 6 of the paper),
which are likely reflected in the LLM judgments.
Hence, be aware that data selection with QuRating could have unintended and harmful effects on the language model that is being trained.
We strongly recommend a comprehensive evaluation of the language model for these and other types of bias, particularly before real-world deployment.
We hope that releasing the data/models can facilitate future research aimed at uncovering and mitigating such biases.
#### Dataset columns
* `texts`: A list of two text snippets
* For each criterion (`writing_style`, `facts_and_trivia`, `educational_value`, `required_expertise`) there are three fields:
    * `{criteria}_votes_b`: Vote matrix where the value at indices *(a,b)* denotes the number of votes for the text at index *b*
    * `{criteria}_votes_a`: Vote matrix where the value at indices *(a,b)* denotes the number of votes for the text at index *a*
    * `{criteria}_average`: Averaged vote matrix where the value at indices *(a,b)* corresponds to *p(`text_b` > `text_a`)*. The matrix is normalized so that its sum with its transpose equals 1.0. A value of -100 marks the diagonal and any entry where not enough votes were received because of Azure content filters.
    * For practical purposes:
```
criteria = "educational_value" # for example
text_a, text_b = dataset[index]["texts"]
probability_b_over_a = dataset[index][f"{criteria}_average"][0][1]
```
* `source_domains`: A list of the original RedPajama sets of the text snippets
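The column layout above can be exercised without downloading anything. The following is a minimal sketch, assuming a row is a plain dictionary shaped like the columns described above; the helper names (`prob_b_over_a`, `preferred_text`, `is_normalized`) and the toy row are hypothetical and are not part of the dataset or the paper's code. It shows one way to handle the -100 sentinel before using a probability.

```python
import math
from typing import Optional

# Sentinel described in the column list: diagonal entries and pairs
# that did not receive enough votes.
MISSING = -100


def prob_b_over_a(row: dict, criterion: str) -> Optional[float]:
    """Return p(text_b > text_a) for one row, or None if the entry is missing."""
    value = row[f"{criterion}_average"][0][1]
    return None if value == MISSING else value


def preferred_text(row: dict, criterion: str) -> Optional[str]:
    """Return the snippet favoured by the judgments, or None on a tie or missing entry."""
    p = prob_b_over_a(row, criterion)
    if p is None or p == 0.5:
        return None
    text_a, text_b = row["texts"]
    return text_b if p > 0.5 else text_a


def is_normalized(row: dict, criterion: str) -> bool:
    """Check that the (0,1) and (1,0) entries sum to 1.0, skipping missing entries."""
    matrix = row[f"{criterion}_average"]
    if matrix[0][1] == MISSING or matrix[1][0] == MISSING:
        return False
    return math.isclose(matrix[0][1] + matrix[1][0], 1.0)


# Hypothetical toy row (made-up values, not taken from the dataset).
toy_row = {
    "texts": ["first snippet", "second snippet"],
    "educational_value_average": [[-100, 0.75], [0.25, -100]],
    "writing_style_average": [[-100, -100], [-100, -100]],
}

print(preferred_text(toy_row, "educational_value"))
print(prob_b_over_a(toy_row, "writing_style"))
```

Returning `None` for missing entries keeps filtered pairs from being silently treated as strong preferences for `text_a`, which is what a raw comparison against 0.5 would do with a value of -100.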
<!--
---
dataset_info:
features:
- name: texts
sequence: string
- name: educational_value_votes_a
sequence:
sequence: int64
- name: educational_value_votes_b
sequence:
sequence: int64
- name: educational_value_average
sequence:
sequence: float64
- name: facts_and_trivia_votes_a
sequence:
sequence: int64
- name: facts_and_trivia_votes_b
sequence:
sequence: int64
- name: facts_and_trivia_average
sequence:
sequence: float64
- name: required_expertise_votes_a
sequence:
sequence: int64
- name: required_expertise_votes_b
sequence:
sequence: int64
- name: required_expertise_average
sequence:
sequence: float64
- name: writing_style_votes_a
sequence:
sequence: int64
- name: writing_style_votes_b
sequence:
sequence: int64
- name: writing_style_average
sequence:
sequence: float64
- name: source_domains
sequence: string
splits:
- name: train
num_bytes: 958913973
num_examples: 250000
download_size: 528826656
dataset_size: 958913973
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
-->
Provider:
maas
Created:
2025-08-15
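Collected summary
Dataset introduction

Background and challenges
Background overview
QuRating-GPT3.5-Judgments is a dataset released by princeton-nlp. It contains 250K pairwise judgments over pairs of text snippets, made on four criteria (writing style, facts and trivia, educational value, required expertise) and generated with the GPT-3.5-turbo-0613 model. The dataset is intended to support research on data selection for language model training. It may carry biases, so a comprehensive evaluation is recommended before use to ensure responsible application.
The summary above was collected and generated by 遇见数据集.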



