
QuRating-GPT3.5-Judgments

ModelScope Community. Updated 2025-12-05; indexed 2025-10-04
Download link:
https://modelscope.cn/datasets/princeton-nlp/QuRating-GPT3.5-Judgments
Resource summary:
*250K pairwise judgments across 4 criteria obtained by prompting GPT-3.5-turbo-0613.*

From the paper: [QuRating: Selecting High-Quality Data for Training Language Models](https://arxiv.org/abs/2402.09739)

**_Guidance on Responsible Use_**

In the paper, we document various types of bias that are present in the quality ratings/QuRater model (biases related to domains, topics, social roles, regions and languages; see Section 6 of the paper), which are likely reflected in the LLM judgments. Hence, be aware that data selection with QuRating could have unintended and harmful effects on the language model that is being trained. We strongly recommend a comprehensive evaluation of the language model for these and other types of bias, particularly before real-world deployment. We hope that releasing the data/models can facilitate future research aimed at uncovering and mitigating such biases.

#### Dataset columns

* `texts`: A list of two text snippets
* For each criterion (`writing_style`, `facts_and_trivia`, `educational_value`, `required_expertise`) we have the following fields:
    * `{criteria}_votes_b`: Vote matrix where the value at indices *(a,b)* denotes the number of votes for the text at index *b*
    * `{criteria}_votes_a`: Vote matrix where the value at indices *(a,b)* denotes the number of votes for the text at index *a*
    * `{criteria}_average`: Averaged votes matrix where the value at indices *(a,b)* corresponds to *p(`text_b` > `text_a`)*. We normalize the matrix such that the sum with its transpose is equal to 1.0. Values of -100 appear along the diagonal and wherever we didn't receive enough votes due to Azure content filters.
    * For practical purposes:
    ```
    criteria = "educational_value"  # for example
    text_a, text_b = dataset[index]["texts"]
    probability_b_over_a = dataset[index][f"{criteria}_average"][0][1]
    ```
* `source_domains`: A list of the original RedPajama sets of the text snippets

<!--
---
dataset_info:
  features:
  - name: texts
    sequence: string
  - name: educational_value_votes_a
    sequence:
      sequence: int64
  - name: educational_value_votes_b
    sequence:
      sequence: int64
  - name: educational_value_average
    sequence:
      sequence: float64
  - name: facts_and_trivia_votes_a
    sequence:
      sequence: int64
  - name: facts_and_trivia_votes_b
    sequence:
      sequence: int64
  - name: facts_and_trivia_average
    sequence:
      sequence: float64
  - name: required_expertise_votes_a
    sequence:
      sequence: int64
  - name: required_expertise_votes_b
    sequence:
      sequence: int64
  - name: required_expertise_average
    sequence:
      sequence: float64
  - name: writing_style_votes_a
    sequence:
      sequence: int64
  - name: writing_style_votes_b
    sequence:
      sequence: int64
  - name: writing_style_average
    sequence:
      sequence: float64
  - name: source_domains
    sequence: string
  splits:
  - name: train
    num_bytes: 958913973
    num_examples: 250000
  download_size: 528826656
  dataset_size: 958913973
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
-->
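The column description says that `{criteria}_average` uses -100 as a placeholder and is normalized so that the matrix and its transpose sum to 1.0. The following is a minimal sketch of how a reader might guard against the placeholder and check that property on a single record. The helper names (`preference_probability`, `is_normalized`) and the `example_record` values are hypothetical and made up for illustration; they are not taken from the dataset or from the authors' code.

```python
import math
from typing import Optional

# Sentinel that the dataset card describes for the diagonal and for
# comparisons without enough votes.
MISSING = -100

# Hypothetical record shaped like one dataset row (values invented here).
example_record = {
    "texts": ["first snippet", "second snippet"],
    "educational_value_average": [[-100.0, 0.75], [0.25, -100.0]],
    "writing_style_average": [[-100.0, -100.0], [-100.0, -100.0]],
}


def preference_probability(
    record: dict, criterion: str, a: int = 0, b: int = 1
) -> Optional[float]:
    """Return p(text_b > text_a) for a criterion, or None if it is missing."""
    value = record[f"{criterion}_average"][a][b]
    return None if value == MISSING else value


def is_normalized(record: dict, criterion: str, tol: float = 1e-6) -> bool:
    """Check that entries (0,1) and (1,0) are both present and sum to 1.0."""
    p_ab = preference_probability(record, criterion, 0, 1)
    p_ba = preference_probability(record, criterion, 1, 0)
    if p_ab is None or p_ba is None:
        return False
    return math.isclose(p_ab + p_ba, 1.0, abs_tol=tol)


if __name__ == "__main__":
    print(preference_probability(example_record, "educational_value"))
    print(is_normalized(example_record, "writing_style"))
```

Returning `None` instead of the raw -100 keeps the placeholder from being mistaken for a probability when judgments are aggregated downstream.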

Provider:
maas
Created:
2025-08-15
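Background and challenges
Background overview
QuRating-GPT3.5-Judgments is a dataset released by princeton-nlp containing pairwise judgments for 250K pairs of text snippets. The judgments cover four criteria (writing style, facts and trivia, educational value, required expertise) and were generated with the GPT-3.5-turbo-0613 model. The dataset is intended to support research on data selection for language model training. Because the judgments may carry biases, a comprehensive evaluation is recommended before use to ensure responsible application.
The above content was collected and summarized by 遇见数据集.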