QuRatedPajama-260B
Source: ModelScope community (魔搭社区). Updated 2025-12-05. Indexed 2025-10-04.
Download link:
https://modelscope.cn/datasets/princeton-nlp/QuRatedPajama-260B
Description:
## QuRatedPajama
**Paper:** [QuRating: Selecting High-Quality Data for Training Language Models](https://arxiv.org/pdf/2402.09739.pdf)
A 260B token subset of [cerebras/SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B), annotated by [princeton-nlp/QuRater-1.3B](https://huggingface.co/princeton-nlp/QuRater-1.3B/tree/main) with sequence-level quality ratings across 4 criteria:
- **Educational Value** - e.g. the text includes clear explanations, step-by-step reasoning, or questions and answers
- **Facts & Trivia** - how much factual and trivia knowledge the text contains, where specific facts and obscure trivia are preferred over more common knowledge
- **Writing Style** - how polished and well written the text is
- **Required Expertise** - how much expertise and prerequisite knowledge is necessary to understand the text
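As a rough illustration of how sequence-level ratings like these could drive data selection, the sketch below shows two generic strategies: taking the top-k sequences by a rating, and sampling without replacement with weights proportional to `exp(rating / temperature)` via the Gumbel-top-k trick. This is a minimal sketch, not the authors' pipeline. The field name `educational_value`, the record layout, and both function names are hypothetical, so check the dataset schema for the actual column names before adapting it.

```python
import math
import random


def select_top_k(records, key, k):
    """Return the k records with the highest rating under `key`."""
    return sorted(records, key=lambda r: r[key], reverse=True)[:k]


def sample_by_rating(records, key, k, temperature=1.0, seed=0):
    """Sample k records without replacement, favouring higher ratings.

    Adds independent Gumbel noise to rating / temperature and keeps the
    k largest perturbed scores (the Gumbel-top-k trick). A small
    temperature approaches plain top-k; a large one approaches uniform
    sampling.
    """
    rng = random.Random(seed)

    def perturbed_score(record):
        # Clamp away from 0 so that log(u) is always defined.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        return record[key] / temperature + gumbel_noise

    scored = [(perturbed_score(r), i) for i, r in enumerate(records)]
    scored.sort(reverse=True)
    return [records[i] for _, i in scored[:k]]


# Toy records with a hypothetical rating column.
records = [{"id": i, "educational_value": float(i)} for i in range(10)]
top = select_top_k(records, "educational_value", 3)
sampled = sample_by_rating(records, "educational_value", 4, temperature=2.0, seed=1)
```

The temperature controls the trade-off between quality and diversity: lowering it concentrates the selection on the highest-rated sequences, while raising it keeps more of the original data distribution.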
In a pre-processing step, we split documents into chunks of exactly 1024 tokens. We provide tokenization with the Llama-2 tokenizer in the `input_ids` column.
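The chunking step can be pictured with the following minimal sketch, which splits an already-tokenized document into consecutive fixed-size chunks. It assumes that a trailing remainder shorter than 1024 tokens is dropped; the card does not say how remainders are handled, so treat that as an assumption. The function name `chunk_token_ids` is hypothetical, and plain integers stand in for Llama-2 token IDs.

```python
def chunk_token_ids(token_ids, chunk_size=1024):
    """Split a list of token IDs into consecutive chunks of exactly
    `chunk_size` tokens.

    Assumption: a trailing remainder shorter than `chunk_size` is
    dropped, so every returned chunk has the same length.
    """
    n_full = len(token_ids) // chunk_size
    return [
        token_ids[i * chunk_size:(i + 1) * chunk_size]
        for i in range(n_full)
    ]


# A stand-in "document" of 2500 token IDs yields two full chunks;
# the last 452 tokens are discarded under the assumption above.
document = list(range(2500))
chunks = chunk_token_ids(document)
```

Each resulting chunk corresponds to one row's `input_ids` in this picture, which is why the ratings are described as sequence-level rather than document-level.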
**Guidance on Responsible Use:**
In the paper, we document various types of bias that are present in the quality ratings (biases related to domains, topics, social roles, regions and languages - see Section 6 of the paper).
Hence, be aware that data selection with QuRating could have unintended and harmful effects on the language model that is being trained.
We strongly recommend a comprehensive evaluation of the language model for these and other types of bias, particularly before real-world deployment.
We hope that releasing the data/models can facilitate future research aimed at uncovering and mitigating such biases.
Note that the quality ratings do not measure the social or literary value of a text and should *not* be used for textual or demographic studies.
**Citation:**
```
@article{wettig2024qurating,
title={QuRating: Selecting High-Quality Data for Training Language Models},
author={Alexander Wettig and Aatmik Gupta and Saumya Malik and Danqi Chen},
journal={arXiv preprint 2402.09739},
year={2024}
}
```
Provider:
maas
Created:
2025-08-16



