QuRatedPajama-260B

Name: QuRatedPajama-260B
Creator: maas
Published: 2025-12-05 11:51:06
License: No description available

ModelScope community. Updated 2025-12-05; indexed 2025-10-04.

Download link:

https://modelscope.cn/datasets/princeton-nlp/QuRatedPajama-260B

Resource description:

## QuRatedPajama

**Paper:** [QuRating: Selecting High-Quality Data for Training Language Models](https://arxiv.org/pdf/2402.09739.pdf)

A 260B-token subset of [cerebras/SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B), annotated by [princeton-nlp/QuRater-1.3B](https://huggingface.co/princeton-nlp/QuRater-1.3B/tree/main) with sequence-level quality ratings across 4 criteria:

- **Educational Value** - e.g. the text includes clear explanations, step-by-step reasoning, or questions and answers
- **Facts & Trivia** - how much factual and trivia knowledge the text contains, where specific facts and obscure trivia are preferred over more common knowledge
- **Writing Style** - how polished and well written the text is
- **Required Expertise** - how much expertise and prerequisite knowledge is necessary to understand the text

In a pre-processing step, we split documents into chunks of exactly 1024 tokens. We provide tokenization with the Llama-2 tokenizer in the `input_ids` column.

**Guidance on Responsible Use:**

In the paper, we document various types of bias that are present in the quality ratings (biases related to domains, topics, social roles, regions and languages; see Section 6 of the paper). Hence, be aware that data selection with QuRating could have unintended and harmful effects on the language model that is being trained. We strongly recommend a comprehensive evaluation of the language model for these and other types of bias, particularly before real-world deployment. We hope that releasing the data/models can facilitate future research aimed at uncovering and mitigating such biases. Note that the quality ratings do not measure the social or literary value of a text and should *not* be used for textual or demographic studies.
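The chunking and rating-based selection described above can be illustrated with a minimal sketch. It rests on several assumptions that are not confirmed by the dataset card: that a trailing remainder shorter than 1024 tokens is dropped, that each row carries a numeric rating under a key such as `educational_value` (a hypothetical column name), and that a plain top-fraction cut is an acceptable stand-in for whatever selection procedure the paper actually uses. Only the `input_ids` column name and the chunk length of 1024 come from the description.

```python
from typing import Dict, List

CHUNK_LEN = 1024  # chunk length stated in the dataset description


def split_into_chunks(token_ids: List[int], chunk_len: int = CHUNK_LEN) -> List[List[int]]:
    """Split a token sequence into chunks of exactly `chunk_len` tokens.

    Assumption: a trailing remainder shorter than `chunk_len` is dropped.
    """
    n_full = len(token_ids) // chunk_len
    return [token_ids[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]


def select_top_fraction(rows: List[Dict], score_key: str, fraction: float) -> List[Dict]:
    """Keep the highest-scoring `fraction` of rows according to `score_key`.

    This is simple top-fraction selection, used here only for illustration.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, int(len(rows) * fraction))
    return sorted(rows, key=lambda r: r[score_key], reverse=True)[:keep]


# Toy data: a fake 2500-token document and made-up ratings.
doc = list(range(2500))
chunks = split_into_chunks(doc)

# "educational_value" is a hypothetical column name for one of the 4 criteria.
rows = [
    {"input_ids": chunks[0], "educational_value": 0.3},
    {"input_ids": chunks[1], "educational_value": 1.7},
]
selected = select_top_fraction(rows, "educational_value", 0.5)
print(len(chunks), len(selected))
```

Given the biases noted in the responsible-use guidance, any selection like this is best paired with an evaluation of the resulting model rather than treated as a neutral filter.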
**Citation:**

```
@article{wettig2024qurating,
  title={QuRating: Selecting High-Quality Data for Training Language Models},
  author={Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen},
  journal={arXiv preprint 2402.09739},
  year={2024}
}
```

Provider: maas

Created: 2025-08-16
