
QuRatedPajama-260B

ModelScope community, updated 2025-12-05, indexed 2025-10-04
Download link:
https://modelscope.cn/datasets/princeton-nlp/QuRatedPajama-260B
Description:
## QuRatedPajama

**Paper:** [QuRating: Selecting High-Quality Data for Training Language Models](https://arxiv.org/pdf/2402.09739.pdf)

A 260B token subset of [cerebras/SlimPajama-627B](https://huggingface.co/datasets/cerebras/SlimPajama-627B), annotated by [princeton-nlp/QuRater-1.3B](https://huggingface.co/princeton-nlp/QuRater-1.3B/tree/main) with sequence-level quality ratings across 4 criteria:

- **Educational Value** - e.g. the text includes clear explanations, step-by-step reasoning, or questions and answers
- **Facts & Trivia** - how much factual and trivia knowledge the text contains, where specific facts and obscure trivia are preferred over more common knowledge
- **Writing Style** - how polished and good the writing style of the text is
- **Required Expertise** - how much expertise and prerequisite knowledge is necessary to understand the text

In a pre-processing step, we split documents into chunks of exactly 1024 tokens. We provide tokenization with the Llama-2 tokenizer in the `input_ids` column.

**Guidance on Responsible Use:**

In the paper, we document various types of bias that are present in the quality ratings (biases related to domains, topics, social roles, regions and languages - see Section 6 of the paper). Hence, be aware that data selection with QuRating could have unintended and harmful effects on the language model that is being trained. We strongly recommend a comprehensive evaluation of the language model for these and other types of bias, particularly before real-world deployment. We hope that releasing the data/models can facilitate future research aimed at uncovering and mitigating such biases. Note that the quality ratings do not measure the social or literary value of a text and should *not* be used for textual or demographic studies.
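To make the chunking and rating-based selection described above concrete, here is a minimal sketch in plain Python. It rests on several assumptions that the card does not state: the trailing remainder shorter than 1024 tokens is dropped, the rating column is called `educational_value` (a hypothetical name), and the scores are made up. The top-fraction filter is just one simple way to use the ratings and is not necessarily the selection procedure used in the paper.

```python
from typing import Dict, List

CHUNK_SIZE = 1024  # chunk length stated in the dataset card


def split_into_chunks(input_ids: List[int], chunk_size: int = CHUNK_SIZE) -> List[List[int]]:
    """Split a token-id sequence into chunks of exactly `chunk_size` tokens.

    Assumption: a trailing remainder shorter than `chunk_size` is dropped.
    """
    n_full = len(input_ids) // chunk_size
    return [input_ids[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]


def select_top_fraction(rows: List[Dict], rating_key: str, fraction: float) -> List[Dict]:
    """Keep the highest-rated `fraction` of rows according to `rating_key`."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(rows) * fraction))
    return sorted(rows, key=lambda r: r[rating_key], reverse=True)[:k]


# Toy stand-in for a tokenized document; real ids would come from the Llama-2 tokenizer.
doc = list(range(2500))
chunks = split_into_chunks(doc)

# Hypothetical rating column name and made-up scores, for illustration only.
rows = [
    {"input_ids": chunk, "educational_value": score}
    for chunk, score in zip(chunks, [0.3, 1.7])
]
selected = select_top_fraction(rows, "educational_value", fraction=0.5)
```

Any such filter inherits the biases noted in the responsible-use guidance, so a model trained on the selected subset should be evaluated accordingly.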
**Citation:**

```
@article{wettig2024qurating,
   title={QuRating: Selecting High-Quality Data for Training Language Models},
   author={Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen},
   journal={arXiv preprint 2402.09739},
   year={2024}
}
```

Provider: maas
Created: 2025-08-16