SlimPajama-Meta-rater-Readability-30B

Name: SlimPajama-Meta-rater-Readability-30B
Creator: maas
Published: 2025-12-04 16:56:57
License: No description available

ModelScope community. Updated: 2025-12-04. Indexed: 2025-12-06.

Download link:

https://modelscope.cn/datasets/OpenDataLab/SlimPajama-Meta-rater-Readability-30B


Resource description:

# Top 30B token SlimPajama Subset selected by the Readability rater

This repository contains the dataset described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://huggingface.co/papers/2504.14194).

Code: https://github.com/opendatalab/Meta-rater

## Dataset Description

This dataset contains the top 30B tokens from the SlimPajama-627B corpus, selected using the **Readability** dimension of the PRRC (Professionalism, Readability, Reasoning, Cleanliness) framework. Each document in this subset is scored and filtered by a ModernBERT-based rater fine-tuned to assess the clarity, coherence, and ease of understanding of the text.

- **Source**: SlimPajama-627B Annotated Dataset
- **Selection**: Top 30B tokens by PRRC-Readability score
- **Quality metric**: Readability (0–5 scale, see below)
- **Annotation coverage**: 100% of selected subset

## Dataset Statistics

- **Total tokens**: 30B (subset of SlimPajama-627B)
- **Selection method**: Top-ranked by PRRC-Readability ModernBERT rater
- **Domains**: Same as SlimPajama (CommonCrawl, C4, GitHub, Books, ArXiv, Wikipedia, StackExchange)
- **Annotation**: Each document has a readability score (0–5)

## Readability Quality Metric

**Readability** evaluates the clarity, coherence, and ease of understanding of the text. Higher scores indicate content that is clear, well-structured, and easy to follow, while lower scores reflect text that is difficult to comprehend due to poor structure, grammar, or vocabulary.

- **0–1**: Significant issues with clarity or coherence; difficult to read
- **2–3**: Generally clear but with some sections that are hard to understand
- **4–5**: Very clear, coherent, and easy to read

Scores are assigned by a ModernBERT model fine-tuned on Llama-3.3-70B-Instruct annotations, as described in the Meta-rater paper.

## Annotation Process

- **Initial annotation**: Llama-3.3-70B-Instruct rated 500k+ SlimPajama samples for readability
- **Model training**: ModernBERT fine-tuned on these annotations
- **Scoring**: All SlimPajama documents scored by ModernBERT; top 30B tokens selected

## Citation

If you use this dataset, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

## License

This dataset is released under the same license as the original SlimPajama dataset. See the original SlimPajama repository for details.

## Contact

- **Project Lead**: Ren Ma (maren@pjlab.org.cn)
- **Corresponding Author**: Conghui He (heconghui@pjlab.org.cn)
- **Issues**: [GitHub Issues](https://github.com/opendatalab/Meta-rater/issues)

---

**Made with ❤️ by the OpenDataLab team**
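The selection step described under Annotation Process (score every document, then keep the top-ranked ones up to a token budget) might look roughly like the sketch below. This is an illustration only, not the authors' implementation: the field names `readability_score` and `num_tokens`, the helper `select_top_by_token_budget`, and the rule of stopping at the first document that would exceed the budget are all assumptions made for the example.

```python
from typing import Iterable


def select_top_by_token_budget(docs: Iterable[dict], token_budget: int) -> list[dict]:
    """Keep the highest-scoring documents until the token budget is used up.

    Hypothetical helper: `readability_score` and `num_tokens` are assumed
    field names, not the dataset's documented schema.
    """
    # Rank by score, highest first. Python's sort is stable, so documents
    # with equal scores keep their input order.
    ranked = sorted(docs, key=lambda d: d["readability_score"], reverse=True)

    selected, used = [], 0
    for doc in ranked:
        # Assumption: stop at the first document that would overflow the budget.
        if used + doc["num_tokens"] > token_budget:
            break
        selected.append(doc)
        used += doc["num_tokens"]
    return selected


# Toy corpus standing in for scored SlimPajama documents.
docs = [
    {"id": "a", "readability_score": 4.6, "num_tokens": 300},
    {"id": "b", "readability_score": 1.2, "num_tokens": 200},
    {"id": "c", "readability_score": 3.9, "num_tokens": 500},
    {"id": "d", "readability_score": 4.1, "num_tokens": 400},
]
subset = select_top_by_token_budget(docs, token_budget=800)
print([d["id"] for d in subset])  # ['a', 'd']
```

With a budget of 800 tokens, the two highest-scoring documents (700 tokens in total) are kept and the third-ranked one is rejected because it would exceed the budget. At the scale of the full corpus, a global sort like this would likely be replaced by a score threshold computed from a histogram, but the ranking idea is the same.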


Provider:

maas

Created:

2025-11-26
