SlimPajama-Meta-rater-Readability-30B
ModelScope community · Updated 2025-12-04 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/OpenDataLab/SlimPajama-Meta-rater-Readability-30B
# Top 30B token SlimPajama Subset selected by the Readability rater
This repository contains the dataset described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://huggingface.co/papers/2504.14194).
Code: https://github.com/opendatalab/Meta-rater
## Dataset Description
This dataset contains the top 30B tokens from the SlimPajama-627B corpus, selected using the **Readability** dimension of the PRRC (Professionalism, Readability, Reasoning, Cleanliness) framework. Every document was scored by a ModernBERT-based rater fine-tuned to assess the clarity, coherence, and ease of understanding of the text, and the highest-scoring documents were kept.
- **Source**: SlimPajama-627B Annotated Dataset
- **Selection**: Top 30B tokens by PRRC-Readability score
- **Quality metric**: Readability (0–5 scale, see below)
- **Annotation coverage**: 100% of selected subset
## Dataset Statistics
- **Total tokens**: 30B (subset of SlimPajama-627B)
- **Selection method**: Top-ranked by PRRC-Readability ModernBERT rater
- **Domains**: Same as SlimPajama (CommonCrawl, C4, GitHub, Books, ArXiv, Wikipedia, StackExchange)
- **Annotation**: Each document has a readability score (0–5)
## Readability Quality Metric
**Readability** evaluates the clarity, coherence, and ease of understanding of the text. Higher scores indicate content that is clear, well-structured, and easy to follow, while lower scores reflect text that is difficult to comprehend due to poor structure, grammar, or vocabulary.
- **0–1**: Significant issues with clarity or coherence; difficult to read
- **2–3**: Generally clear but with some sections that are hard to understand
- **4–5**: Very clear, coherent, and easy to read
Scores are assigned by a ModernBERT model fine-tuned on Llama-3.3-70B-Instruct annotations, as described in the Meta-rater paper.
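The score bands above can be expressed as a small helper. This is a minimal sketch, not part of the released tooling: the function name `readability_band` and the labels `low`, `medium`, and `high` are hypothetical, and rounding fractional scores to the nearest integer is an assumption, since the card defines bands only for integer scores.

```python
def readability_band(score: float) -> str:
    """Map a 0-5 readability score to one of the three bands in this card."""
    if not 0 <= score <= 5:
        raise ValueError(f"score must be in [0, 5], got {score}")
    # Assumption: fractional scores are rounded half-up to the nearest integer,
    # because the card only describes bands for integer scores.
    rounded = int(score + 0.5)
    if rounded <= 1:
        return "low"      # 0-1: significant clarity or coherence issues
    if rounded <= 3:
        return "medium"   # 2-3: generally clear, some hard sections
    return "high"         # 4-5: very clear, coherent, easy to read
```

For example, `readability_band(4)` falls in the top band, while `readability_band(1.4)` rounds down into the lowest one.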
## Annotation Process
- **Initial annotation**: Llama-3.3-70B-Instruct rated 500k+ SlimPajama samples for readability
- **Model training**: ModernBERT fine-tuned on these annotations
- **Scoring**: All SlimPajama documents scored by ModernBERT; top 30B tokens selected
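The final selection step can be sketched as a greedy pass over documents sorted by score. This is an illustration under stated assumptions, not the authors' implementation: the `(doc_id, score, token_count)` tuple layout and the name `select_top_tokens` are hypothetical, ties are broken by `doc_id`, and the document that crosses the budget is kept rather than dropped.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (doc_id, readability_score, token_count).
Doc = Tuple[str, float, int]


def select_top_tokens(docs: Iterable[Doc], token_budget: int) -> List[Doc]:
    """Keep the highest-scoring documents until the token budget is reached."""
    selected: List[Doc] = []
    used = 0
    # Highest score first; ties broken by doc_id for determinism (assumption).
    for doc in sorted(docs, key=lambda d: (-d[1], d[0])):
        if used >= token_budget:
            break
        # Assumption: the document that crosses the budget is still included.
        selected.append(doc)
        used += doc[2]
    return selected


docs = [("a", 4.8, 40), ("b", 2.1, 50), ("c", 4.2, 30), ("d", 3.9, 60)]
top = select_top_tokens(docs, token_budget=100)
print([doc_id for doc_id, _, _ in top])  # ['a', 'c', 'd']
```

An in-memory sort is only practical for small samples; at the scale of the full corpus one would more likely derive a score cutoff and filter in a streaming pass.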
## Citation
If you use this dataset, please cite:
```bibtex
@article{zhuang2025meta,
title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
journal={arXiv preprint arXiv:2504.14194},
year={2025}
}
```
## License
This dataset is released under the same license as the original SlimPajama dataset. See the original SlimPajama repository for details.
## Contact
- **Project Lead**: Ren Ma (maren@pjlab.org.cn)
- **Corresponding Author**: Conghui He (heconghui@pjlab.org.cn)
- **Issues**: [GitHub Issues](https://github.com/opendatalab/Meta-rater/issues)
---
**Made with ❤️ by the OpenDataLab team**
Provider: maas
Created: 2025-11-26



