
SlimPajama-Meta-rater-Reasoning-30B

ModelScope community · Updated 2025-12-04 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/OpenDataLab/SlimPajama-Meta-rater-Reasoning-30B
# Top 30B token SlimPajama Subset selected by the Reasoning rater

This repository contains the dataset described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://huggingface.co/papers/2504.14194).

Code: https://github.com/opendatalab/Meta-rater

## Dataset Description

This dataset contains the top 30B tokens from the SlimPajama-627B corpus, selected using the **Reasoning** dimension of the PRRC (Professionalism, Readability, Reasoning, Cleanliness) framework. Each document in this subset is scored and filtered by a ModernBERT-based rater fine-tuned to assess the complexity and depth of logical reasoning required to understand the text.

- **Source**: SlimPajama-627B Annotated Dataset
- **Selection**: Top 30B tokens by PRRC-Reasoning score
- **Quality metric**: Reasoning (0–5 scale, see below)
- **Annotation coverage**: 100% of selected subset

## Dataset Statistics

- **Total tokens**: 30B (subset of SlimPajama-627B)
- **Selection method**: Top-ranked by PRRC-Reasoning ModernBERT rater
- **Domains**: Same as SlimPajama (CommonCrawl, C4, GitHub, Books, ArXiv, Wikipedia, StackExchange)
- **Annotation**: Each document has a reasoning score (0–5)

## Reasoning Quality Metric

**Reasoning** assesses the complexity of logical reasoning and analytical thinking required to understand the text. Higher scores indicate content with multi-step, in-depth, or innovative reasoning, while lower scores reflect simple or superficial logic.

- **0–1**: Minimal or superficial reasoning; little analysis
- **2–3**: Some logical relationships or basic analysis
- **4–5**: High reasoning complexity; multi-step or deep analysis

Scores are assigned by a ModernBERT model fine-tuned on Llama-3.3-70B-Instruct annotations, as described in the Meta-rater paper.
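
The "top 30B tokens by PRRC-Reasoning score" selection described above can be illustrated with a small sketch. This is not the authors' pipeline: the `Doc` record, its field names (`reasoning_score`, `n_tokens`), and the function `select_top_by_token_budget` are hypothetical, and the rule of stopping once the budget is reached (so the last document may overshoot it) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Doc:
    # Hypothetical record layout; the real dataset's field names may differ.
    text: str
    reasoning_score: float  # rater output on the 0-5 scale
    n_tokens: int           # token count under whichever tokenizer is used


def select_top_by_token_budget(docs: Iterable[Doc], token_budget: int) -> List[Doc]:
    """Keep the highest-scoring documents until the token budget is reached.

    Assumption: a document is added whenever the running total is still
    below the budget, so the final total may slightly exceed it.
    """
    # Rank documents from highest to lowest reasoning score.
    ranked = sorted(docs, key=lambda d: d.reasoning_score, reverse=True)

    selected: List[Doc] = []
    used = 0
    for doc in ranked:
        if used >= token_budget:
            break
        selected.append(doc)
        used += doc.n_tokens
    return selected


if __name__ == "__main__":
    corpus = [
        Doc("a", 4.5, 10),
        Doc("b", 1.0, 10),
        Doc("c", 3.2, 10),
        Doc("d", 4.9, 10),
    ]
    picked = select_top_by_token_budget(corpus, token_budget=25)
    print([d.text for d in picked])
```

At corpus scale a full in-memory sort would not be practical; a real pipeline would more likely derive a score threshold from a histogram of scores and filter in a streaming pass, but the ranking logic is the same.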

## Annotation Process

- **Initial annotation**: Llama-3.3-70B-Instruct rated 500k+ SlimPajama samples for reasoning
- **Model training**: ModernBERT fine-tuned on these annotations
- **Scoring**: All SlimPajama documents scored by ModernBERT; top 30B tokens selected

## Citation

If you use this dataset, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

## License

This dataset is released under the same license as the original SlimPajama dataset. See the original SlimPajama repository for details.

## Contact

- **Project Lead**: Ren Ma (maren@pjlab.org.cn)
- **Corresponding Author**: Conghui He (heconghui@pjlab.org.cn)
- **Issues**: [GitHub Issues](https://github.com/opendatalab/Meta-rater/issues)

---

**Made with ❤️ by the OpenDataLab team**

Provider: maas
Created: 2025-11-26