SlimPajama-Meta-rater
Source: ModelScope community (updated 2026-01-06, indexed 2025-11-29)
Download link: https://modelscope.cn/datasets/OpenDataLab/SlimPajama-Meta-rater
# Annotated SlimPajama Dataset
## Dataset Description
This dataset contains the **first fully annotated SlimPajama dataset** with comprehensive quality metrics for data-centric large language model research. The dataset includes approximately **580 billion tokens** from the training set of the original SlimPajama dataset, annotated across **25 different quality dimensions**.
**Note**: This dataset contains only the training set portion of the original SlimPajama dataset, which is why the token count is approximately 580B rather than the full 627B tokens.
## Dataset Statistics
- **Total size**: ~580B tokens from the SlimPajama training set
- **Quality metrics**: 25 dimensions across 3 categories
- **Domains**: 7 domains (CommonCrawl, C4, GitHub, Books, ArXiv, Wikipedia, StackExchange)
- **Annotation coverage**: 100% of the training set
## Quality Metrics
The dataset includes 25 quality scores across three main categories:
### 1. Natural Language Quality Signals (11 metrics)
Rule-based measures from RedPajama indicating text naturalness:
- `rps_doc_frac_no_alph_words`: Fraction of words with no alphabetical characters
- `rps_doc_mean_word_length`: Mean word length after normalization
- `rps_doc_frac_unique_words`: Fraction of unique words (degeneracy measure)
- `rps_doc_unigram_entropy`: Entropy of unigram distribution
- `rps_doc_word_count`: Number of words after normalization
- `rps_lines_ending_with_terminal_punctution_mark`: Fraction of lines ending with a terminal punctuation mark
- `rps_lines_numerical_chars_fraction`: Ratio of numerical to total characters
- `rps_lines_uppercase_letter_fraction`: Ratio of uppercase to total characters
- `rps_doc_num_sentences`: Number of sentences in content
- `rps_doc_frac_chars_top_2gram`: Fraction of characters in the most frequent word 2-gram
- `rps_doc_frac_chars_top_3gram`: Fraction of characters in the most frequent word 3-gram
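As a rough illustration of what a few of these signals measure, the sketch below approximates `rps_doc_word_count`, `rps_doc_mean_word_length`, `rps_doc_frac_unique_words`, and `rps_doc_unigram_entropy`. The lowercasing and whitespace tokenization are assumptions made for illustration; RedPajama's own normalization differs, so the values will not match the dataset's annotations exactly. The helper name `approx_signals` is hypothetical.

```python
import math
from collections import Counter

def approx_signals(text: str) -> dict:
    """Approximate a few RedPajama-style document signals (illustrative only)."""
    # Assumption: lowercase + whitespace split stands in for RedPajama's normalization.
    words = text.lower().split()
    n = len(words)
    if n == 0:
        return {"word_count": 0, "mean_word_length": 0.0,
                "frac_unique_words": 0.0, "unigram_entropy": 0.0}
    counts = Counter(words)
    # Shannon entropy (natural log) of the empirical unigram distribution.
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return {
        "word_count": n,
        "mean_word_length": sum(len(w) for w in words) / n,
        "frac_unique_words": len(counts) / n,
        "unigram_entropy": entropy,
    }

signals = approx_signals("the cat sat on the mat")
```

A degenerate document such as `"a a a a"` gets a unique-word fraction of 0.25 and zero entropy, which is the kind of repetition these signals are meant to flag.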
### 2. Data Importance Scores (3 metrics)
DSIR-based importance weights measuring similarity to high-quality domains:
- `dsir_books`: Importance score relative to Books domain
- `dsir_wiki`: Importance score relative to Wikipedia domain
- `dsir_math`: Importance score relative to AutoMathText domain
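The idea behind such importance scores can be sketched as a log-likelihood ratio between a target-domain model and a raw-data model over hashed n-gram features. The toy below is an illustration under stated assumptions (a small bucket count, add-one smoothing, tiny made-up corpora, and the hypothetical helpers `hashed_counts`, `fit_bucket_probs`, and `log_importance`); it is not the pipeline that produced the `dsir_*` fields.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 4096  # Assumption: small bucket count chosen for illustration.

def hashed_counts(text: str) -> Counter:
    """Count unigrams and bigrams hashed into a fixed number of buckets."""
    words = text.lower().split()
    ngrams = words + [" ".join(pair) for pair in zip(words, words[1:])]
    # zlib.crc32 gives a deterministic hash (Python's built-in hash() is salted).
    return Counter(zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in ngrams)

def fit_bucket_probs(docs: list) -> list:
    """Estimate an add-one-smoothed categorical distribution over hash buckets."""
    total = Counter()
    for doc in docs:
        total.update(hashed_counts(doc))
    denom = sum(total.values()) + NUM_BUCKETS
    return [(total[b] + 1) / denom for b in range(NUM_BUCKETS)]

def log_importance(text: str, target_probs: list, raw_probs: list) -> float:
    """Log importance weight: log p_target(x) - log p_raw(x)."""
    return sum(count * (math.log(target_probs[b]) - math.log(raw_probs[b]))
               for b, count in hashed_counts(text).items())

# Made-up corpora standing in for a target domain and the raw data pool.
target = fit_bucket_probs(["the theorem follows from the lemma", "we prove the lemma"])
raw = fit_bucket_probs(["buy cheap shoes online now", "click here to win prizes"])
score_math = log_importance("the lemma implies the theorem", target, raw)
score_spam = log_importance("click here to buy shoes", target, raw)
```

Text that shares n-grams with the target corpus receives a higher log weight than text that resembles the raw pool, which is the sense in which the `dsir_*` scores measure similarity to a domain.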
### 3. Model-based Quality Ratings (11 metrics)
#### Existing Metrics:
- `fineweb_edu`: Educational value (from FineWeb-Edu) - single value in list format
- `ad_en`: Advertisement detection (from WanjuanCC) - logits for binary classification [label_0, label_1]
- `fluency_en`: Fluency assessment (from WanjuanCC) - logits for binary classification [label_0, label_1]
- `qurater`: QuRating scores as a list [Writing Style, Required Expertise, Facts and Trivia, Educational Value]
#### PRRC Framework (Our Contribution):
- `modernbert_professionalism`: Professionalism logits for 6 levels (0-5 scale) - use argmax() to get rating
- `modernbert_readability`: Readability logits for 6 levels (0-5 scale) - use argmax() to get rating
- `modernbert_reasoning`: Reasoning logits for 6 levels (0-5 scale) - use argmax() to get rating
- `modernbert_cleanliness`: Cleanliness logits for 6 levels (0-5 scale) - use argmax() to get rating
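A minimal sketch of turning one of these logit lists into a rating follows. `prrc_rating` applies the argmax rule described above; `expected_rating` is an optional soft alternative added here as an assumption, not a convention of the dataset. The example logits are made up.

```python
import numpy as np

def softmax(logits) -> np.ndarray:
    """Numerically stable softmax over a 1-D sequence of logits."""
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def prrc_rating(logits) -> int:
    """Hard rating: index of the largest logit (0-5)."""
    return int(np.argmax(logits))

def expected_rating(logits) -> float:
    """Soft rating: probability-weighted mean of the levels (an assumption)."""
    probs = softmax(logits)
    return float(np.dot(probs, np.arange(len(probs))))

example_logits = [-2.0, -1.0, 0.5, 3.0, 1.0, -0.5]  # made-up values
rating = prrc_rating(example_logits)
soft = expected_rating(example_logits)
```

The same `prrc_rating` helper works for the two-element `ad_en` and `fluency_en` lists, where it yields a 0/1 label.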
## PRRC Framework Details
Our **PRRC** framework introduces four novel dimensions for comprehensive data quality assessment:
- **Professionalism**: Measures the degree of expertise and prerequisite knowledge required to comprehend the text
- **Readability**: Evaluates text clarity, coherence, and ease of understanding
- **Reasoning**: Assesses the complexity of logical reasoning and analytical thinking required
- **Cleanliness**: Evaluates text formatting, completeness, and absence of noise/irrelevant content
Each PRRC dimension is rated with a 5-point additive system, so a document can score anywhere from 0 to 5 (hence the six logits per dimension). The rating models achieve F1 scores of 87-92% on their test sets.
## Dataset Structure
The dataset structure for each example:
```python
{
    "id": "unique_document_id",
    "content": "Main text content of the document",
    "sub_path": "domain_name",  # e.g., "arxiv", "github", "wikipedia", etc.

    # Natural Language Quality Signals (RedPajama-style metrics)
    "rps_doc_frac_no_alph_words": float,
    "rps_doc_mean_word_length": float,
    "rps_doc_frac_unique_words": float,
    "rps_doc_unigram_entropy": float,
    "rps_doc_word_count": int,
    "rps_lines_ending_with_terminal_punctution_mark": float,
    "rps_lines_numerical_chars_fraction": float,
    "rps_lines_uppercase_letter_fraction": float,
    "rps_doc_num_sentences": int,
    "rps_doc_frac_chars_top_2gram": float,
    "rps_doc_frac_chars_top_3gram": float,

    # Data Importance Scores (DSIR)
    "dsir_books": float,
    "dsir_wiki": float,
    "dsir_math": float,

    # Model-based Quality Ratings
    "fineweb_edu": [float],  # Single value in list
    "ad_en": [float, float],  # [has_ad_logit, no_ad_logit] - use argmax() to get 0-1 rating
    "fluency_en": [float, float],  # [not_fluent_logit, fluent_logit] - use argmax() to get 0-1 rating
    "qurater": [float, float, float, float],  # [Writing Style, Required Expertise, Facts and Trivia, Educational Value]

    # PRRC Framework (Our Contribution) - all contain 6 logits for levels 0-5
    "modernbert_professionalism": [float, float, float, float, float, float],  # Use argmax() to get 0-5 rating
    "modernbert_readability": [float, float, float, float, float, float],  # Use argmax() to get 0-5 rating
    "modernbert_reasoning": [float, float, float, float, float, float],  # Use argmax() to get 0-5 rating
    "modernbert_cleanliness": [float, float, float, float, float, float]  # Use argmax() to get 0-5 rating
}
```
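A quick sanity check against this layout can catch malformed records before any scoring. The sketch below only verifies the three string fields and the lengths of the list-valued fields documented above; the helper name `check_record` and the sample records are hypothetical.

```python
# Expected number of values in each list-valued field, per the structure above.
EXPECTED_LIST_LENGTHS = {
    "fineweb_edu": 1,
    "ad_en": 2,
    "fluency_en": 2,
    "qurater": 4,
    "modernbert_professionalism": 6,
    "modernbert_readability": 6,
    "modernbert_reasoning": 6,
    "modernbert_cleanliness": 6,
}

def check_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for field in ("id", "content", "sub_path"):
        if not isinstance(record.get(field), str):
            problems.append(f"{field}: expected a string")
    for field, expected_len in EXPECTED_LIST_LENGTHS.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif len(value) != expected_len:
            problems.append(f"{field}: expected {expected_len} values, got {len(value)}")
    return problems

# Made-up records for demonstration.
good = {"id": "doc-0", "content": "text", "sub_path": "arxiv",
        **{name: [0.0] * n for name, n in EXPECTED_LIST_LENGTHS.items()}}
bad = dict(good, qurater=[0.1, 0.2])
```

Here `check_record(good)` returns an empty list, while `check_record(bad)` reports the truncated `qurater` field.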
## Usage
### Loading the Dataset
```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("opendatalab/SlimPajama-627B-Annotated")
# Load a specific split if available
train_dataset = load_dataset("opendatalab/SlimPajama-627B-Annotated", split="train")
```
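At roughly 580B tokens, downloading and materializing the whole split is unlikely to be practical on a single machine. One option is the `streaming=True` mode of `datasets.load_dataset`, which iterates over records without downloading everything first. The sketch below assumes the caller passes the repository ID used above and that `sub_path` holds values such as `"arxiv"`; `take_where`, `is_arxiv`, and `stream_arxiv_sample` are hypothetical helpers.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

def take_where(records: Iterable[dict],
               predicate: Callable[[dict], bool],
               limit: int) -> Iterator[dict]:
    """Lazily yield up to `limit` records that satisfy `predicate`."""
    return islice((r for r in records if predicate(r)), limit)

def is_arxiv(record: dict) -> bool:
    # Assumption: ArXiv documents carry sub_path == "arxiv".
    return record.get("sub_path") == "arxiv"

def stream_arxiv_sample(repo_id: str, limit: int = 100) -> list:
    """Stream the train split and keep the first `limit` ArXiv records."""
    from datasets import load_dataset  # imported lazily; needs network access
    stream = load_dataset(repo_id, split="train", streaming=True)
    return list(take_where(stream, is_arxiv, limit))

# Offline demonstration on in-memory records (made-up data).
demo = [
    {"id": "a", "sub_path": "arxiv"},
    {"id": "b", "sub_path": "github"},
    {"id": "c", "sub_path": "arxiv"},
]
picked = list(take_where(demo, is_arxiv, limit=1))
```

Because `take_where` is lazy, iteration stops as soon as `limit` matching records have been found.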
### Data Processing and Selection Example
```python
import pandas as pd
import numpy as np
from datasets import load_dataset
# Load dataset
dataset = load_dataset("opendatalab/SlimPajama-627B-Annotated", split="train")
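# Convert to pandas for easier manipulation (feasible only for a subset;
# the full ~580B-token split will not fit in memory on a typical machine)
df = dataset.to_pandas()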
# Process PRRC scores (convert logits to ratings using argmax)
df['professionalism_score'] = df['modernbert_professionalism'].apply(lambda x: np.argmax(x))
df['readability_score'] = df['modernbert_readability'].apply(lambda x: np.argmax(x))
df['reasoning_score'] = df['modernbert_reasoning'].apply(lambda x: np.argmax(x))
df['cleanliness_score'] = df['modernbert_cleanliness'].apply(lambda x: np.argmax(x))
# Process binary classification scores
df['advertisement_score'] = df['ad_en'].apply(lambda x: np.argmax(x)) # 0 = has ad, 1 = no ad
df['fluency_score'] = df['fluency_en'].apply(lambda x: np.argmax(x)) # 0 = not fluent, 1 = fluent
# Extract QuRating scores
df['writing_style'] = df['qurater'].apply(lambda x: x[0])
df['required_expertise'] = df['qurater'].apply(lambda x: x[1])
df['facts_trivia'] = df['qurater'].apply(lambda x: x[2])
df['educational_value'] = df['qurater'].apply(lambda x: x[3])
# Extract FineWeb-Edu score
df['fineweb_educational'] = df['fineweb_edu'].apply(lambda x: x[0])
# Example: Multi-dimensional quality score combination (Meta-rater approach)
# Using the learned weights from the Meta-rater paper
weights = {
    'educational_value': 0.0564,  # From qurater[3]
    'rps_doc_frac_no_alph_words': 0.0493,
    'fineweb_educational': 0.0493,
    'rps_lines_uppercase_letter_fraction': 0.0488,
    'facts_trivia': 0.0477,  # From qurater[2]
    'rps_doc_frac_chars_top_3gram': 0.0473,
    'rps_lines_ending_with_terminal_punctution_mark': 0.0473,
    'rps_doc_frac_chars_top_2gram': 0.0471,
    'dsir_wiki': 0.0469,
    'rps_lines_numerical_chars_fraction': 0.0460,
    'rps_doc_num_sentences': 0.0458,
    'dsir_math': 0.0448,
    'reasoning_score': 0.0444,
    'rps_doc_frac_unique_words': 0.0432,
    'rps_doc_word_count': 0.0423,
    'rps_doc_unigram_entropy': 0.0422,
    'dsir_books': 0.0414,
    'professionalism_score': 0.0405,
    'fluency_score': 0.0402,
    'readability_score': 0.0393,
    'required_expertise': 0.0373,  # From qurater[1]
    'advertisement_score': 0.0368,
    'cleanliness_score': 0.0117,
    'rps_doc_mean_word_length': 0.0065,
    'writing_style': 0.0005,  # From qurater[0]
}
# Calculate weighted quality score
quality_score = np.zeros(len(df))
for metric, weight in weights.items():
    # Skip any metric whose column is not present in the DataFrame
    if metric in df.columns:
        quality_score += df[metric].values * weight
# Select top-k samples based on quality score
top_k = 10000
top_k_indices = np.argsort(quality_score)[-top_k:]
selected_data = df.iloc[top_k_indices]
print(f"Selected top {top_k} samples using Meta-rater weights")
```
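The raw metrics live on very different scales (a word count can run into the thousands, while a binary score is 0 or 1), so a weighted sum of raw values tends to be dominated by the largest-scale columns. One common remedy, shown here as an assumption rather than as the Meta-rater paper's exact procedure, is to standardize each column before applying the weights. The helper `weighted_quality` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def weighted_quality(df: pd.DataFrame, weights: dict) -> np.ndarray:
    """Z-score each available metric column, then take the weighted sum."""
    score = np.zeros(len(df), dtype=np.float64)
    for metric, weight in weights.items():
        if metric not in df.columns:
            continue  # skip metrics that are not present
        col = df[metric].to_numpy(dtype=np.float64)
        std = col.std()
        # Constant columns carry no ranking information; treat them as zero.
        z = (col - col.mean()) / std if std > 0 else np.zeros_like(col)
        score += weight * z
    return score

# Made-up rows; the two weights are copied from the example weights.
toy = pd.DataFrame({
    "rps_doc_word_count": [100, 5000, 300],
    "fluency_score": [1, 0, 1],
})
toy_weights = {"rps_doc_word_count": 0.0423, "fluency_score": 0.0402}
toy_scores = weighted_quality(toy, toy_weights)
```

After standardization both columns contribute on a comparable scale, so the learned weights, rather than the units of each metric, determine the ranking.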
## Applications
This annotated dataset enables:
1. **Data-Centric LLM Research**: Study the impact of different quality dimensions on model performance
2. **Multi-dimensional Data Selection**: Implement sophisticated data selection strategies beyond single-metric approaches
3. **Quality Score Analysis**: Analyze correlations and relationships between different quality metrics
4. **Benchmark Development**: Create standardized benchmarks for data quality assessment
5. **Efficient Pre-training**: Select high-quality subsets for more efficient model training
6. **Domain-specific Analysis**: Compare quality distributions across different domains (ArXiv, GitHub, Wikipedia, etc.)
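For the domain-specific analysis in item 6, a per-domain summary can be sketched with a pandas group-by on `sub_path`. The helper `domain_summary` and the toy rows are hypothetical, and `reasoning_score` is assumed to be the argmax-derived column from the processing example.

```python
import pandas as pd

def domain_summary(df: pd.DataFrame, metrics: list) -> pd.DataFrame:
    """Per-domain mean of the selected metrics, plus a document count."""
    summary = df.groupby("sub_path")[metrics].mean()
    summary["n_docs"] = df.groupby("sub_path").size()
    return summary

# Made-up rows for demonstration.
toy = pd.DataFrame({
    "sub_path": ["arxiv", "arxiv", "github"],
    "reasoning_score": [4, 2, 1],
    "dsir_math": [-1.0, -3.0, -5.0],
})
summary = domain_summary(toy, ["reasoning_score", "dsir_math"])
```

The resulting frame has one row per domain, which makes it straightforward to compare, say, the mean reasoning rating of ArXiv against GitHub.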
## Annotation Process
The quality scores were generated using:
- **Rule-based metrics**: Extracted using established heuristics from RedPajama and DSIR
- **Existing model-based ratings**: Applied pre-trained classifiers from FineWeb-Edu, WanjuanCC, and QuRating
- **PRRC ratings**: Generated using Llama-3.3-70B-Instruct for annotation, followed by fine-tuned ModernBERT models for efficient scoring
## 📚 Citation
If you use Meta-rater in your research, please cite our paper:
```bibtex
@article{zhuang2025meta,
title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
journal={arXiv preprint arXiv:2504.14194},
year={2025}
}
```
## 📄 License
This dataset is released under the same license as the original SlimPajama dataset. Please refer to the original SlimPajama repository for licensing details.
## 🤝 Acknowledgments
This work builds upon:
- **SlimPajama**: The original dataset from Cerebras
- **RedPajama**: Natural language quality signals
- **DSIR**: Data importance scoring methodology
- **FineWeb-Edu**: Educational value assessment
- **WanjuanCC**: Advertisement and fluency detection
- **QuRating**: Multi-dimensional quality rating framework
## 📞 Contact
- **Project Lead**: Ren Ma (maren@pjlab.org.cn)
- **Corresponding Author**: Conghui He (heconghui@pjlab.org.cn)
- **Issues**: Please use [GitHub Issues](https://github.com/opendatalab/Meta-rater/issues) for questions.
---
<div align="center">
**⭐ Star us on GitHub and HuggingFace if you find Meta-rater useful! ⭐**
Made with ❤️ by the OpenDataLab team
</div>
Provider: maas
Created: 2025-11-26



