Meta-rater-PRRC-Rater-dataset
Source: ModelScope community. Updated: 2025-12-04. Indexed: 2025-12-06.
Download link:
https://modelscope.cn/datasets/OpenDataLab/Meta-rater-PRRC-Rater-dataset
Resource description:
# PRRC Rater Training and Evaluation Dataset
## Dataset Description
This dataset contains the full training and evaluation data for the PRRC rater models described in [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://arxiv.org/abs/2504.14194). It is designed for training and benchmarking models that score text along four key quality dimensions: **Professionalism, Readability, Reasoning, and Cleanliness**.
- **Source**: Subset of SlimPajama-627B, annotated for PRRC dimensions
- **Purpose**: Supervised training and evaluation of PRRC raters (ModernBERT models)
- **Annotation**: Each sample is labeled by Llama-3.3-70B-Instruct and/or human annotators, then used to fine-tune and benchmark PRRC raters
## Dataset Statistics
- **Total samples**: ~1M (split into train/dev/test)
- **Quality metrics**: 4 PRRC dimensions (Professionalism, Readability, Reasoning, Cleanliness)
- **Domains**: Diverse (CommonCrawl, C4, GitHub, Books, ArXiv, Wikipedia, StackExchange)
- **Annotation coverage**: 100% of included samples
## PRRC Quality Dimensions
- **Professionalism**: Degree of expertise and prerequisite knowledge required
- **Readability**: Clarity, coherence, and ease of understanding
- **Reasoning**: Complexity of logical reasoning and analytical thinking
- **Cleanliness**: Formatting, completeness, and absence of noise/irrelevant content
Each dimension is rated on an integer 0–5 scale. The detailed rating criteria used in the annotation prompts are in the [prompts/](./prompts/) directory of the GitHub repository (https://github.com/opendatalab/Meta-rater).
## Dataset Structure
Each example in the dataset has the following structure:
```python
{
    "id": "unique_document_id",
    "content": "Main text content of the document",
    "source": "domain_name",   # e.g., "arxiv", "github", "wikipedia", etc.
    "professionalism": int,    # 0-5
    "readability": int,        # 0-5
    "reasoning": int,          # 0-5
    "cleanliness": int         # 0-5
}
```
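As a quick sanity check on the structure above, the following minimal sketch validates one record against the documented fields and the 0–5 score range. The helper name `validate_example` and the sample values are hypothetical and only serve as illustration; they are not part of the dataset or its tooling.

```python
# Field names follow the structure documented above.
PRRC_DIMENSIONS = ("professionalism", "readability", "reasoning", "cleanliness")


def validate_example(example):
    """Return a list of problems found in one record (empty if it looks valid)."""
    problems = []
    for key in ("id", "content", "source"):
        if not isinstance(example.get(key), str):
            problems.append(f"{key!r} should be a string")
    for dim in PRRC_DIMENSIONS:
        score = example.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(score, int) or isinstance(score, bool) or not 0 <= score <= 5:
            problems.append(f"{dim!r} should be an integer in [0, 5]")
    return problems


# Made-up record for illustration only; not taken from the dataset.
sample = {
    "id": "doc-0001",
    "content": "Example text.",
    "source": "wikipedia",
    "professionalism": 2,
    "readability": 4,
    "reasoning": 1,
    "cleanliness": 5,
}
print(validate_example(sample))
```

A check like this can be run over a few records after loading to confirm that the downloaded copy matches the documented schema.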
## Usage
### Loading the Dataset
```python
from datasets import load_dataset

# Load the full PRRC rater dataset
dataset = load_dataset("opendatalab/Meta-rater-PRRC-Rater-dataset")

# Access splits
train = dataset["train"]
dev = dataset["validation"]
test = dataset["test"]
```
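Once a split is loaded, a natural first step is to inspect how the scores are distributed on each dimension. The sketch below assumes only that each record is a dict-like object with the documented fields, so it works on a plain list of dicts and should also work when iterating a loaded split. The function name `score_distribution` and the toy records are hypothetical.

```python
from collections import Counter

PRRC_DIMENSIONS = ("professionalism", "readability", "reasoning", "cleanliness")


def score_distribution(examples, dimension):
    """Count how many examples received each 0-5 score on one dimension."""
    counts = Counter(example[dimension] for example in examples)
    # Report every score from 0 to 5, including scores that never occur.
    return {score: counts.get(score, 0) for score in range(6)}


# Toy records for illustration only; real records come from the loaded splits.
toy = [
    {"professionalism": 1, "readability": 4, "reasoning": 0, "cleanliness": 5},
    {"professionalism": 3, "readability": 4, "reasoning": 2, "cleanliness": 4},
    {"professionalism": 3, "readability": 2, "reasoning": 2, "cleanliness": 5},
]

for dim in PRRC_DIMENSIONS:
    print(dim, score_distribution(toy, dim))
```

With the real data, `toy` would be replaced by a split such as `train`. Because the full dataset has roughly one million samples, it may be worth running this on a subset first.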
## Applications
- **Supervised training** of PRRC rater models (e.g., ModernBERT)
- **Benchmarking** and evaluation of text quality raters
- **Prompt engineering** and ablation studies for quality annotation
- **Data-centric LLM research**: Understanding the impact of different quality dimensions
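For the benchmarking use case, a rater's predicted scores can be compared against the annotated labels on the test split, one dimension at a time. This dataset card does not state which metrics the authors report, so the sketch below assumes accuracy and macro-averaged F1 as common choices for 6-class classification. The function names and the toy labels are hypothetical.

```python
def accuracy(gold, predicted):
    """Fraction of examples where the predicted score equals the gold score."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def macro_f1(gold, predicted, labels=range(6)):
    """Unweighted mean of per-class F1 over the classes that actually occur."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        if tp + fp + fn == 0:
            continue  # class absent from both gold and predictions
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0


# Toy labels for one dimension; real values would come from the test split
# and from a trained rater's predictions.
gold_scores = [0, 1, 2, 2, 5]
predicted_scores = [0, 1, 2, 3, 5]
print(accuracy(gold_scores, predicted_scores))
print(macro_f1(gold_scores, predicted_scores))
```

Skipping classes that appear in neither the labels nor the predictions is a design choice made here for simplicity. Other conventions exist, so reported numbers should state which one was used.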
## Annotation Process
- **Initial annotation**: Llama-3.3-70B-Instruct (and/or human) rates each sample for all four PRRC dimensions using detailed prompts
- **Quality control**: Manual review and cleaning
- **Splitting**: Data is split into train/dev/test for robust evaluation
## Citation
If you use this dataset, please cite:
```bibtex
@article{zhuang2025meta,
title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
journal={arXiv preprint arXiv:2504.14194},
year={2025}
}
```
## License
This dataset is released under the same license as the original SlimPajama dataset. Please refer to the original SlimPajama repository for licensing details.
## Contact
- **Project Lead**: Ren Ma (maren@pjlab.org.cn)
- **Corresponding Author**: Conghui He (heconghui@pjlab.org.cn)
- **Issues**: [GitHub Issues](https://github.com/opendatalab/Meta-rater/issues)
---
**Made with ❤️ by the OpenDataLab team**
Provider: maas
Created: 2025-11-26



