
Meta-rater-PRRC-Rater-dataset

Source: ModelScope community. Updated 2025-12-04; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/OpenDataLab/Meta-rater-PRRC-Rater-dataset
Description:
# PRRC Rater Training and Evaluation Dataset

## Dataset Description

This dataset contains the full training and evaluation data for the PRRC rater models described in [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://arxiv.org/abs/2504.14194). It is designed for training and benchmarking models that score text along four key quality dimensions: **Professionalism, Readability, Reasoning, and Cleanliness**.

- **Source**: Subset of SlimPajama-627B, annotated for PRRC dimensions
- **Purpose**: Supervised training and evaluation of PRRC raters (ModernBERT models)
- **Annotation**: Each sample is labeled by Llama-3.3-70B-Instruct and/or human annotators, then used to fine-tune and benchmark PRRC raters

## Dataset Statistics

- **Total samples**: ~1M (split into train/dev/test)
- **Quality metrics**: 4 PRRC dimensions (Professionalism, Readability, Reasoning, Cleanliness)
- **Domains**: Diverse (CommonCrawl, C4, GitHub, Books, ArXiv, Wikipedia, StackExchange)
- **Annotation coverage**: 100% of included samples

## PRRC Quality Dimensions

- **Professionalism**: Degree of expertise and prerequisite knowledge required
- **Readability**: Clarity, coherence, and ease of understanding
- **Reasoning**: Complexity of logical reasoning and analytical thinking
- **Cleanliness**: Formatting, completeness, and absence of noise/irrelevant content

Each dimension is rated on a 0–5 scale, with detailed prompt criteria provided in the [prompts/](./prompts/) directory of the GitHub repo.

## Dataset Structure

Each example in the dataset has the following structure:

```python
{
    "id": "unique_document_id",
    "content": "Main text content of the document",
    "source": "domain_name",  # e.g., "arxiv", "github", "wikipedia", etc.
    "professionalism": int,  # 0-5
    "readability": int,  # 0-5
    "reasoning": int,  # 0-5
    "cleanliness": int  # 0-5
}
```

## Usage

### Loading the Dataset

```python
from datasets import load_dataset

# Load the full PRRC rater dataset
dataset = load_dataset("opendatalab/Meta-rater-PRRC-Rater-dataset")

# Access splits
train = dataset["train"]
dev = dataset["validation"]
test = dataset["test"]
```

## Applications

- **Supervised training** of PRRC rater models (e.g., ModernBERT)
- **Benchmarking** and evaluation of text quality raters
- **Prompt engineering** and ablation studies for quality annotation
- **Data-centric LLM research**: Understanding the impact of different quality dimensions

## Annotation Process

- **Initial annotation**: Llama-3.3-70B-Instruct (and/or human) rates each sample for all four PRRC dimensions using detailed prompts
- **Quality control**: Manual review and cleaning
- **Splitting**: Data is split into train/dev/test for robust evaluation

## Citation

If you use this dataset, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

## License

This dataset is released under the same license as the original SlimPajama dataset. Please refer to the original SlimPajama repository for licensing details.

## Contact

- **Project Lead**: Ren Ma (maren@pjlab.org.cn)
- **Corresponding Author**: Conghui He (heconghui@pjlab.org.cn)
- **Issues**: [GitHub Issues](https://github.com/opendatalab/Meta-rater/issues)

---

**Made with ❤️ by the OpenDataLab team**
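As a supplement to the record structure and the 0-5 score range described in the card, the following is a minimal sketch of how records might be validated and filtered by per-dimension score thresholds. The helper names (`is_valid_record`, `filter_by_min_scores`) and the sample records are hypothetical and only illustrate the documented schema; they are not part of the dataset or of the Meta-rater codebase.

```python
from typing import Any, Dict, Iterable, List

# The four score fields named in the dataset card.
PRRC_FIELDS = ("professionalism", "readability", "reasoning", "cleanliness")


def is_valid_record(record: Dict[str, Any]) -> bool:
    """Return True if the record has the documented fields and integer scores in 0-5."""
    for key in ("id", "content", "source"):
        if not isinstance(record.get(key), str):
            return False
    for field in PRRC_FIELDS:
        score = record.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int):
            return False
        if not 0 <= score <= 5:
            return False
    return True


def filter_by_min_scores(
    records: Iterable[Dict[str, Any]], thresholds: Dict[str, int]
) -> List[Dict[str, Any]]:
    """Keep valid records whose scores meet every per-dimension minimum."""
    unknown = set(thresholds) - set(PRRC_FIELDS)
    if unknown:
        raise ValueError(f"Unknown PRRC dimension(s): {sorted(unknown)}")
    return [
        r
        for r in records
        if is_valid_record(r)
        and all(r[field] >= minimum for field, minimum in thresholds.items())
    ]


# Hypothetical records for illustration only; not taken from the dataset.
samples = [
    {"id": "doc-1", "content": "A proof sketch.", "source": "arxiv",
     "professionalism": 3, "readability": 4, "reasoning": 4, "cleanliness": 5},
    {"id": "doc-2", "content": "Buy now!!!", "source": "commoncrawl",
     "professionalism": 0, "readability": 2, "reasoning": 1, "cleanliness": 5},
    {"id": "doc-3", "content": "Out-of-range score.", "source": "github",
     "professionalism": 2, "readability": 3, "reasoning": 3, "cleanliness": 7},
]

kept = filter_by_min_scores(samples, {"reasoning": 3, "cleanliness": 4})
print([r["id"] for r in kept])  # prints ['doc-1']
```

The same predicate could in principle be passed to a filtering step over a loaded split, but the thresholds shown here are arbitrary and would need to be chosen for the task at hand.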

Provider: maas
Created: 2025-11-26