
SlimPajama-Meta-rater

ModelScope community, updated 2026-01-06, indexed 2025-11-29
Download link:
https://modelscope.cn/datasets/OpenDataLab/SlimPajama-Meta-rater
Resource description:
# Annotated SlimPajama Dataset

## Dataset Description

This dataset contains the **first fully annotated SlimPajama dataset**, with comprehensive quality metrics for data-centric large language model research. It includes approximately **580 billion tokens** from the training set of the original SlimPajama dataset, annotated across **25 different quality dimensions**.

**Note**: This dataset contains only the training set portion of the original SlimPajama dataset, which is why the token count is approximately 580B rather than the full 627B tokens.

## Dataset Statistics

- **Total size**: ~580B tokens from the SlimPajama training set
- **Quality metrics**: 25 dimensions across 3 categories
- **Domains**: 7 domains (CommonCrawl, C4, GitHub, Books, ArXiv, Wikipedia, StackExchange)
- **Annotation coverage**: 100% of the training set

## Quality Metrics

The dataset includes 25 quality scores across three main categories:

### 1. Natural Language Quality Signals (11 metrics)

Rule-based measures from RedPajama indicating text naturalness:

- `rps_doc_frac_no_alph_words`: Fraction of words with no alphabetical characters
- `rps_doc_mean_word_length`: Mean word length after normalization
- `rps_doc_frac_unique_words`: Fraction of unique words (degeneracy measure)
- `rps_doc_unigram_entropy`: Entropy of the unigram distribution
- `rps_doc_word_count`: Number of words after normalization
- `rps_lines_ending_with_terminal_punctution_mark`: Lines ending with terminal punctuation (the field name is spelled this way in the data)
- `rps_lines_numerical_chars_fraction`: Ratio of numerical to total characters
- `rps_lines_uppercase_letter_fraction`: Ratio of uppercase to total characters
- `rps_doc_num_sentences`: Number of sentences in the content
- `rps_doc_frac_chars_top_2gram`: Fraction of characters in the most frequent word 2-gram
- `rps_doc_frac_chars_top_3gram`: Fraction of characters in the most frequent word 3-gram

### 2. Data Importance Scores (3 metrics)

DSIR-based importance weights measuring similarity to high-quality domains:

- `dsir_books`: Importance score relative to the Books domain
- `dsir_wiki`: Importance score relative to the Wikipedia domain
- `dsir_math`: Importance score relative to the AutoMathText domain

### 3. Model-based Quality Ratings (11 metrics)

#### Existing Metrics:

- `fineweb_edu`: Educational value (from FineWeb-Edu) - single value in list format
- `ad_en`: Advertisement detection (from WanjuanCC) - logits for binary classification [label_0, label_1]
- `fluency_en`: Fluency assessment (from WanjuanCC) - logits for binary classification [label_0, label_1]
- `qurater`: QuRating scores as a list [Writing Style, Required Expertise, Facts and Trivia, Educational Value]

#### PRRC Framework (Our Contribution):

- `modernbert_professionalism`: Professionalism logits for 6 levels (0-5 scale) - use argmax() to get rating
- `modernbert_readability`: Readability logits for 6 levels (0-5 scale) - use argmax() to get rating
- `modernbert_reasoning`: Reasoning logits for 6 levels (0-5 scale) - use argmax() to get rating
- `modernbert_cleanliness`: Cleanliness logits for 6 levels (0-5 scale) - use argmax() to get rating

## PRRC Framework Details

Our **PRRC** framework introduces four novel dimensions for comprehensive data quality assessment:

- **Professionalism**: Measures the degree of expertise and prerequisite knowledge required to comprehend the text
- **Readability**: Evaluates text clarity, coherence, and ease of understanding
- **Reasoning**: Assesses the complexity of logical reasoning and analytical thinking required
- **Cleanliness**: Evaluates text formatting, completeness, and absence of noise/irrelevant content

Each PRRC dimension uses a 5-point additive rating system (scores range from 0 to 5, hence 6 logits per dimension), with the fine-tuned ModernBERT rating models achieving F1 scores of 87-92% on test sets.
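The conversion from stored logits to ratings described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper names `softmax` and `logits_to_rating` and the example logit values are assumptions made for illustration, not part of the dataset or of the Meta-rater code.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_rating(logits):
    """Return the index of the largest logit, i.e. the argmax rating."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical logits for one document (made-up values, not real data).
example_professionalism = [-2.1, -0.4, 0.3, 2.7, 1.1, -1.5]  # 6 levels: 0-5
example_ad_en = [-1.2, 3.4]  # [label_0, label_1]

rating = logits_to_rating(example_professionalism)    # level on the 0-5 scale
confidence = softmax(example_professionalism)[rating]  # probability of that level
ad_label = logits_to_rating(example_ad_en)             # 0 or 1
```

The softmax step is optional: argmax over logits and argmax over probabilities select the same level, but probabilities can be useful if one wants to filter on how confident the rating model was.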
## Dataset Structure

The dataset structure for each example:

```python
{
    "id": "unique_document_id",
    "content": "Main text content of the document",
    "sub_path": "domain_name",  # e.g., "arxiv", "github", "wikipedia", etc.

    # Natural Language Quality Signals (RedPajama-style metrics)
    "rps_doc_frac_no_alph_words": float,
    "rps_doc_mean_word_length": float,
    "rps_doc_frac_unique_words": float,
    "rps_doc_unigram_entropy": float,
    "rps_doc_word_count": int,
    "rps_lines_ending_with_terminal_punctution_mark": float,
    "rps_lines_numerical_chars_fraction": float,
    "rps_lines_uppercase_letter_fraction": float,
    "rps_doc_num_sentences": int,
    "rps_doc_frac_chars_top_2gram": float,
    "rps_doc_frac_chars_top_3gram": float,

    # Data Importance Scores (DSIR)
    "dsir_books": float,
    "dsir_wiki": float,
    "dsir_math": float,

    # Model-based Quality Ratings
    "fineweb_edu": [float],  # Single value in list
    "ad_en": [float, float],  # [has_ad_logit, no_ad_logit] - use argmax() to get 0-1 rating
    "fluency_en": [float, float],  # [not_fluent_logit, fluent_logit] - use argmax() to get 0-1 rating
    "qurater": [float, float, float, float],  # [Writing Style, Required Expertise, Facts and Trivia, Educational Value]

    # PRRC Framework (Our Contribution) - all contain 6 logits for levels 0-5
    "modernbert_professionalism": [float, float, float, float, float, float],  # Use argmax() to get 0-5 rating
    "modernbert_readability": [float, float, float, float, float, float],  # Use argmax() to get 0-5 rating
    "modernbert_reasoning": [float, float, float, float, float, float],  # Use argmax() to get 0-5 rating
    "modernbert_cleanliness": [float, float, float, float, float, float]  # Use argmax() to get 0-5 rating
}
```

## Usage

### Loading the Dataset

```python
from datasets import load_dataset

# Repository ID as given in the original card; the listing URL above uses the
# name SlimPajama-Meta-rater.
# Load the full dataset
dataset = load_dataset("opendatalab/SlimPajama-627B-Annotated")

# Load a specific split if available
train_dataset = load_dataset("opendatalab/SlimPajama-627B-Annotated", split="train")
```

### Data Processing and Selection Example

```python
import pandas as pd
import numpy as np
from datasets import load_dataset

# Load dataset
dataset = load_dataset("opendatalab/SlimPajama-627B-Annotated", split="train")

# Convert to pandas for easier manipulation
df = dataset.to_pandas()

# Process PRRC scores (convert logits to ratings using argmax)
df['professionalism_score'] = df['modernbert_professionalism'].apply(lambda x: np.argmax(x))
df['readability_score'] = df['modernbert_readability'].apply(lambda x: np.argmax(x))
df['reasoning_score'] = df['modernbert_reasoning'].apply(lambda x: np.argmax(x))
df['cleanliness_score'] = df['modernbert_cleanliness'].apply(lambda x: np.argmax(x))

# Process binary classification scores
df['advertisement_score'] = df['ad_en'].apply(lambda x: np.argmax(x))  # 0 = has ad, 1 = no ad
df['fluency_score'] = df['fluency_en'].apply(lambda x: np.argmax(x))  # 0 = not fluent, 1 = fluent

# Extract QuRating scores
df['writing_style'] = df['qurater'].apply(lambda x: x[0])
df['required_expertise'] = df['qurater'].apply(lambda x: x[1])
df['facts_trivia'] = df['qurater'].apply(lambda x: x[2])
df['educational_value'] = df['qurater'].apply(lambda x: x[3])

# Extract FineWeb-Edu score
df['fineweb_educational'] = df['fineweb_edu'].apply(lambda x: x[0])

# Example: Multi-dimensional quality score combination (Meta-rater approach)
# Using the learned weights from the Meta-rater paper
weights = {
    'educational_value': 0.0564,  # From qurater[3]
    'rps_doc_frac_no_alph_words': 0.0493,
    'fineweb_educational': 0.0493,
    'rps_lines_uppercase_letter_fraction': 0.0488,
    'facts_trivia': 0.0477,  # From qurater[2]
    'rps_doc_frac_chars_top_3gram': 0.0473,
    'rps_lines_ending_with_terminal_punctution_mark': 0.0473,
    'rps_doc_frac_chars_top_2gram': 0.0471,
    'dsir_wiki': 0.0469,
    'rps_lines_numerical_chars_fraction': 0.0460,
    'rps_doc_num_sentences': 0.0458,
    'dsir_math': 0.0448,
    'reasoning_score': 0.0444,
    'rps_doc_frac_unique_words': 0.0432,
    'rps_doc_word_count': 0.0423,
    'rps_doc_unigram_entropy': 0.0422,
    'dsir_books': 0.0414,
    'professionalism_score': 0.0405,
    'fluency_score': 0.0402,
    'readability_score': 0.0393,
    'required_expertise': 0.0373,  # From qurater[1]
    'advertisement_score': 0.0368,
    'cleanliness_score': 0.0117,
    'rps_doc_mean_word_length': 0.0065,
    'writing_style': 0.0005,  # From qurater[0]
}

# Calculate weighted quality score
quality_score = np.zeros(len(df))
for metric, weight in weights.items():
    if metric in df.columns:
        quality_score += df[metric].values * weight

# Select top-k samples based on quality score
top_k = 10000
top_k_indices = np.argsort(quality_score)[-top_k:]
selected_data = df.iloc[top_k_indices]

print(f"Selected top {top_k} samples using Meta-rater weights")
```

## Applications

This annotated dataset enables:

1. **Data-Centric LLM Research**: Study the impact of different quality dimensions on model performance
2. **Multi-dimensional Data Selection**: Implement sophisticated data selection strategies beyond single-metric approaches
3. **Quality Score Analysis**: Analyze correlations and relationships between different quality metrics
4. **Benchmark Development**: Create standardized benchmarks for data quality assessment
5. **Efficient Pre-training**: Select high-quality subsets for more efficient model training
6. **Domain-specific Analysis**: Compare quality distributions across different domains (ArXiv, GitHub, Wikipedia, etc.)
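For subset selection (application 5), converting the whole training split to pandas, as the usage example does, may not be practical at this scale. The sketch below keeps only the top-k records while making a single pass over any iterable of records, using the standard library only. The function names, the two-metric weight dictionary, and the toy records are assumptions for illustration; the score is the same plain weighted sum as in the usage example.

```python
import heapq


def weighted_score(record, weights):
    """Plain weighted sum over whichever weighted metrics the record contains."""
    return sum(w * record[name] for name, w in weights.items() if name in record)


def select_top_k(records, weights, k):
    """Keep the k highest-scoring records in one pass using a min-heap."""
    if k <= 0:
        return []
    heap = []  # entries are (score, arrival_index, record)
    for index, record in enumerate(records):
        entry = (weighted_score(record, weights), index, record)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # New record beats the current weakest kept record: swap it in.
            heapq.heapreplace(heap, entry)
    # Return records ordered from highest to lowest score.
    return [record for _, _, record in sorted(heap, reverse=True)]


# Toy data (hypothetical values) with two of the weights from the usage example.
toy_weights = {"dsir_wiki": 0.0469, "reasoning_score": 0.0444}
toy_records = [
    {"id": "a", "dsir_wiki": 1.0, "reasoning_score": 2},
    {"id": "b", "dsir_wiki": 3.0, "reasoning_score": 5},
    {"id": "c", "dsir_wiki": 0.5, "reasoning_score": 0},
    {"id": "d", "dsir_wiki": 2.0, "reasoning_score": 4},
]
top_two = select_top_k(toy_records, toy_weights, k=2)
```

Memory use grows with k rather than with the dataset size. With the Hugging Face `datasets` library, passing `streaming=True` to `load_dataset` returns an iterable that could be fed to `select_top_k`, assuming the hosted files support streaming; derived fields such as `reasoning_score` would still need to be computed per record first.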
## Annotation Process

The quality scores were generated using:

- **Rule-based metrics**: Extracted using established heuristics from RedPajama and DSIR
- **Existing model-based ratings**: Applied pre-trained classifiers from FineWeb-Edu, WanjuanCC, and QuRating
- **PRRC ratings**: Generated using Llama-3.3-70B-Instruct for annotation, followed by fine-tuned ModernBERT models for efficient scoring

## 📚 Citation

If you use Meta-rater in your research, please cite our paper:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

## 📄 License

This dataset is released under the same license as the original SlimPajama dataset. Please refer to the original SlimPajama repository for licensing details.

## 🤝 Acknowledgments

This work builds upon:

- **SlimPajama**: The original dataset from Cerebras
- **RedPajama**: Natural language quality signals
- **DSIR**: Data importance scoring methodology
- **FineWeb-Edu**: Educational value assessment
- **WanjuanCC**: Advertisement and fluency detection
- **QuRating**: Multi-dimensional quality rating framework

## 📞 Contact

- **Project Lead**: Ren Ma (maren@pjlab.org.cn)
- **Corresponding Author**: Conghui He (heconghui@pjlab.org.cn)
- **Issues**: Please use [GitHub Issues](https://github.com/opendatalab/Meta-rater/issues) for questions.

---

<div align="center">

**⭐ Star us on GitHub and HuggingFace if you find Meta-rater useful! ⭐**

Made with ❤️ by the OpenDataLab team

</div>

Provider:
maas
Created:
2025-11-26