
mmBERT-midtraining-data

ModelScope community; updated 2025-09-28, indexed 2025-09-20
Download link:
https://modelscope.cn/datasets/jhu-clsp/mmBERT-midtraining-data
# mmBERT Mid-training Data

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Paper](https://img.shields.io/badge/Paper-Arxiv-red)](https://arxiv.org/abs/2509.06888)
[![Models](https://img.shields.io/badge/🤗%20Hugging%20Face-2%20Models-blue)](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
[![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/jhu-clsp/mmBERT)

> **Phase 2 of 3**: High-quality mid-training data mixture (600B tokens) with context extension to 8192 tokens.

This dataset contains the mid-training phase data used to train all [mmBERT encoder models](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4). This phase focuses on higher quality data sources and extends the context length from 1024 to 8192 tokens. The data is provided in **MDS format** ready for use with [Composer](https://github.com/mosaicml/composer) and the [ModernBERT training repository](https://github.com/answerdotai/ModernBERT).

## 📊 Data Composition

| Data Source | Tokens (B) | Percentage | Description |
|:------------|:-----------|:-----------|:------------|
| FineWeb2 | 506.7 | 84.3% | High-quality multilingual web crawl data |
| DCLM (Dolmino) | 40.0 | 6.7% | Filtered high-quality English web data |
| Starcoder | 17.2 | 2.9% | Code repositories and files |
| Arxiv | 5.4 | 0.9% | Academic preprints |
| Dolmino Math | 4.3 | 0.7% | Mathematical content |
| Books | 3.9 | 0.7% | Literature and reference books |
| PeS2o | 3.2 | 0.5% | Scientific papers |
| Tulu Flan | 3.1 | 0.5% | Instruction-following data |
| StackExchange | 3.0 | 0.5% | Q&A forums |
| StackExchange (Dolmino) | 2.8 | 0.5% | Curated Q&A content |
| Wikipedia (MegaWika) | 1.2 | 0.2% | Encyclopedia articles |
| **Total** | **600.8** | **100.0%** | High-quality data for context extension |

## 🌍 Language Coverage

This phase covers **110 languages** plus code, with inverse temperature sampling at τ=0.5. Expands from the initial 60 languages to include:

- **Additional mid-resource languages**: Uzbek, Bosnian, Catalan, Albanian, and 46 others
- **Enhanced quality**: Uses filtered FineWeb2-HQ and higher quality DCLM
- **Longer contexts**: Optimized for 8192 token sequences

## ⚙️ Key Features

- **Context Extension**: RoPE base frequency adjusted to 160k for 8192 token support
- **Quality Upgrade**: Switches to filtered, higher-quality versions of datasets
- **Reduced Masking**: Mask rate lowered to 15% (from 30% in pre-training)
- **Language Expansion**: Adds 50 new languages while maintaining data quality

## 🚀 Usage

For mid-training, see the ModernBERT repo: https://github.com/AnswerDotAI/ModernBERT

### Direct Access

```python
from streaming import StreamingDataset

# Load the streaming dataset
dataset = StreamingDataset(
    remote='https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining',
    local='/tmp/mmbert-midtraining-data',
    shuffle=True
)

# Access samples
for sample in dataset:
    text = sample['text']
    # Process your data...
```

## 🔗 Related Resources

- **Models**: [mmBERT Model Suite](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
- **Phase 1**: [Pre-training Data](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p1-fineweb2-langs) (2.3T tokens)
- **Phase 3**: [Decay Phase Data](https://huggingface.co/datasets/jhu-clsp/mmbert-decay) (100B tokens)
- **Checkpoints**: [Training Checkpoints](https://huggingface.co/datasets/jhu-clsp/mmbert-checkpoints)
- **Paper**: [Arxiv link](https://arxiv.org/abs/2509.06888)
- **Code**: [GitHub Repository](https://github.com/jhu-clsp/mmBERT)

## Citation

```bibtex
@misc{marone2025mmbertmodernmultilingualencoder,
      title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
      author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
      year={2025},
      eprint={2509.06888},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.06888},
}
```
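
## Sketch: Inverse Temperature Sampling

The Language Coverage section mentions inverse temperature sampling at τ=0.5 but does not give the formula. The snippet below is a minimal sketch, assuming the common formulation in which a language with `n_i` tokens is sampled with probability proportional to `n_i ** tau`. The function name `language_sampling_probs` and the token counts are hypothetical illustrations, not taken from the mmBERT code or data.

```python
def language_sampling_probs(token_counts, tau=0.5):
    """Return per-language sampling probabilities proportional to count ** tau.

    Assumed formulation (not confirmed by the dataset card):
        p_i = n_i ** tau / sum_j (n_j ** tau)
    tau = 1.0 keeps the raw proportions; tau = 0.0 samples languages uniformly.
    """
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    weights = {lang: count ** tau for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Hypothetical token counts in billions, chosen only for illustration.
example_counts = {"high_resource": 100.0, "mid_resource": 4.0, "low_resource": 1.0}

raw_total = sum(example_counts.values())
probs = language_sampling_probs(example_counts, tau=0.5)
for lang, count in example_counts.items():
    # Compare each language's raw share of tokens with its sampled share.
    print(f"{lang}: raw share {count / raw_total:.3f} -> sampled share {probs[lang]:.3f}")
```

Under this assumption, a lower τ flattens the distribution: in the toy example the high-resource language falls from roughly 95% of the raw tokens to roughly 77% of the samples, while the low-resource language rises from about 1% to about 8%.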

Provider:
maas
Created:
2025-09-10