
mmBERT-pretrain-p1-fineweb2-langs

ModelScope community · Updated 2025-09-26 · Indexed 2025-09-20
Download link:
https://modelscope.cn/datasets/jhu-clsp/mmBERT-pretrain-p1-fineweb2-langs
Resource description:
# mmBERT Pre-training Data P1

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Paper](https://img.shields.io/badge/Paper-Arxiv-red)](https://arxiv.org/abs/2509.06888)
[![Models](https://img.shields.io/badge/🤗%20Hugging%20Face-2%20Models-blue)](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
[![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/jhu-clsp/mmBERT)

> **Phase 1 of 3**: Diverse multilingual pre-training data mixture (trained for 2.3T tokens) used to train the mmBERT model suite.
>
> NOTE: **this is only P1 of the pre-training data due to HF limits, you need to download and combine all three into one folder**

This dataset contains the pre-training phase data used to train all [mmBERT encoder models](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4). The data is provided in **MDS format** ready for use with [Composer](https://github.com/mosaicml/composer) and the [ModernBERT training repository](https://github.com/answerdotai/ModernBERT).
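Because the note above asks for the parts to be combined into one folder, a quick check of what actually landed on disk may help. The sketch below is not part of the mmBERT tooling. It assumes the MDS layout written by the MosaicML Streaming library, where each dataset directory holds an `index.json` whose `shards` list records a `samples` count per shard. The function name `summarize_mds_root` and the folder path in the usage line are hypothetical.

```python
import json
from pathlib import Path


def summarize_mds_root(root):
    """Count shards and samples for every index.json found below `root`.

    Assumption: each index.json follows the MDS convention of a top-level
    "shards" list whose entries carry an integer "samples" field.
    """
    summary = {}
    for index_path in sorted(Path(root).rglob("index.json")):
        with index_path.open(encoding="utf-8") as f:
            index = json.load(f)
        shards = index.get("shards", [])
        summary[str(index_path.parent)] = {
            "shards": len(shards),
            "samples": sum(int(s.get("samples", 0)) for s in shards),
        }
    return summary
```

A call such as `summarize_mds_root("/data/mmbert-pretrain")` (hypothetical path) returns one entry per directory that contains an index file, which makes a missing or partially downloaded part easy to spot.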
## 📊 Data Composition

| Data Source | Tokens (B) | Percentage | Description |
|:------------|:-----------|:-----------|:------------|
| FineWeb2 | 1,196.6 | 60.2% | High-quality multilingual web crawl data |
| DCLM | 600.0 | 30.2% | High-quality English web crawl data |
| Starcoder | 100.6 | 5.1% | Code repositories and files |
| Arxiv | 27.8 | 1.4% | Academic preprints |
| StackExchange | 18.6 | 0.9% | Q&A forums |
| Tulu Flan | 15.3 | 0.8% | Instruction-following data |
| Dolmino Math | 11.2 | 0.6% | Mathematical content |
| PeS2o | 8.4 | 0.4% | Scientific papers |
| Wikipedia (MegaWika) | 4.7 | 0.2% | Encyclopedia articles |
| Books | 4.3 | 0.2% | Literature and reference books |
| StackExchange (Dolmino) | 1.4 | 0.1% | Curated Q&A content |
| **Total** | **1,989.0** | **100.0%** | Diverse mixture for foundation training |

## 🌍 Language Coverage

This phase covers **60 languages** plus code, with an inverse temperature sampling schedule starting at τ=0.7. Languages include:

- **High-resource**: English (34.5%), Russian (5.8%), German (4.4%), Spanish (4.5%), French (4.0%), Chinese (5.2%)
- **Mid-resource**: Italian, Portuguese, Japanese, Dutch, Polish, and 45 others
- **Scripts**: Latin, Cyrillic, Arabic, Chinese, Japanese, Thai, and many more

## 🚀 Usage

For pre-training, see the ModernBERT repo: https://github.com/AnswerDotAI/ModernBERT

### Direct Access

```python
from streaming import StreamingDataset

# Load the streaming dataset
dataset = StreamingDataset(
    remote='https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p1-fineweb2-langs',
    local='/tmp/mmbert-pretraining-data',
    shuffle=True
)

# Access samples
for sample in dataset:
    text = sample['text']
    # Process your data...
```

## 🔗 Related Resources

- **Models**: [mmBERT Model Suite](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
- **Phase 2**: [Mid-training Data](https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining) (600B tokens)
- **Phase 3**: [Decay Phase Data](https://huggingface.co/datasets/jhu-clsp/mmbert-decay) (100B tokens)
- **Checkpoints**: [Training Checkpoints](https://huggingface.co/datasets/jhu-clsp/mmbert-checkpoints)
- **Paper**: [Arxiv link](https://arxiv.org/abs/2509.06888)
- **Hugging Face Paper**: [mmBERT: A Modern Multilingual Encoder with Annealed Language Learning](https://huggingface.co/papers/2509.06888)
- **Code**: [GitHub Repository](https://github.com/jhu-clsp/mmBERT)

## Citation

```bibtex
@misc{marone2025mmbertmodernmultilingualencoder,
  title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
  author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
  year={2025},
  eprint={2509.06888},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.06888},
}
```
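The Language Coverage section mentions an inverse temperature sampling schedule starting at τ=0.7 but does not spell out the formula. As a rough illustration only, the sketch below assumes the common convention in multilingual pre-training where a language with natural share q_i is sampled with probability proportional to q_i^τ, so that τ < 1 shifts mass toward lower-resource languages. The function name and the token counts are hypothetical and are not taken from this dataset.

```python
def temperature_sampling_probs(token_counts, tau):
    """Return sampling probabilities proportional to (n_i / N) ** tau.

    Assumption: this is the usual temperature re-weighting convention,
    not a formula confirmed by the dataset card.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    total = sum(token_counts.values())
    # Exponentiate each language's natural share, then renormalise.
    weights = {lang: (n / total) ** tau for lang, n in token_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical token counts in billions, chosen only to show the effect.
counts = {"high": 900.0, "mid": 90.0, "low": 10.0}
probs = temperature_sampling_probs(counts, tau=0.7)
for lang, p in probs.items():
    print(f"{lang}: natural {counts[lang] / 1000.0:.3f} -> sampled {p:.3f}")
```

With τ=1 the function returns the natural proportions unchanged; lowering τ flattens the distribution further, which is the sense in which a schedule that starts at 0.7 and decreases would give low-resource languages more weight over time.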

Provider: maas
Created: 2025-09-10