mmBERT-pretrain-p3-others

Source: ModelScope community. Updated 2026-05-12; indexed 2025-09-13.
Download link:
https://modelscope.cn/datasets/jhu-clsp/mmBERT-pretrain-p3-others
Resource description:
# mmBERT Pre-training Data P3

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Paper](https://img.shields.io/badge/Paper-Arxiv-red)](https://arxiv.org/abs/2509.06888)
[![Models](https://img.shields.io/badge/🤗%20Hugging%20Face-2%20Models-blue)](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
[![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/jhu-clsp/mmBERT)

> **Phase 1 of 3**: Diverse multilingual pre-training data mixture (trained for 2.3T tokens) used to train the mmBERT model suite.

NOTE: **This is only P3 (one of three parts) of the pre-training data, which is split because of Hugging Face limits. Download all three parts and combine them into one folder.**

This dataset contains the pre-training phase data used to train all [mmBERT encoder models](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4). The data is provided in **MDS format**, ready for use with [Composer](https://github.com/mosaicml/composer) and the [ModernBERT training repository](https://github.com/answerdotai/ModernBERT).
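
The note above asks for all three parts to end up in one folder. A minimal sketch of one way to do that with `huggingface_hub.snapshot_download` follows. Several things here are assumptions rather than facts from this card: the helper names `build_download_plan` and `download_all` are hypothetical, the P3 repo id is copied from this page's download link and may need checking against Hugging Face, the ids of the other two parts are not listed here and must be supplied by you, and the sketch assumes the parts do not contain conflicting file paths.

```python
from pathlib import Path
from typing import Dict, List

# Repo id as it appears in this page's download link (assumption: the
# Hugging Face id matches). The ids of the other two parts are NOT given
# on this page, so the caller must supply them.
P3_REPO_ID = "jhu-clsp/mmBERT-pretrain-p3-others"


def build_download_plan(repo_ids: List[str], target_dir: str) -> List[Dict[str, str]]:
    """Build one set of snapshot_download keyword arguments per repository.

    Every entry shares the same local_dir so that all parts land in one folder.
    """
    target = str(Path(target_dir))
    return [
        {"repo_id": repo_id, "repo_type": "dataset", "local_dir": target}
        for repo_id in repo_ids
    ]


def download_all(repo_ids: List[str], target_dir: str) -> List[str]:
    """Download every repository in repo_ids into target_dir (needs network)."""
    # Imported lazily so the planning helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    plan = build_download_plan(repo_ids, target_dir)
    return [snapshot_download(**kwargs) for kwargs in plan]
```

A call would pass the two other part ids together with `P3_REPO_ID` and a single target directory. Files that share the same relative path across parts (for example a `README.md`) would overwrite each other, so it is worth inspecting the folder after the downloads finish.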

## 📊 Data Composition

| Data Source | Tokens (B) | Percentage | Description |
|:------------|:-----------|:-----------|:------------|
| FineWeb2 | 1,196.6 | 60.2% | High-quality multilingual web crawl data |
| DCLM | 600.0 | 30.2% | High-quality English web crawl data |
| Starcoder | 100.6 | 5.1% | Code repositories and files |
| Arxiv | 27.8 | 1.4% | Academic preprints |
| StackExchange | 18.6 | 0.9% | Q&A forums |
| Tulu Flan | 15.3 | 0.8% | Instruction-following data |
| Dolmino Math | 11.2 | 0.6% | Mathematical content |
| PeS2o | 8.4 | 0.4% | Scientific papers |
| Wikipedia (MegaWika) | 4.7 | 0.2% | Encyclopedia articles |
| Books | 4.3 | 0.2% | Literature and reference books |
| StackExchange (Dolmino) | 1.4 | 0.1% | Curated Q&A content |
| **Total** | **1,989.0** | **100.0%** | Diverse mixture for foundation training |

## 🌍 Language Coverage

This phase covers **60 languages** plus code, with an inverse temperature sampling schedule starting at τ=0.7. Languages include:

- **High-resource**: English (34.5%), Russian (5.8%), German (4.4%), Spanish (4.5%), French (4.0%), Chinese (5.2%)
- **Mid-resource**: Italian, Portuguese, Japanese, Dutch, Polish, and 45 others
- **Scripts**: Latin, Cyrillic, Arabic, Chinese, Japanese, Thai, and many more

## 🚀 Usage

For pre-training, see the ModernBERT repo: https://github.com/AnswerDotAI/ModernBERT

### Direct Access

Use the script at [this link](https://github.com/JHU-CLSP/mmBERT/blob/main/data/online_streaming.py) to load any section of the dataset on the fly. Streaming fails if you request too many samples, because of Hugging Face rate limiting. To download the full dataset, use HF Hub's [Snapshot Download](https://huggingface.co/docs/huggingface_hub/v1.0.0.rc6/en/package_reference/file_download#huggingface_hub.snapshot_download).
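
The Language Coverage section mentions temperature-based language sampling starting at τ=0.7 but does not spell out the formula. The sketch below assumes the common formulation in which a language with `n_i` tokens is sampled with probability proportional to `n_i ** tau`; that formula, the function name `temperature_sampling_probs`, and the token counts in the usage note are illustrative assumptions, not values taken from this dataset.

```python
from typing import Dict


def temperature_sampling_probs(token_counts: Dict[str, float], tau: float) -> Dict[str, float]:
    """Return sampling probabilities proportional to count ** tau.

    tau = 1.0 reproduces the raw token proportions, while tau = 0.0 gives a
    uniform distribution over languages. This is an assumed formulation.
    """
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    if any(count <= 0 for count in token_counts.values()):
        raise ValueError("all token counts must be positive")

    # Raise each count to the power tau, then normalize so the result sums to 1.
    weights = {lang: count ** tau for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}
```

For example, with hypothetical counts `{"high": 1000.0, "low": 10.0}`, calling the function with `tau=0.7` gives the low-resource language a larger share than its raw proportion, and lowering `tau` further moves the distribution closer to uniform.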

## 🔗 Related Resources

- **Models**: [mmBERT Model Suite](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
- **Phase 2**: [Mid-training Data](https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining) (600B tokens)
- **Phase 3**: [Decay Phase Data](https://huggingface.co/datasets/jhu-clsp/mmbert-decay) (100B tokens)
- **Checkpoints**: [Training Checkpoints](https://huggingface.co/datasets/jhu-clsp/mmbert-checkpoints)
- **Paper**: [Arxiv link](https://arxiv.org/abs/2509.06888)
- **Code**: [GitHub Repository](https://github.com/jhu-clsp/mmBERT)

## Citation

```bibtex
@misc{marone2025mmbertmodernmultilingualencoder,
      title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
      author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
      year={2025},
      eprint={2509.06888},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.06888},
}
```

Provider:
maas
Created:
2025-09-11