mmBERT-pretrain-p3-others
Source: ModelScope community. Updated 2026-05-12; indexed 2025-09-13.
Download link: https://modelscope.cn/datasets/jhu-clsp/mmBERT-pretrain-p3-others
# mmBERT Pre-training Data P3
[License: MIT](https://opensource.org/licenses/MIT)
[Paper](https://arxiv.org/abs/2509.06888)
[Models](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
[GitHub](https://github.com/jhu-clsp/mmBERT)
> **Phase 1 of 3**: Diverse multilingual pre-training data mixture (trained for 2.3T tokens) used to train the mmBERT model suite.
NOTE: **This repository holds only part 3 (P3) of the pre-training data because of HF limits. Download all three parts and combine them into one folder.**
This dataset contains the pre-training phase data used to train all [mmBERT encoder models](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4). The data is provided in **MDS format** ready for use with [Composer](https://github.com/mosaicml/composer) and the [ModernBERT training repository](https://github.com/answerdotai/ModernBERT).
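As a rough illustration of reading MDS shards outside a full training run, the sketch below peeks at the first few samples with the `StreamingDataset` class from the MosaicML `streaming` package. It assumes the data has already been downloaded and that `local_dir` points at a directory containing an MDS `index.json`; the directory layout and the sample field names are not stated in this card, so the function simply returns the raw sample dictionaries. `peek_samples` and its `dataset_cls` parameter are hypothetical names introduced for this example.

```python
from itertools import islice


def peek_samples(local_dir, n=3, dataset_cls=None):
    """Return the first n samples from an MDS directory as a list of dicts.

    dataset_cls is a hypothetical hook so the reader class can be swapped out;
    by default the StreamingDataset class from mosaicml-streaming is used.
    """
    if dataset_cls is None:
        # Imported lazily so the module loads without the package installed.
        from streaming import StreamingDataset  # pip install mosaicml-streaming
        dataset_cls = StreamingDataset
    # shuffle=False keeps the on-disk order, which is convenient for inspection.
    dataset = dataset_cls(local=local_dir, shuffle=False)
    return list(islice(iter(dataset), n))


# Example (assumes a local copy exists at this hypothetical path):
# for sample in peek_samples("mmbert-pretrain/some-subfolder"):
#     print(sample.keys())
```

Printing the keys first is a cheap way to learn the actual field names before writing any preprocessing code.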
## 📊 Data Composition
| Data Source | Tokens (B) | Percentage | Description |
|:------------|:-----------|:-----------|:------------|
| FineWeb2 | 1,196.6 | 60.2% | High-quality multilingual web crawl data |
| DCLM | 600.0 | 30.2% | High-quality English web crawl data |
| Starcoder | 100.6 | 5.1% | Code repositories and files |
| Arxiv | 27.8 | 1.4% | Academic preprints |
| StackExchange | 18.6 | 0.9% | Q&A forums |
| Tulu Flan | 15.3 | 0.8% | Instruction-following data |
| Dolmino Math | 11.2 | 0.6% | Mathematical content |
| PeS2o | 8.4 | 0.4% | Scientific papers |
| Wikipedia (MegaWika) | 4.7 | 0.2% | Encyclopedia articles |
| Books | 4.3 | 0.2% | Literature and reference books |
| StackExchange (Dolmino) | 1.4 | 0.1% | Curated Q&A content |
| **Total** | **1,989.0** | **100.0%** | Diverse mixture for foundation training |
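The percentages in the table can be recomputed from the token counts. The short check below copies the per-source counts from the table and derives each share; the individual rows sum to 1,988.9B, which is within rounding of the listed total of 1,989.0B. The names `TOKENS_B` and `mixture_shares` are introduced for this example only.

```python
# Token counts in billions, copied from the table above.
TOKENS_B = {
    "FineWeb2": 1196.6,
    "DCLM": 600.0,
    "Starcoder": 100.6,
    "Arxiv": 27.8,
    "StackExchange": 18.6,
    "Tulu Flan": 15.3,
    "Dolmino Math": 11.2,
    "PeS2o": 8.4,
    "Wikipedia (MegaWika)": 4.7,
    "Books": 4.3,
    "StackExchange (Dolmino)": 1.4,
}


def mixture_shares(tokens):
    """Return each source's share of the mixture as a percentage."""
    total = sum(tokens.values())
    return {name: 100.0 * count / total for name, count in tokens.items()}


total_b = sum(TOKENS_B.values())
shares = mixture_shares(TOKENS_B)

for name, pct in shares.items():
    print(f"{name:<26} {pct:5.1f}%")
print(f"{'Total (B tokens)':<26} {total_b:7.1f}")
```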
## 🌍 Language Coverage
This phase covers **60 languages** plus code, with an inverse temperature sampling schedule starting at τ=0.7. Languages include:
- **High-resource**: English (34.5%), Russian (5.8%), German (4.4%), Spanish (4.5%), French (4.0%), Chinese (5.2%)
- **Mid-resource**: Italian, Portuguese, Japanese, Dutch, Polish, and 45 others
- **Scripts**: Latin, Cyrillic, Arabic, Chinese, Japanese, Thai, and many more
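This card does not spell out the sampling formula. A common formulation of temperature-based language sampling, assumed here rather than taken from the card, raises each language's raw data size to the power τ and renormalizes, so that τ < 1 shifts probability mass from high-resource toward low-resource languages. The sketch below illustrates that assumed formulation with made-up corpus sizes; the function name and the numbers are hypothetical.

```python
def temperature_sample_probs(sizes, tau):
    """Assumed formulation: p_i is proportional to size_i ** tau.

    tau = 1.0 reproduces the raw proportions; smaller tau flattens them.
    """
    weights = {lang: size ** tau for lang, size in sizes.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Hypothetical corpus sizes (arbitrary units), not figures from this dataset.
raw_sizes = {"en": 1000.0, "de": 100.0, "sw": 1.0}

probs_raw = temperature_sample_probs(raw_sizes, tau=1.0)
probs_tau = temperature_sample_probs(raw_sizes, tau=0.7)

for lang in raw_sizes:
    print(f"{lang}: raw={probs_raw[lang]:.4f}  tau=0.7 -> {probs_tau[lang]:.4f}")
```

Under this assumption, the smallest language is sampled more often at τ=0.7 than its raw share would suggest, and the largest language less often.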
## 🚀 Usage
For pre-training, see the ModernBERT repo: https://github.com/AnswerDotAI/ModernBERT
### Direct Access
Use the [`online_streaming.py` script](https://github.com/JHU-CLSP/mmBERT/blob/main/data/online_streaming.py) to load any section of the dataset on the fly. Streaming fails if you request too many samples, because of HF rate limiting. To download the full dataset, use HF Hub's [Snapshot Download](https://huggingface.co/docs/huggingface_hub/v1.0.0.rc6/en/package_reference/file_download#huggingface_hub.snapshot_download).
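A minimal sketch of the full-download route follows, using `snapshot_download` from `huggingface_hub` with `repo_type="dataset"`. It fetches several dataset repositories into one local folder, in line with the note above about combining all parts. Two things are assumptions: that the parts can share one `local_dir` without conflicting file names, and that the P3 repository id on the Hub matches the ModelScope path. The ids of the other two parts are not given in this card, so they appear as placeholders. `download_parts` and its `downloader` parameter are hypothetical names for this example.

```python
from pathlib import Path


def download_parts(repo_ids, dest, downloader=None):
    """Download each dataset repo into the same local folder and return its path.

    downloader is a hypothetical hook (useful for dry runs); by default the
    snapshot_download function from huggingface_hub is used.
    """
    if downloader is None:
        # Imported lazily so the module loads without the package installed.
        from huggingface_hub import snapshot_download
        downloader = snapshot_download
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    for repo_id in repo_ids:
        # repo_type="dataset" is required for dataset repositories.
        downloader(repo_id=repo_id, repo_type="dataset", local_dir=str(dest_path))
    return dest_path


# Example (P1_REPO_ID and P2_REPO_ID are placeholders to fill in; the P3 id is
# assumed from the ModelScope path):
# download_parts(
#     [P1_REPO_ID, P2_REPO_ID, "jhu-clsp/mmBERT-pretrain-p3-others"],
#     "mmbert-pretrain",
# )
```

Before starting a multi-terabyte transfer, it is worth checking each repository's file listing to confirm that the parts do not overwrite one another when merged.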
## 🔗 Related Resources
- **Models**: [mmBERT Model Suite](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
- **Phase 2**: [Mid-training Data](https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining) (600B tokens)
- **Phase 3**: [Decay Phase Data](https://huggingface.co/datasets/jhu-clsp/mmbert-decay) (100B tokens)
- **Checkpoints**: [Training Checkpoints](https://huggingface.co/datasets/jhu-clsp/mmbert-checkpoints)
- **Paper**: [arXiv:2509.06888](https://arxiv.org/abs/2509.06888)
- **Code**: [GitHub Repository](https://github.com/jhu-clsp/mmBERT)
## Citation
```bibtex
@misc{marone2025mmbertmodernmultilingualencoder,
title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
year={2025},
eprint={2509.06888},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.06888},
}
```
Provider: maas
Created: 2025-09-11



