mmBERT-midtraining-data
Source: ModelScope community (updated 2025-09-28, indexed 2025-09-20)
Download link: https://modelscope.cn/datasets/jhu-clsp/mmBERT-midtraining-data
# mmBERT Mid-training Data
[License: MIT](https://opensource.org/licenses/MIT)
[Paper](https://arxiv.org/abs/2509.06888)
[Models](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
[GitHub](https://github.com/jhu-clsp/mmBERT)
> **Phase 2 of 3**: High-quality mid-training data mixture (600B tokens) with context extension to 8192 tokens.
This dataset contains the mid-training phase data used to train all [mmBERT encoder models](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4). This phase focuses on higher quality data sources and extends the context length from 1024 to 8192 tokens. The data is provided in **MDS format** ready for use with [Composer](https://github.com/mosaicml/composer) and the [ModernBERT training repository](https://github.com/answerdotai/ModernBERT).
## 📊 Data Composition
| Data Source | Tokens (B) | Percentage | Description |
|:------------|:-----------|:-----------|:------------|
| FineWeb2 | 506.7 | 84.3% | High-quality multilingual web crawl data |
| DCLM (Dolmino) | 40.0 | 6.7% | Filtered high-quality English web data |
| Starcoder | 17.2 | 2.9% | Code repositories and files |
| Arxiv | 5.4 | 0.9% | Academic preprints |
| Dolmino Math | 4.3 | 0.7% | Mathematical content |
| Books | 3.9 | 0.7% | Literature and reference books |
| PeS2o | 3.2 | 0.5% | Scientific papers |
| Tulu Flan | 3.1 | 0.5% | Instruction-following data |
| StackExchange | 3.0 | 0.5% | Q&A forums |
| StackExchange (Dolmino) | 2.8 | 0.5% | Curated Q&A content |
| Wikipedia (MegaWika) | 1.2 | 0.2% | Encyclopedia articles |
| **Total** | **600.8** | **100.0%** | High-quality data for context extension |
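The percentage column can be reproduced from the token counts. The following is a minimal sketch, not part of the original release: it copies the table values into a dictionary (the names `TOKENS_B`, `STATED_TOTAL_B`, and `shares` are illustrative) and divides each count by the stated 600.8B total. The eleven rows as listed sum to 590.8B, so roughly 10B tokens (about 1.6%) of the stated total are not itemized in the table.

```python
# Token counts in billions, copied from the table above.
TOKENS_B = {
    "FineWeb2": 506.7,
    "DCLM (Dolmino)": 40.0,
    "Starcoder": 17.2,
    "Arxiv": 5.4,
    "Dolmino Math": 4.3,
    "Books": 3.9,
    "PeS2o": 3.2,
    "Tulu Flan": 3.1,
    "StackExchange": 3.0,
    "StackExchange (Dolmino)": 2.8,
    "Wikipedia (MegaWika)": 1.2,
}
STATED_TOTAL_B = 600.8  # the "Total" row of the table


def shares(tokens_b, total_b):
    """Return each source's share of total_b as a percentage."""
    return {name: 100.0 * count / total_b for name, count in tokens_b.items()}


itemized_total = sum(TOKENS_B.values())
pct = shares(TOKENS_B, STATED_TOTAL_B)

for name, value in pct.items():
    print(f"{name:<26}{value:5.1f}%")
print(f"itemized rows sum to {itemized_total:.1f}B of {STATED_TOTAL_B}B stated")
```

Small differences from the table's percentage column (at most 0.1 points) come from the token counts themselves being rounded to one decimal place.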
## 🌍 Language Coverage
This phase covers **110 languages** plus code, sampled with inverse temperature sampling at τ=0.5. Relative to the initial 60 languages, this phase brings:
- **Additional mid-resource languages**: Uzbek, Bosnian, Catalan, Albanian, and 46 others
- **Enhanced quality**: Uses filtered FineWeb2-HQ and higher quality DCLM
- **Longer contexts**: Optimized for 8192 token sequences
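To illustrate the effect of τ=0.5, the sketch below assumes the common formulation in which a language's sampling probability is proportional to its token count raised to the power τ. The card does not spell out the exact formula, so treat this as an assumption. The token counts are made-up values for illustration, not statistics from this dataset, and `temperature_sampling_probs` is a hypothetical helper.

```python
def temperature_sampling_probs(token_counts, tau):
    """Sampling probabilities proportional to count ** tau.

    tau = 1.0 reproduces the raw data proportions; a smaller tau flattens
    the distribution and upweights lower-resource languages.
    """
    weights = {lang: count ** tau for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Hypothetical token counts (in billions), NOT taken from the dataset.
counts = {"en": 100.0, "ca": 4.0, "uz": 1.0}

raw = temperature_sampling_probs(counts, tau=1.0)
flat = temperature_sampling_probs(counts, tau=0.5)

for lang in counts:
    print(f"{lang}: proportional={raw[lang]:.3f}  tau=0.5={flat[lang]:.3f}")
```

With these example counts, the smallest language is sampled far more often at τ=0.5 than its raw share of the data would suggest, while the largest language is sampled less often.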
## ⚙️ Key Features
- **Context Extension**: RoPE base frequency adjusted to 160k for 8192 token support
- **Quality Upgrade**: Switches to filtered, higher-quality versions of datasets
- **Reduced Masking**: Mask rate lowered to 15% (from 30% in pre-training)
- **Language Expansion**: Adds 50 new languages while maintaining data quality
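The context-extension item refers to the base used in rotary position embeddings (RoPE). The sketch below uses the standard RoPE formulation, in which dimension pair `i` rotates with inverse frequency `base ** (-2i / head_dim)`. The head dimension of 64 and the comparison base of 10,000 are illustrative assumptions, not values stated in this card, and both function names are hypothetical. It shows that raising the base to 160,000 lengthens the slowest-rotating wavelength, which is the usual motivation for a larger base when training on longer sequences.

```python
import math


def rope_inv_freq(head_dim, base):
    """Inverse frequencies base ** (-2i / head_dim), one per dimension pair."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim, base):
    """Positions needed for the slowest-rotating pair to complete one turn."""
    return 2.0 * math.pi / rope_inv_freq(head_dim, base)[-1]


HEAD_DIM = 64  # assumed for illustration; not specified in this card

for base in (10_000.0, 160_000.0):
    wavelength = longest_wavelength(HEAD_DIM, base)
    print(f"base={base:>9.0f}  longest wavelength ~ {wavelength:,.0f} positions")
```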
## 🚀 Usage
For mid-training, see the ModernBERT repo: https://github.com/AnswerDotAI/ModernBERT
### Direct Access
```python
from streaming import StreamingDataset

# Load the streaming dataset
dataset = StreamingDataset(
    remote='https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining',
    local='/tmp/mmbert-midtraining-data',
    shuffle=True
)

# Access samples
for sample in dataset:
    text = sample['text']
    # Process your data...
```
## 🔗 Related Resources
- **Models**: [mmBERT Model Suite](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
- **Phase 1**: [Pre-training Data](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p1-fineweb2-langs) (2.3T tokens)
- **Phase 3**: [Decay Phase Data](https://huggingface.co/datasets/jhu-clsp/mmbert-decay) (100B tokens)
- **Checkpoints**: [Training Checkpoints](https://huggingface.co/datasets/jhu-clsp/mmbert-checkpoints)
- **Paper**: [Arxiv link](https://arxiv.org/abs/2509.06888)
- **Code**: [GitHub Repository](https://github.com/jhu-clsp/mmBERT)
## Citation
```bibtex
@misc{marone2025mmbertmodernmultilingualencoder,
  title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
  author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
  year={2025},
  eprint={2509.06888},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.06888},
}
```
Provider: maas
Created: 2025-09-10



