Ayushnangia/dolma3-hq-2M-modernbert
Source: Hugging Face. Updated 2026-01-19. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Ayushnangia/dolma3-hq-2M-modernbert
Resource description:
---
language:
- en
license: apache-2.0
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- fill-mask
tags:
- pretraining
- modernbert
- diffusion-lm
- dolma
pretty_name: Dolma3 High-Quality 2M (ModernBERT Filtered)
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 2000000
---
# Dolma3 High-Quality 2M (ModernBERT Filtered)
A curated subset of 2 million high-quality text samples from [allenai/dolma3_dolmino_mix-100B-1125](https://huggingface.co/datasets/allenai/dolma3_dolmino_mix-100B-1125), filtered to fit within ModernBERT's 8192 token context window.
## Dataset Description
This dataset is designed for pretraining diffusion language models based on ModernBERT. Each sample has been:
1. **Source filtered**: Only from `ingredient1-common_crawl-high-quality` folders (highest quality web text)
2. **Length filtered**: Minimum 200 characters
3. **Token filtered**: Maximum 8192 tokens using ModernBERT tokenizer (samples exceeding this are excluded, not truncated)
4. **Randomly sampled**: Drawn by random sampling from the 2.4M collected samples down to 2M
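
The length, token, and sampling steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the script used to build the dataset: the function names `keep_sample` and `filter_and_sample`, the `seed` parameter, and the injected `count_tokens` callable are all hypothetical. The token counter is passed in so the sketch runs without downloading a tokenizer; the comments show one way it could be backed by the ModernBERT tokenizer.

```python
import random
from typing import Callable, Iterable, List

MIN_CHARS = 200    # minimum length in characters (from the card)
MAX_TOKENS = 8192  # ModernBERT context window (from the card)


def keep_sample(text: str, count_tokens: Callable[[str], int]) -> bool:
    """Return True if `text` passes the character and token filters."""
    if len(text) < MIN_CHARS:
        return False
    # Over-length samples are excluded entirely, not truncated.
    return count_tokens(text) <= MAX_TOKENS


def filter_and_sample(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    target_size: int,
    seed: int = 0,  # hypothetical: the original seed is not documented
) -> List[str]:
    """Filter `texts`, then sample down to `target_size` without replacement."""
    kept = [t for t in texts if keep_sample(t, count_tokens)]
    if len(kept) <= target_size:
        return kept
    return random.Random(seed).sample(kept, target_size)


# One possible token counter (assumption: requires `transformers`, and
# whether special tokens were counted originally is not stated):
#
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")
#   count_tokens = lambda text: len(tok(text)["input_ids"])


def whitespace_count(text: str) -> int:
    """Stand-in token counter for demonstration only."""
    return len(text.split())
```

Calling `filter_and_sample(texts, whitespace_count, target_size=2_000_000)` would mirror the 2.4M to 2M reduction in shape, though with a stand-in token counter rather than the real tokenizer.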
## Usage
```python
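from datasets import load_dataset

# Download the dataset and load it; the only split is "train".
dataset = load_dataset("Ayushnangia/dolma3-hq-2M-modernbert")
print(f"Samples: {len(dataset['train']):,}")
# Preview the first 500 characters of the first sample.
print(dataset['train'][0]['text'][:500])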
```
### For ModernBERT Diffusion LM Pretraining
```bash
python scripts/mb_pretrain.py \
--dataset Ayushnangia/dolma3-hq-2M-modernbert \
--text-column text
```
## Dataset Statistics
| Statistic | Value |
|-----------|-------|
| Total samples | 2,000,000 |
| Source | Dolma3 ingredient1 high-quality |
| Min chars | 200 |
| Max tokens | 8192 (ModernBERT) |
| Language | English |
| Size | ~11 GB |
## Source
- **Original dataset**: [allenai/dolma3_dolmino_mix-100B-1125](https://huggingface.co/datasets/allenai/dolma3_dolmino_mix-100B-1125)
- **Tokenizer**: [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large)
## License
Apache 2.0 (following Dolma3's license)
## Citation
If you use this dataset, please cite the original Dolma3 dataset:
```bibtex
@article{dolma,
title={Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research},
author={Soldaini, Luca and others},
journal={arXiv preprint},
year={2024}
}
```
Provided by:
Ayushnangia



