
Ayushnangia/dolma3-hq-2M-modernbert

Source: Hugging Face. Updated: 2026-01-19. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Ayushnangia/dolma3-hq-2M-modernbert
Description:
---
language:
- en
license: apache-2.0
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- fill-mask
tags:
- pretraining
- modernbert
- diffusion-lm
- dolma
pretty_name: Dolma3 High-Quality 2M (ModernBERT Filtered)
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 2000000
---

# Dolma3 High-Quality 2M (ModernBERT Filtered)

A curated subset of 2 million high-quality text samples from [allenai/dolma3_dolmino_mix-100B-1125](https://huggingface.co/datasets/allenai/dolma3_dolmino_mix-100B-1125), filtered to fit within ModernBERT's 8192 token context window.

## Dataset Description

This dataset is designed for pretraining diffusion language models based on ModernBERT. Each sample has been:

1. **Source filtered**: Only from `ingredient1-common_crawl-high-quality` folders (highest quality web text)
2. **Length filtered**: Minimum 200 characters
3. **Token filtered**: Maximum 8192 tokens using ModernBERT tokenizer (samples exceeding this are excluded, not truncated)
4. **Randomly sampled**: True random sampling from 2.4M collected samples down to 2M

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("Ayushnangia/dolma3-hq-2M-modernbert")
print(f"Samples: {len(dataset['train']):,}")
print(dataset['train'][0]['text'][:500])
```

### For ModernBERT Diffusion LM Pretraining

```bash
python scripts/mb_pretrain.py \
    --dataset Ayushnangia/dolma3-hq-2M-modernbert \
    --text-column text
```

## Dataset Statistics

| Statistic | Value |
|-----------|-------|
| Total samples | 2,000,000 |
| Source | Dolma3 ingredient1 high-quality |
| Min chars | 200 |
| Max tokens | 8192 (ModernBERT) |
| Language | English |
| Size | ~11 GB |

## Source

- **Original dataset**: [allenai/dolma3_dolmino_mix-100B-1125](https://huggingface.co/datasets/allenai/dolma3_dolmino_mix-100B-1125)
- **Tokenizer**: [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large)

## License

Apache 2.0 (following Dolma3's license)

## Citation

If you use this dataset, please cite the original Dolma3 dataset:

```bibtex
@article{dolma,
  title={Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research},
  author={Soldaini, Luca and others},
  journal={arXiv preprint},
  year={2024}
}
```
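
## Filtering Sketch

The length, token, and sampling steps listed under Dataset Description could be reproduced roughly as below. This is a minimal sketch, not the script used to build the dataset: the function names, the `seed` parameter, and the whitespace-based token counter are all hypothetical, and the source-folder filter is left out. For a real run one would presumably pass a counter backed by the ModernBERT tokenizer instead, for example `lambda t: len(tok(t)["input_ids"])` with `tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")`; whether special tokens counted toward the 8192 limit is not stated in the card.

```python
import random
from typing import Callable, Iterable, List


def filter_and_sample(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    min_chars: int = 200,
    max_tokens: int = 8192,
    sample_size: int = 2_000_000,
    seed: int = 0,  # assumption: the card does not mention a seed
) -> List[str]:
    """Drop texts failing either length filter, then sample uniformly at random."""
    kept = [
        t for t in texts
        # Over-long texts are excluded entirely, never truncated.
        if len(t) >= min_chars and count_tokens(t) <= max_tokens
    ]
    if len(kept) <= sample_size:
        return kept
    return random.Random(seed).sample(kept, sample_size)


def whitespace_token_count(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


if __name__ == "__main__":
    docs = ["too short", "word " * 100, "word " * 10_000, "other " * 50]
    picked = filter_and_sample(docs, whitespace_token_count, sample_size=1)
    print(len(picked))
```

Taking the token counter as a parameter keeps the sketch runnable offline and makes the tokenizer choice explicit at the call site.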
Provider:
Ayushnangia