jhu-clsp/mmBERT-pretrain-p3-others

Name: jhu-clsp/mmBERT-pretrain-p3-others
Creator: jhu-clsp
Published: 2025-10-13 18:56:19
License: No description available

Source: Hugging Face. Updated: 2025-10-13. Indexed: 2025-09-13.

Download link:

https://hf-mirror.com/datasets/jhu-clsp/mmBERT-pretrain-p3-others

Overview:

mmBERT pre-training data P3 is a diverse multilingual pre-training data mixture used to train the mmBERT model suite, containing 2.3T tokens of training data. The dataset is composed of various sources including high-quality multilingual web crawl data, English web crawl data, code repositories and files, academic preprints, Q&A forums, instruction-following data, mathematical content, scientific papers, encyclopedia articles, and literature and reference books. The dataset covers 60 languages and is categorized by resource richness into high-resource and mid-resource languages. The dataset is provided in MDS format and is ready for use with Composer and the ModernBERT training repository.
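
Since the data ships in MDS format, a local copy can in principle be inspected with the MosaicML Streaming library (`mosaicml-streaming`). The following is a minimal sketch, not the maintainers' loading code: it assumes the dataset has already been downloaded to a local folder, that each MDS directory contains an `index.json` file (as MDS writers normally produce), and that the record fields are whatever the shards define. The helper names `find_mds_shard_dirs` and `iter_samples` are hypothetical, and the folder layout of this particular dataset is not stated in the listing.

```python
from pathlib import Path


def find_mds_shard_dirs(root):
    """Return, sorted, every directory under `root` that holds an MDS index.json."""
    return sorted(p.parent for p in Path(root).rglob("index.json"))


def iter_samples(shard_dir, limit=5):
    """Yield up to `limit` records from one local MDS directory.

    Assumption: `shard_dir` is a complete local MDS directory.
    The import is lazy so find_mds_shard_dirs works without
    mosaicml-streaming installed.
    """
    from streaming import StreamingDataset

    dataset = StreamingDataset(local=str(shard_dir), shuffle=False, batch_size=1)
    for i in range(min(limit, len(dataset))):
        # Each record is a dict; its keys depend on how the shards were written.
        yield dataset[i]
```

A typical use would be to call `find_mds_shard_dirs` on the download folder, pick one directory, and print the keys of the first few records from `iter_samples` to learn the schema before wiring the data into a training run.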

Provider:

jhu-clsp
