jhu-clsp/mmBERT-pretrain-p3-others
Source: Hugging Face. Updated: 2025-10-13. Indexed: 2025-09-13.
Download link:
https://hf-mirror.com/datasets/jhu-clsp/mmBERT-pretrain-p3-others
Description:
mmBERT pre-training data P3 is a multilingual pre-training data mixture used to train the mmBERT model suite. It contains 2.3T tokens drawn from a range of sources: high-quality multilingual web crawl data, English web crawl data, code repositories and files, academic preprints, Q&A forums, instruction-following data, mathematical content, scientific papers, encyclopedia articles, and literature and reference books. The dataset covers 60 languages, grouped by resource availability into high-resource and mid-resource languages. It is distributed in MDS format and can be used directly with Composer and the ModernBERT training repository.
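
Since the data ships in MDS format, a local copy can probably be inspected with the `mosaicml-streaming` package. The sketch below is illustrative only: it assumes the files have already been downloaded to a local folder, that each MDS dataset directory contains an `index.json` file (as MDS writers normally produce), and that `mosaicml-streaming` is installed. The helper names `find_mds_dirs` and `iter_samples` and the folder layout are hypothetical; the listing does not describe the directory structure or the sample fields.

```python
from pathlib import Path


def find_mds_dirs(root):
    """Return every directory under `root` that holds an MDS index.json file."""
    return sorted(p.parent for p in Path(root).rglob("index.json"))


def iter_samples(mds_dir, limit=3):
    """Yield up to `limit` samples from one local MDS directory."""
    # Imported lazily so find_mds_dirs works without mosaicml-streaming installed.
    from streaming import StreamingDataset

    dataset = StreamingDataset(local=str(mds_dir), shuffle=False, batch_size=1)
    for i in range(min(limit, len(dataset))):
        yield dataset[i]  # each sample is a dict; its keys depend on the dataset
```

A typical use would be to call `find_mds_dirs` on the download folder, pick one directory, and print the keys of the first sample from `iter_samples` to see which fields the shards actually contain.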
Provider:
jhu-clsp



