jhu-clsp/mmBERT-pretrain-p1-fineweb2-langs
Source: Hugging Face. Updated 2025-10-13; indexed 2025-09-13.
Download link:
https://hf-mirror.com/datasets/jhu-clsp/mmBERT-pretrain-p1-fineweb2-langs
Summary:
mmBERT Pre-training Data P1 is the first-phase data used to train the mmBERT model. It is a diverse multilingual pre-training mixture containing 2.3T tokens, drawn from high-quality multilingual web crawl data, English web crawl data, code repositories, academic preprints, Q&A forums, mathematical content, scientific papers, encyclopedia articles, and literary works. The dataset covers 60 languages, including high-resource and mid-resource languages and languages written in a variety of scripts.
Provider:
jhu-clsp
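
Because the mixture is 2.3T tokens, it may be more practical to stream a few records than to download everything. The following is a minimal sketch, not an official loader: it assumes the Hugging Face `datasets` package is installed, that the repository loads with its default configuration, and that a split named "train" exists. None of these is confirmed by this listing. The helper names `dataset_url` and `stream_samples` are hypothetical.

```python
import itertools
import os

# Repository id and mirror host are taken from the download link in this listing.
REPO_ID = "jhu-clsp/mmBERT-pretrain-p1-fineweb2-langs"
MIRROR_ENDPOINT = "https://hf-mirror.com"


def dataset_url(repo_id: str = REPO_ID, endpoint: str = MIRROR_ENDPOINT) -> str:
    """Build the dataset page URL for a given hub endpoint."""
    return f"{endpoint.rstrip('/')}/datasets/{repo_id}"


def stream_samples(n: int = 3, split: str = "train", use_mirror: bool = True):
    """Return the first n records without downloading the whole corpus.

    Assumptions (not confirmed by the listing): the default configuration
    loads as-is and a split named "train" exists.
    """
    if use_mirror:
        # HF_ENDPOINT is read when huggingface_hub is first imported,
        # so it is set here, before the lazy import below.
        os.environ.setdefault("HF_ENDPOINT", MIRROR_ENDPOINT)
    from datasets import load_dataset  # lazy import: requires `pip install datasets`

    stream = load_dataset(REPO_ID, split=split, streaming=True)
    return list(itertools.islice(stream, n))
```

Calling `stream_samples(2)` would fetch two records over the network. If the repository defines several configurations, `load_dataset` will likely need an explicit configuration name as its second argument.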



