UCLNLP/monoweb-dataset
Source: Hugging Face | Updated: 2026-04-23 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/UCLNLP/monoweb-dataset
Description:
MonoWeb is a multilingual pretraining corpus derived from FineWeb-Edu (English) and FineWeb2 (German, Spanish, French) by systematically removing all mixed-language documents. The release includes the full source corpora (60B tokens per language, 240B in total) as well as the removed bilingual documents. The dataset accompanies a research paper, and pretrained models are available.
Provided by:
UCLNLP



