UCLNLP/monoweb-dataset

Name: UCLNLP/monoweb-dataset
Creator: UCLNLP
Published: 2026-04-23 06:17:42
License: no description available

Hugging Face | Updated 2026-04-23 | Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/UCLNLP/monoweb-dataset

Description:

The MonoWeb Dataset is a multilingual pretraining corpus derived from FineWeb-Edu (English) and FineWeb2 (German, Spanish, French) by systematically removing all mixed-language documents. It contains the full source corpora (60B tokens per language, 240B total across the four languages) as well as the removed bilingual documents. The dataset accompanies a research paper, and pretrained models are available.
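The description does not say how mixed-language documents were identified, so the following is only a toy sketch of the general idea, not the authors' pipeline. It assumes each document already carries a set of detected language codes; the `Doc` class, its `langs` field, and the `split_monolingual` helper are hypothetical names introduced for illustration. It also checks the stated token arithmetic (four languages at 60B tokens each).

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    langs: frozenset  # hypothetical field: language codes detected in the document


def split_monolingual(docs):
    """Partition docs into (kept, removed).

    A document is kept only if exactly one language was detected;
    anything else (several languages, or none) goes to the removed list.
    """
    kept, removed = [], []
    for doc in docs:
        (kept if len(doc.langs) == 1 else removed).append(doc)
    return kept, removed


# Made-up example documents, not taken from the dataset.
docs = [
    Doc("The cat sat on the mat.", frozenset({"en"})),
    Doc("Hello means Hallo in German.", frozenset({"en", "de"})),
    Doc("Le chat dort.", frozenset({"fr"})),
]
kept, removed = split_monolingual(docs)

# Token budget as stated in the description: 60B tokens per language.
TOKENS_PER_LANGUAGE_B = 60
LANGUAGES = ["en", "de", "es", "fr"]
total_tokens_b = TOKENS_PER_LANGUAGE_B * len(LANGUAGES)
```

Keeping the removed documents in their own list mirrors the listing's statement that the removed bilingual documents are published alongside the source corpora.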

Provided by:

UCLNLP
