HPLT/HPLT2.0_cleaned

Name: HPLT/HPLT2.0_cleaned
Creator: HPLT
Published: 2025-11-13 15:46:25
License: No description available

Source: Hugging Face | Updated: 2025-11-13 | Indexed: 2024-12-14

Download link:

https://hf-mirror.com/datasets/HPLT/HPLT2.0_cleaned

Overview:

HPLT Datasets v2.0 (cleaned variant) is a large-scale collection of web-crawled documents in 191 world languages, sourced primarily from the Internet Archive and Common Crawl. It is part of the HPLT project and has been converted to Parquet format. The dataset is tagged for the fill-mask and text-generation tasks, with a focus on language modeling. The README includes a table that lists, for each language code, the amount of text in segments, words, characters, and documents.
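As a rough illustration of the per-language statistics mentioned above (segments, words, characters, documents), the following minimal sketch tallies them over a handful of in-memory records. The helper name `text_stats`, the `text` field name, and the definitions used here (a segment is a non-empty line, a word is a whitespace-delimited token) are assumptions made for illustration; the HPLT project may count these differently.

```python
from typing import Dict, Iterable, Mapping


def text_stats(docs: Iterable[Mapping[str, str]], text_field: str = "text") -> Dict[str, int]:
    """Tally documents, segments, words, and characters over an iterable of records.

    Hypothetical helper: the field name and counting rules are assumptions.
    """
    stats = {"documents": 0, "segments": 0, "words": 0, "characters": 0}
    for doc in docs:
        text = doc[text_field]
        stats["documents"] += 1
        # Assumption: a segment is a non-empty newline-separated line.
        stats["segments"] += sum(1 for line in text.split("\n") if line.strip())
        # Assumption: words are whitespace-delimited tokens.
        stats["words"] += len(text.split())
        # Characters are counted as Python code points, newlines included.
        stats["characters"] += len(text)
    return stats


# Tiny stand-in records; real rows would come from the Parquet files.
sample = [
    {"text": "Hello world.\nSecond segment here."},
    {"text": "One more document."},
]
print(text_stats(sample))
```

To apply the same tally to the actual data, one option is to stream a single language with the Hugging Face `datasets` library, for example `load_dataset("HPLT/HPLT2.0_cleaned", "<language config>", split="train", streaming=True)`, and pass the resulting iterable to `text_stats`. The config name and the presence of a `text` column should be checked against the dataset card before relying on them.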

Provider:

HPLT
