
HPLT/HPLT2.0_cleaned

Source: Hugging Face. Updated 2025-11-13; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/HPLT/HPLT2.0_cleaned
Description:
The cleaned variant of HPLT Datasets v2.0 is a large-scale collection of web-crawled documents in 191 world languages, sourced primarily from the Internet Archive and Common Crawl. It is part of the HPLT project and has been converted to Parquet format. The dataset is tagged for the fill-mask and text-generation tasks, with a focus on language modeling. The README includes a table that lists, for each language, its language code and the amount of text measured in segments, words, characters, and documents.
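
To make the four per-language measures concrete, here is a minimal Python sketch of how such a table could be aggregated from (language code, text) pairs. It is not the HPLT project's own counting procedure: the function name `text_stats`, the sample language codes, and the definitions used (a segment as a non-empty line, a word as a whitespace-separated token) are all assumptions made for illustration.

```python
from collections import defaultdict


def text_stats(docs):
    """Aggregate per-language counts from (lang, text) pairs.

    Hypothetical helper; the counting rules below are assumptions,
    not the definitions used by the HPLT README.
    """
    stats = defaultdict(
        lambda: {"segments": 0, "words": 0, "characters": 0, "documents": 0}
    )
    for lang, text in docs:
        row = stats[lang]
        # Assumption: a segment is a non-empty newline-delimited line.
        row["segments"] += sum(1 for line in text.split("\n") if line.strip())
        # Assumption: words are whitespace-separated tokens.
        row["words"] += len(text.split())
        # Characters are counted as Unicode code points, newlines included.
        row["characters"] += len(text)
        row["documents"] += 1
    return dict(stats)


# Illustrative input; the language codes are placeholders.
sample = [
    ("eng_Latn", "Hello world.\nSecond line here."),
    ("eng_Latn", "One more document."),
    ("fra_Latn", "Bonjour le monde."),
]
stats = text_stats(sample)
for lang, row in sorted(stats.items()):
    print(lang, row)
```

Whitespace tokenization undercounts words in languages written without spaces, so a real pipeline would likely need language-aware segmentation.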
Provider:
HPLT