HPLT/HPLT2.0_cleaned
Source: Hugging Face | Updated: 2025-11-13 | Indexed: 2024-12-14
Download link:
https://hf-mirror.com/datasets/HPLT/HPLT2.0_cleaned
Description:
The cleaned variant of HPLT Datasets v2.0 is a large-scale collection of web-crawled documents in 191 world languages, sourced primarily from the Internet Archive and Common Crawl. It is part of the HPLT project and has been converted to Parquet format. The dataset is tagged for the fill-mask and text-generation tasks, with a focus on language modeling. The README also includes a table that lists, for each language code, the amount of text in segments, words, characters, and documents.
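As a rough illustration of how the Parquet data and the per-language statistics might be used, the sketch below streams one language with the Hugging Face `datasets` library and tallies documents, segments, words, and characters. Several details are assumptions not stated in the description above: the configuration name `eng_Latn`, the `train` split, the `text` field name, and the choice to treat non-empty lines as segments and whitespace-separated tokens as words. The README's own counts may be computed differently.

```python
from typing import Dict, Iterable, Iterator


def stream_language(config: str = "eng_Latn") -> Iterator[dict]:
    """Lazily iterate over the documents of one language.

    Assumptions: the `datasets` library is installed, `config` is a valid
    per-language configuration name, and the data lives in a "train" split.
    """
    # Imported inside the function so text_stats() stays usable offline.
    from datasets import load_dataset

    dataset = load_dataset(
        "HPLT/HPLT2.0_cleaned",
        name=config,
        split="train",
        streaming=True,  # avoids downloading the whole language up front
    )
    return iter(dataset)


def text_stats(docs: Iterable[dict], text_field: str = "text") -> Dict[str, int]:
    """Tally the four quantities named in the description.

    Assumptions: each record stores its body under `text_field`, a segment is
    a non-empty line, and a word is a whitespace-separated token.
    """
    stats = {"documents": 0, "segments": 0, "words": 0, "characters": 0}
    for doc in docs:
        text = doc[text_field]
        stats["documents"] += 1
        stats["segments"] += sum(1 for line in text.split("\n") if line.strip())
        stats["words"] += len(text.split())
        stats["characters"] += len(text)
    return stats
```

To sample a language without reading it in full, one option is `text_stats(itertools.islice(stream_language("eng_Latn"), 1000))`, which inspects only the first 1000 streamed documents.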
Provided by:
HPLT



