HPLT/HPLT3.0
Source: Hugging Face. Updated 2025-11-14; indexed 2025-11-15.
Download link:
https://hf-mirror.com/datasets/HPLT/HPLT3.0
Description:
A large-scale collection of web-crawled documents in 198 languages, provided by the HPLT project. The data is derived from Internet Archive and Common Crawl web crawls. The dataset itself is not hosted on Hugging Face, so it must be obtained by following the project's own download instructions. Documents are organized by quality and come with rich annotations and metadata. The dataset card also describes the project's licensing, data processing, and funding details.
Provider:
HPLT



