jobs-git/HPLT2.0_cleaned
Source: Hugging Face. Updated 2025-03-07; indexed 2025-08-30.
Download link:
https://hf-mirror.com/datasets/jobs-git/HPLT2.0_cleaned
Description:
This is a large-scale collection of web-crawled documents in 191 world languages, produced by the HPLT project. The data comes mostly from the Internet Archive, with some additions from Common Crawl. This dataset is the cleaned variant of HPLT Datasets v2.0, converted to the Parquet format. The Hugging Face team has compared the utility of various multilingual corpora for training large language models and found that the HPLT v2 datasets perform on par with FineWeb 2, and even better for some languages. Internal evaluation also confirms that the HPLT v2 datasets are of much higher quality than the HPLT v1.2 datasets.
Provider:
jobs-git



