Eurolingua/HPLT3-198-500k
Source: Hugging Face. Updated 2025-11-10. Indexed 2025-11-15.
Download link:
https://hf-mirror.com/datasets/Eurolingua/HPLT3-198-500k
Description:
HPLT3 Multilingual JSONL (Subset) is a multilingual dataset built from a subset of HPLT3-style web extractions. Each language-script code has its own JSON Lines file, named <language code>_<script>.jsonl, and each line of a file is one document. The dataset contains 51,366,154 documents in total. Languages are grouped by document count into five classes: 500k, 100k-499k, 10k-99k, 1k-9k, and <1k.
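As a rough illustration of the layout described above, the following Python sketch counts the documents in one file and assigns it to a size class. The function names are hypothetical, the per-document fields are not inspected because the listing does not describe them, and the exact class boundaries (for example, treating 500,000 or more documents as the "500k" class) are assumptions rather than something the listing states.

```python
import json
from pathlib import Path


def parse_name(path):
    """Split a '<language code>_<script>.jsonl' file name into (language, script)."""
    language, _, script = Path(path).stem.partition("_")
    return language, script


def count_documents(path):
    """Count documents in a JSON Lines file: one JSON value per non-empty line."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError if the line is not valid JSON
                count += 1
    return count


def size_bucket(n_docs):
    """Map a document count to a size class (boundaries are assumed, not documented)."""
    if n_docs >= 500_000:
        return "500k"
    if n_docs >= 100_000:
        return "100k-499k"
    if n_docs >= 10_000:
        return "10k-99k"
    if n_docs >= 1_000:
        return "1k-9k"
    return "<1k"
```

For a downloaded file, `size_bucket(count_documents(path))` gives its class, and `parse_name(path)` recovers the language code and script from the file name.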
Provider:
Eurolingua



