Eurolingua/hplt3_edu_scores
Hugging Face | Updated 2026-03-23 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/Eurolingua/hplt3_edu_scores
Resource description:
HPLT3-JQL-Education is a model-annotated language subset of HPLT3 spanning 36 languages. The model annotations allow filtering for educational samples that achieves higher-quality training outcomes without excessively aggressive data reduction. The dataset was created from scores assigned by a deep learning classifier trained to identify educational samples using Snowflake's Arctic-embed-m-v2.0 embeddings. It includes quality scores from the Gemma-, Llama-, and Mistral-based Snowflake classifiers, along with the original HPLT3 document IDs. The data originates from web content collected between 2012 and 2024 and may contain personally identifiable information.
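As a rough illustration of the score-based filtering described above, the sketch below keeps a document when the mean of its three classifier scores reaches a threshold. The column names (`document_id`, `score_gemma`, `score_llama`, `score_mistral`), the score values, and the threshold of 2.0 are all assumptions made for the example; the actual schema and score range of the dataset are not stated here and should be checked before use.

```python
from statistics import mean

# Hypothetical column names: the real dataset schema may differ.
SCORE_COLUMNS = ("score_gemma", "score_llama", "score_mistral")


def keep_document(row, threshold=2.0):
    """Return True if the mean of the three classifier scores meets the threshold."""
    return mean(row[col] for col in SCORE_COLUMNS) >= threshold


def filter_ids(rows, threshold=2.0):
    """Collect the document IDs of rows that pass the score filter."""
    return [row["document_id"] for row in rows if keep_document(row, threshold)]


# Made-up rows for illustration only; these are not values from the dataset.
rows = [
    {"document_id": "doc-a", "score_gemma": 3.1, "score_llama": 2.8, "score_mistral": 3.4},
    {"document_id": "doc-b", "score_gemma": 0.4, "score_llama": 0.9, "score_mistral": 0.2},
    {"document_id": "doc-c", "score_gemma": 2.0, "score_llama": 1.5, "score_mistral": 2.6},
]

kept = filter_ids(rows)
print(kept)
```

Averaging the three scores is only one possible aggregation; requiring every classifier to agree, or taking the minimum, would give a stricter filter. The retained IDs could then be joined against HPLT3 to retrieve the document text.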
Provider:
Eurolingua



