lapa-llm/pretraining-high-quality
Source: Hugging Face. Updated: 2025-11-13. Indexed: 2025-11-15.
Download link:
https://hf-mirror.com/datasets/lapa-llm/pretraining-high-quality
Description:
The Lapa High Quality Pretraining Dataset is a high-quality subset of a pretraining corpus for the Ukrainian language. It was filtered using 6 models that assess different quality aspects of the data, including alignment (used to screen out disinformation), grammatical correctness, educational value, manipulativeness, and text coherence. Each record carries attributes such as file name, ID, language, and text, along with multiple score metrics. The dataset aims to strengthen the Ukrainian LLM ecosystem and make language technology more accessible to Ukrainian speakers.
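Because each record is described as carrying several score metrics, a consumer may want to apply stricter thresholds of their own on top of the published filtering. The following is a minimal sketch of that idea, not the maintainers' pipeline. The dataset ID comes from the listing above; the score column name `educational_score`, the threshold values, and the `train` split are hypothetical placeholders, since the listing does not give the actual column or split names. `load_dataset` is the standard entry point of the Hugging Face `datasets` library.

```python
from typing import Any, Iterable, Iterator, Mapping

# Dataset ID taken from the listing above.
DATASET_ID = "lapa-llm/pretraining-high-quality"


def passes_thresholds(record: Mapping[str, Any],
                      min_scores: Mapping[str, float]) -> bool:
    """Return True if every named score column meets its minimum.

    A record that lacks one of the named columns is rejected.
    """
    for column, minimum in min_scores.items():
        value = record.get(column)
        if value is None or value < minimum:
            return False
    return True


def filter_records(records: Iterable[Mapping[str, Any]],
                   min_scores: Mapping[str, float]) -> Iterator[Mapping[str, Any]]:
    """Lazily yield only the records that pass all thresholds."""
    for record in records:
        if passes_thresholds(record, min_scores):
            yield record


def stream_dataset(split: str = "train"):
    """Stream the dataset from the Hugging Face Hub.

    Assumption: a split named "train" exists; check the dataset card.
    Requires `pip install datasets` and network access.
    """
    from datasets import load_dataset
    return load_dataset(DATASET_ID, split=split, streaming=True)


if __name__ == "__main__":
    # In-memory demo with made-up rows and a hypothetical column name.
    demo_rows = [
        {"id": "a", "text": "...", "educational_score": 0.9},
        {"id": "b", "text": "...", "educational_score": 0.2},
    ]
    for row in filter_records(demo_rows, {"educational_score": 0.5}):
        print(row["id"])
```

To run this against the real data, pass `stream_dataset()` as the first argument to `filter_records` and replace the placeholder column names with the ones shown on the dataset card.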
Provided by:
lapa-llm



