lapa-llm/pretraining-lower-quality
Source: Hugging Face. Updated 2025-11-13. Indexed 2025-11-15.
Download link:
https://hf-mirror.com/datasets/lapa-llm/pretraining-lower-quality
Description:
The Lapa Pretraining Lower Quality Dataset is a quality-filtered subset of a pretraining corpus for the Ukrainian language. Its quality is high, though lower than that of the Lapa Pretraining High Quality Dataset. The data was filtered using six models that measure different aspects of text quality, including disinformation, grammatical correctness, educational value, manipulativeness, and text coherence. The dataset consists of the data in CDF bins 12 to 18 and includes a composite quality metric called formula-score. Its sources are Kobza, FinePDFs, FineWeb, and UberText.
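As a rough illustration of what "CDF bins 12 to 18" implies for selection, the sketch below keeps only records whose bin index falls in that range. It is a minimal sketch under stated assumptions, not the authors' pipeline: the field name `cdf_bin`, the inclusive treatment of both endpoints, and every sample value are hypothetical, and `formula-score` is used as a key only because the description names the metric that way.

```python
from typing import Iterable, Iterator

# Bin range taken from the dataset description; treating both ends as
# inclusive is an assumption.
LOW_BIN, HIGH_BIN = 12, 18


def in_lower_quality_range(record: dict, bin_field: str = "cdf_bin") -> bool:
    """Return True if the record's bin index lies within LOW_BIN..HIGH_BIN.

    `cdf_bin` is a hypothetical field name; the real column may differ.
    """
    return LOW_BIN <= record[bin_field] <= HIGH_BIN


def select_records(records: Iterable[dict]) -> Iterator[dict]:
    """Lazily yield only the records whose bin is in the target range."""
    return (r for r in records if in_lower_quality_range(r))


# Made-up sample records, for illustration only.
sample = [
    {"text": "a", "cdf_bin": 11, "formula-score": 0.41},
    {"text": "b", "cdf_bin": 12, "formula-score": 0.55},
    {"text": "c", "cdf_bin": 18, "formula-score": 0.83},
    {"text": "d", "cdf_bin": 19, "formula-score": 0.91},
]

selected = list(select_records(sample))
print([r["text"] for r in selected])
```

Using a generator keeps the filter usable on a streamed corpus, where materializing every record in memory would be impractical.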
Provider:
lapa-llm



