lapa-llm/classifier_source

Name: lapa-llm/classifier_source
Creator: lapa-llm
Published: 2025-11-13 17:30:44
License: No description available

Source: Hugging Face (updated 2025-11-13, indexed 2025-11-15)

Download link:

https://hf-mirror.com/datasets/lapa-llm/classifier_source

Description:

This dataset is a random sample drawn from two datasets, https://huggingface.co/datasets/lapa-llm/pretraining-lower-quality and https://huggingface.co/datasets/lapa-llm/pretraining-high-quality, and is intended for transferring classifiers from English to Ukrainian. It was used to transfer the following models in the collection https://huggingface.co/collections/lapa-llm/lapa-v012-pretraining: lapa-llm/fineweb-nemotron-edu-score, lapa-llm/fineweb-mixtral-edu-score, and lapa-llm/fasttext-quality-score. The data is sourced from Kobza, FinePDFs, FineWeb, and UberText. The aim is to strengthen the Ukrainian-language LLM ecosystem and improve the accessibility of language technology for Ukrainian speakers.
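For readers who want to inspect the sample described above, the following is a minimal sketch, not an official loading recipe. It assumes the Hugging Face `datasets` library is installed and that the repository exposes a single default configuration; the helper names `dataset_url` and `peek` are hypothetical and introduced only for illustration. Split names and column names are not stated in this listing, so the code discovers them at run time instead of hard-coding them.

```python
from itertools import islice

# Dataset id taken from the listing; everything else below is illustrative.
REPO_ID = "lapa-llm/classifier_source"


def dataset_url(repo_id: str = REPO_ID, endpoint: str = "https://huggingface.co") -> str:
    """Build the dataset page URL for a given hub endpoint (hypothetical helper)."""
    return f"{endpoint.rstrip('/')}/datasets/{repo_id}"


def peek(repo_id: str = REPO_ID, n: int = 3) -> dict:
    """Return up to n records per split without downloading the full dataset.

    Assumes the repository has one default configuration; if it has several,
    load_dataset will ask for an explicit configuration name.
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # With streaming=True and no split argument, the result maps each
    # split name to an iterable dataset.
    splits = load_dataset(repo_id, streaming=True)
    return {name: list(islice(ds, n)) for name, ds in splits.items()}
```

Calling `peek()` requires network access and returns a dictionary mapping each split name to its first few records, which is enough to see which fields the sample carries. `dataset_url(endpoint="https://hf-mirror.com")` reproduces the mirror link given in this listing.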

Provider:

lapa-llm
