lapa-llm/classifier_source
Hugging Face · Updated 2025-11-13 · Indexed 2025-11-15
Download link:
https://hf-mirror.com/datasets/lapa-llm/classifier_source
Description:
This dataset is a random sample drawn from both https://huggingface.co/datasets/lapa-llm/pretraining-lower-quality and https://huggingface.co/datasets/lapa-llm/pretraining-high-quality, assembled to transfer quality classifiers from English to Ukrainian. It was used to transfer the following models from the lapa-v012-pretraining collection (https://huggingface.co/collections/lapa-llm/lapa-v012-pretraining): lapa-llm/fineweb-nemotron-edu-score, lapa-llm/fineweb-mixtral-edu-score, and lapa-llm/fasttext-quality-score. The underlying data is sourced from Kobza, FinePDFs, FineWeb, and UberText. The aim is to strengthen the Ukrainian-language LLM ecosystem and improve the accessibility of language technology for Ukrainian speakers.
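As a rough illustration of what "a random sample of both" pools can look like, here is a minimal sketch in plain Python. It is not the authors' procedure: the function name `sample_mixed`, the `source` labels, the per-pool sample size, and the seed are all assumptions made for the example, and the toy lists stand in for the real lower-quality and high-quality corpora.

```python
import random


def sample_mixed(lower_quality, high_quality, n_per_source, seed=0):
    """Draw n_per_source random documents from each pool and tag their origin.

    Hypothetical helper for illustration only; the real dataset may have
    been sampled with different sizes, ratios, or tooling.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    picked = [
        {"text": text, "source": "lower-quality"}
        for text in rng.sample(lower_quality, n_per_source)
    ] + [
        {"text": text, "source": "high-quality"}
        for text in rng.sample(high_quality, n_per_source)
    ]
    rng.shuffle(picked)  # interleave the two pools
    return picked


# Toy stand-ins for the two source corpora (assumed data, not real documents).
lower = [f"low doc {i}" for i in range(100)]
high = [f"high doc {i}" for i in range(100)]

sample = sample_mixed(lower, high, n_per_source=10, seed=42)
print(len(sample))
```

Sampling both quality tiers gives a classifier trained on the result examples from each end of the quality range. The equal split used here is only one possible choice.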
Provided by:
lapa-llm



