Goader/kobza
Source: Hugging Face. Updated 2025-07-28. Indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/Goader/kobza
Description:
Kobza is the largest publicly available Ukrainian corpus to date, comprising nearly 60 billion tokens across 97 million documents. It is designed to support pretraining and fine-tuning of large language models (LLMs) in Ukrainian, as well as in multilingual settings where Ukrainian is underrepresented. The corpus aggregates high-quality Ukrainian text from a wide range of web sources and applies rigorous deduplication steps to ensure high utility for language modeling tasks. Each document carries metadata such as source, subsource, timestamp, and URL to allow flexible filtering.
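The per-document metadata (source, subsource, timestamp, URL) is what makes filtering possible. The following is a minimal sketch of such a filter, not the dataset's official tooling. The field names `source`, `subsource`, `timestamp`, `url`, and `text`, the ISO 8601 timestamp format, and the sample records are all assumptions made for illustration; check the actual schema on the dataset page before relying on them.

```python
from datetime import datetime
from typing import Iterable, Iterator, Optional


def filter_documents(
    records: Iterable[dict],
    source: Optional[str] = None,
    since: Optional[datetime] = None,
) -> Iterator[dict]:
    """Yield records matching an optional source name and minimum timestamp.

    Field names ("source", "timestamp") are assumed, not confirmed by the
    dataset card. Timestamps are assumed to be ISO 8601 strings.
    """
    for record in records:
        # Skip records from other sources when a source filter is given.
        if source is not None and record.get("source") != source:
            continue
        if since is not None:
            raw = record.get("timestamp")
            # Skip records with no timestamp or one older than the cutoff.
            if raw is None or datetime.fromisoformat(raw) < since:
                continue
        yield record


# Hypothetical records that mimic the assumed schema.
sample = [
    {"text": "doc one", "source": "web_a", "subsource": "news",
     "timestamp": "2021-03-01T10:00:00", "url": "https://example.com/1"},
    {"text": "doc two", "source": "web_a", "subsource": "blogs",
     "timestamp": "2023-06-15T08:30:00", "url": "https://example.com/2"},
    {"text": "doc three", "source": "web_b", "subsource": "news",
     "timestamp": "2023-09-01T00:00:00", "url": "https://example.com/3"},
]

recent_web_a = list(
    filter_documents(sample, source="web_a", since=datetime(2022, 1, 1))
)
```

Because the function accepts any iterable of dictionaries and yields lazily, the same filter could in principle be applied to a streamed copy of the corpus rather than an in-memory list, which matters at 97 million documents.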
Provider:
Goader



