Serbian General Text Corpus

Name: Serbian General Text Corpus (塞尔维亚语通用文本语料库)
Creator: 上海库帕思科技有限公司
Published: 2026-04-28 20:02:39
License: not specified

Source: 国家数据集管理服务平台; updated 2026-04-28; indexed 2026-04-29

Download link:

https://www.ndsms.cn/dataRetrieval/datasetDetail/?id=fec6041ab8ad1d2b1237b006213fc172

Description:

This dataset is intended for multilingual AI projects that need to cover the languages of the Balkans, and it provides training data for Serbian, a low-resource language. It contains 51 million Serbian text entries covering everyday expressions and basic domain documents. Although its scale is relatively limited, the dataset focuses on Serbian's dual Cyrillic/Latin writing system and its masculine and feminine inflections. After dedicated cleaning and alignment, it can serve as a core corpus for incremental training of multilingual models or for task-specific fine-tuning, improving a model's basic understanding of the language.
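Because the corpus mixes Cyrillic and Latin text, a consumer may want to tag each entry by script before training. The following is a minimal sketch of one way to do that, not the provider's pipeline: the listing does not describe the file format, so the function `dominant_script` and its `threshold` parameter are hypothetical names, and entries are assumed to arrive as plain Python strings.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str, threshold: float = 0.9) -> str:
    """Classify a text entry as 'cyrillic', 'latin', 'mixed', or 'other'.

    Hypothetical helper: counts alphabetic characters by the script named
    in their Unicode character name and returns the script that accounts
    for at least `threshold` of them.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
        else:
            counts["other"] += 1

    total = sum(counts.values())
    if total == 0:
        return "other"  # no alphabetic characters at all

    script, n = counts.most_common(1)[0]
    return script if n / total >= threshold else "mixed"


if __name__ == "__main__":
    for sample in ["Добар дан", "Dobar dan, kako ste?", "Добар dan"]:
        print(sample, "->", dominant_script(sample))
```

The 0.9 threshold is an arbitrary illustrative choice; entries that fall below it are labeled `mixed` so they can be inspected or transliterated separately.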

Provider:

上海库帕思科技有限公司

Created:

2026-04-27

Collected summary

Dataset introduction

Background and challenges

Background overview

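This dataset is a general-purpose Serbian text corpus designed for multilingual AI projects. It provides 51 million training text entries for Serbian, a low-resource language, covering everyday expressions and basic domain documents. It focuses on Serbian's dual Cyrillic/Latin writing system and its masculine and feminine inflections. After dedicated cleaning and alignment, it can serve as a core corpus for incremental training of multilingual models or for task-specific fine-tuning, improving a model's basic understanding of the language.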

The content above was collected and summarized by 遇见数据集.