Vietnamese General Text Corpus (越南语通用文本语料库)

Name: Vietnamese General Text Corpus (越南语通用文本语料库)
Creator: 上海库帕思科技有限公司
Published: 2026-04-28 20:02:39
License: Not specified

Source: 国家数据集管理服务平台 (National Dataset Management Service Platform). Updated 2026-04-28; indexed 2026-04-29.

Download link:

https://www.ndsms.cn/dataRetrieval/datasetDetail/?id=78f1f6e4c7381d719fa943268c84ed80



Resource overview:

This dataset targets the training and iteration of Vietnamese large language models (LLMs). It provides up to 1.2 billion Vietnamese text entries, which the provider describes as one of the largest Vietnamese training corpora currently available. According to the provider, this volume can support training Vietnamese-specific models in the 1 billion to 10 billion parameter range and improves their stability in long-text generation, multi-turn dialogue, and domain transfer. The data processing pipeline is designed to preserve Vietnamese tone marks and compound-word boundaries, so that character-level information is not lost during pre-training.
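The listing does not describe how tone marks are preserved. As a rough illustration of why it matters, the following sketch contrasts Unicode NFC normalization (which keeps every diacritic) with a lossy ASCII-folding step. The function names `normalize_vi` and `strip_marks` are hypothetical and are not taken from the provider's pipeline; only Python's standard `unicodedata` module is used.

```python
import unicodedata


def normalize_vi(text: str) -> str:
    """Hypothetical helper: normalize to NFC so that each toned vowel has
    one consistent code-point sequence. No diacritic is removed."""
    return unicodedata.normalize("NFC", text)


def strip_marks(text: str) -> str:
    """Lossy baseline for comparison: decompose to NFD, then drop every
    combining mark. This is the kind of step that destroys tone information."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Six distinct Vietnamese words that differ only in their tone mark.
words = ["ma", "má", "mà", "mả", "mã", "mạ"]

preserved = {normalize_vi(w) for w in words}
folded = {strip_marks(w) for w in words}

# All six words survive NFC; folding collapses them into a single string.
print(len(preserved), len(folded))  # prints: 6 1
```

NFC also makes a precomposed vowel and its base-letter-plus-combining-mark form compare equal, which avoids spurious duplicates when texts from different sources are merged.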

Provider:

上海库帕思科技有限公司

Created:

2026-04-27

Collected summary

Dataset introduction

Background and challenges

Background overview

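This dataset is a large-scale corpus built specifically for training Vietnamese large language models. It provides up to 1.2 billion Vietnamese text entries, is described as one of the largest Vietnamese training corpora currently available, and can support training models in the 1 billion to 10 billion parameter range. Vietnamese tone marks and compound-word boundaries receive dedicated handling, with the aim of improving model performance in long-text generation, multi-turn dialogue, and domain transfer. The corpus is intended for building AI capabilities in vertical industries such as finance, law, and customer service.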

The above content was collected and summarized by 遇见数据集.