afrizalha/Centhini-1-Javanese
Source: Hugging Face | Updated: 2025-02-08 | Indexed: 2025-02-15
Download link:
https://hf-mirror.com/datasets/afrizalha/Centhini-1-Javanese
Description:
The Centhini dataset consists of 529,575 pretraining examples for the Javanese language. Most of them are translations generated by DeepSeek V3, including translations of the English FineWeb dataset and paraphrased translations of the Indonesian mC4 dataset. The dataset also includes ancient Javanese texts such as Serat Centhini and Babad Tanah Djawi, as well as open texts such as the Javanese Wikipedia. It is currently the largest free and open-source aggregated dataset for the Javanese language, containing over 54 million words. Diacritical marks on the letter e have been removed for consistency and to facilitate model learning.
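The diacritic removal mentioned above can be illustrated with a minimal Python sketch. This is not the dataset author's preprocessing code. The function name `strip_e_diacritics` is hypothetical, and the sketch assumes that "diacritical marks on the letter e" means any combining mark attached to `e` or `E` (for example é, è, ê), while marks on other letters are left untouched.

```python
import unicodedata


def strip_e_diacritics(text: str) -> str:
    """Remove combining marks attached to 'e'/'E' only (assumed behavior)."""
    out = []
    prev_base = ""  # most recent non-combining character seen
    # NFD splits accented letters into base letter + combining mark(s).
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Drop the mark only when it decorates an 'e' or 'E'.
            if prev_base in ("e", "E"):
                continue
            out.append(ch)
        else:
            prev_base = ch
            out.append(ch)
    # Recompose so untouched accented letters return to single code points.
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    print(strip_e_diacritics("saté lan kèlèk"))
```

Applying the same normalization to any text that is mixed with this dataset, or passed as input to a model trained on it, would keep the spelling of e consistent with the corpus.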
Provider:
afrizalha



