iryneko571/CCMatrix-v1-Ja_Zh-fused
Hugging Face, updated 2024-07-01, indexed 2024-07-06
Download link:
https://hf-mirror.com/datasets/iryneko571/CCMatrix-v1-Ja_Zh-fused
Description:
This dataset is derived from larryvrh/CCMatrix-v1-Ja_Zh-filtered and was modified for training a new mt5-base model on Japanese-Chinese translation. The changes target problems the model ran into during translation: 1) some short sentences were merged into longer ones, to address repetition or early stopping in the translation model when dealing with short sentences; 2) data in which the non-word content or numbers do not match between the two sides was filtered out; 3) description-like samples containing untranslated Japanese characters were removed. These modifications aim to improve translation quality and the model's ability to handle longer sentences.
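The listing does not give the exact rules behind these three steps, so the following Python sketch is only one plausible reading of them, not the author's pipeline. All function names, the `min_chars` threshold, the use of digit runs as the "non-word" check, and the use of kana on the Chinese side as the signal for untranslated Japanese are assumptions made for illustration.

```python
import re
import unicodedata

# Hiragana and katakana letters. Assumption: kana on the Chinese side
# indicates text that was left untranslated.
KANA = re.compile(r"[\u3041-\u3096\u30a1-\u30fa]")
# Digit runs, used here as a stand-in for the "non-word" content to compare.
DIGITS = re.compile(r"[0-9]+")


def numbers_match(ja: str, zh: str) -> bool:
    """True when both sides contain the same multiset of digit runs."""
    def digit_runs(text: str) -> list:
        # NFKC folds full-width digits into ASCII digits before matching.
        return sorted(DIGITS.findall(unicodedata.normalize("NFKC", text)))
    return digit_runs(ja) == digit_runs(zh)


def has_kana(text: str) -> bool:
    """True when the text contains at least one hiragana or katakana letter."""
    return KANA.search(text) is not None


def keep_pair(ja: str, zh: str) -> bool:
    """Keep a pair only if numbers agree and the Chinese side has no kana."""
    return numbers_match(ja, zh) and not has_kana(zh)


def fuse_short_pairs(pairs, min_chars: int = 20):
    """Concatenate consecutive pairs until the Japanese side has min_chars characters."""
    fused = []
    buf_ja, buf_zh = "", ""
    for ja, zh in pairs:
        buf_ja += ja
        buf_zh += zh
        if len(buf_ja) >= min_chars:
            fused.append((buf_ja, buf_zh))
            buf_ja, buf_zh = "", ""
    if buf_ja:
        # Flush whatever is left, even if it is still shorter than min_chars.
        fused.append((buf_ja, buf_zh))
    return fused
```

In this sketch, filtering would run first (`[p for p in pairs if keep_pair(*p)]`) and fusion second, so that a rejected pair is never glued onto a good one. Whether the original dataset was built in that order is not stated in the source.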
Provider:
iryneko571



