proxectonos/Finetuning-MT

Source: Hugging Face. Updated 2025-12-17. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/proxectonos/Finetuning-MT
Description:
This dataset uses a multilingual design intended to strengthen a model's grammatical understanding of the Galician-Portuguese system. In its initial sections, Galician and Portuguese serve as pivot languages connecting the other Iberian languages and English, a structure that prioritizes learning Galician-Portuguese linguistic nuances over other language pairs. The final section adds instruction-tuning datasets for translation-related tasks (such as post-editing, gender evaluation, and entity recognition) that were originally unavailable in Galician. To bridge this gap, the Galician materials were generated through a pipeline that combines lexical adaptation and localization from Portuguese, the Apertium symbolic translator, and a final normalization post-editing phase intended to yield high-quality, idiomatic Galician. The dataset also includes paragraphs and sentences from the MT corpus, as well as datasets from TowerBlocks.
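
As a rough illustration of the pivot structure described above, the sketch below enumerates translation directions in which at least one side is Galician or Portuguese. The language codes, the exact set of "other" languages, and the assumption that both directions are present are hypothetical; the description does not specify them, and `pivot_pairs` is an illustrative helper, not part of the dataset.

```python
from itertools import product

# Hypothetical language codes; the description does not list the exact set.
PIVOTS = ["gl", "pt"]                # Galician, Portuguese
OTHERS = ["es", "ca", "eu", "en"]    # assumed: Spanish, Catalan, Basque, English


def pivot_pairs(pivots, others):
    """Return directed (source, target) pairs where at least one side is a pivot."""
    pairs = []
    # Directions between the pivot languages themselves.
    for a, b in product(pivots, repeat=2):
        if a != b:
            pairs.append((a, b))
    # Directions between each pivot and each non-pivot language
    # (both directions are an assumption of this sketch).
    for p, o in product(pivots, others):
        pairs.append((p, o))
        pairs.append((o, p))
    return pairs


pairs = pivot_pairs(PIVOTS, OTHERS)
for src, tgt in pairs:
    print(f"{src} -> {tgt}")
```

Under these assumptions, no direction pairs two non-pivot languages (for example Spanish and English), which is one way to read the stated priority on Galician-Portuguese over other language pairs.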
Provider:
proxectonos