
FrancophonIA/gatitos

Source: Hugging Face. Updated 2025-03-30; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/FrancophonIA/gatitos
Overview:
The GATITOS (Google's Additional Translations Into Tail-languages: Often Short) dataset is a high-quality, multi-way parallel dataset of tokens and short phrases, intended for training and improving machine translation models. It consists of 4,000 English segments, each translated into 173 languages: 170 low-resource languages and 3 mid-to-high-resource languages (Spanish, French, and Hindi). The segments are short: 93% are single tokens, and only 0.6% have more than 5 tokens. The dataset is therefore better suited as a multilingual lexicon than as a parallel training corpus.

The source text consists of frequent English words, along with some common phrases and short sentences. It also gives good coverage of numbers, months, days of the week, Swadesh words, and the names of the languages themselves (including their endonyms). Some words carry annotations and definitions, reflecting how hard it is to translate a single polysemous token without context.
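The length figures quoted above (the share of single-token segments and the share longer than 5 tokens) can be checked on any list of segments. The following is a minimal sketch, not the dataset authors' method: it assumes whitespace tokenization, and `length_profile` and `toy_segments` are hypothetical names and data introduced only for illustration, not taken from GATITOS.

```python
def length_profile(segments):
    """Return (share of single-token segments, share with more than 5 tokens).

    Assumption: tokens are whitespace-separated. The source does not say
    how tokens were counted, so real figures may differ.
    """
    lengths = [len(segment.split()) for segment in segments]
    total = len(lengths)
    if total == 0:
        return 0.0, 0.0
    single_share = sum(1 for n in lengths if n == 1) / total
    long_share = sum(1 for n in lengths if n > 5) / total
    return single_share, long_share


# Hypothetical toy data, NOT drawn from the dataset.
toy_segments = [
    "water",
    "Monday",
    "seven",
    "good morning",
    "where is the nearest train station today",
]

single_share, long_share = length_profile(toy_segments)
print(f"single-token share: {single_share:.1%}")
print(f"more than 5 tokens: {long_share:.1%}")
```

Running the same function over the English source column of the real data would be one way to sanity-check the 93% and 0.6% figures, subject to the tokenization assumption.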
Provided by:
FrancophonIA