English-Tamarian Parallel Corpus
Source: arXiv. Updated: 2022-10-15. Indexed: 2024-06-21.
Download link:
https://github.com/cognitiveailab/darmok
Description:
The English-Tamarian Parallel Corpus is a dataset built for research on Tamarian, a fictional language from Star Trek. The work is led by Peter A. Jansen of the University of Arizona. The dataset contains 68 English-to-Tamarian parallel entries drawn from the Star Trek television series and its tie-in novels. The meanings of the Tamarian utterances were extracted from Reddit community discussions and inferred from the context in which the utterances appear in the novels. The dataset is intended primarily for training machine translation models, particularly for translation into metaphor-rich languages, with the goal of addressing the difficulty of translating between a fictional language and a natural one.
Provider:
University of Arizona
Created:
2021-07-17



