English-Tamarian Parallel Corpus

Name: English-Tamarian Parallel Corpus
Creator: University of Arizona
Published: 2022-10-15 04:35:02
License: Not specified

Source: arXiv. Updated: 2022-10-15. Indexed: 2024-06-21.

Download link:

https://github.com/cognitiveailab/darmok

Description:

The English-Tamarian Parallel Corpus is a dataset built for research on Tamarian, a fictional language from Star Trek. Its creation was led by Peter A. Jansen of the University of Arizona. It contains 68 English-to-Tamarian parallel sentence pairs drawn from Star Trek television episodes and their tie-in novels. The meanings of the Tamarian utterances were extracted from Reddit community discussions and inferred with the help of context from the novels. The dataset is intended primarily for training machine translation models, particularly for metaphor-rich languages, and aims to address the challenges of translating between a fictional language and real-world natural languages.

Provided by:

University of Arizona

Created:

2021-07-17
