Griko-Italian speech translation corpus
Source: arXiv | Updated: 2018-07-28 | Indexed: 2024-06-21
Download link:
http://griko.project.uoi.gr
Description:
The Griko-Italian speech translation corpus is a small parallel corpus for the endangered language Griko, created by the Grenoble Informatics Laboratory (LIG). It comprises 330 speech recordings totaling approximately 20 minutes. Each recording is accompanied by an Italian translation and by word-level alignments from the speech signal to both its transcription and its translation. The dataset also includes morphosyntactic tags, word-level annotations, and pseudo-phones generated by an automatic unit discovery method. The corpus is intended to support computational language documentation research, particularly zero-resource tasks such as speech-to-translation alignment and unsupervised word discovery.
Provided by:
Grenoble Informatics Laboratory, Université Grenoble Alpes, France
Created:
2018-07-28



