KONTATTO v1.0

Source: hdl.handle.net (indexed 2025-01-15)
Download link:
http://hdl.handle.net/20.500.12124/78
Description:
Kontatto is a corpus of transcribed and annotated spoken data collected by Silvia Dal Negro at the Free University of Bozen/Bolzano. It consists of almost 150,000 orthographic words across 55 recordings, involving 97 different speakers for a total of 18 hours of speech. The corpus is multilingual and contains a variety of spontaneously occurring code-mixing patterns. However, the language distribution is uneven: 80.4% of the words are Tyrolean, 11.5% are Italian, 2.6% were classified as Trentino, 0.8% belong to other languages (e.g. Ladin, English), and the remaining 4.7% cannot be confidently attributed to any particular language (e.g. proper names, widespread loanwords, some interjections).

This repository contains the Kontatto-MT corpus subset. The data was collected using a collaborative Map Task, during which two speakers and an interviewer interacted to navigate a physical map in order to reach a given destination. This subcorpus documents a variety of languages and dialects of the Dolomite region, including (some) Tyrolean and Trentino dialects, Italian, Cimbrian and Ladin, usually combined in the same dialogue. At present it consists of 35,453 tokens, of which 73% are classified as local German dialect.

Kontatto was created within the scope of two projects financed by the Autonomous Province of Bozen-Bolzano: “Italiano-tedesco: aree storiche di contatto in Sudtirolo e Trentino” (2011-2014) and “Germanico-Romanzo: discorsi e strutture in contatto nell’area dolomitica” (2016-2019). Over the years, many research assistants and students have contributed to the annotation of the data: Katrin Tartarotti, Mara Leonardi, Marta Ghilardi, Nicole Giaier, Adriana Rasa, Lucia Rossaro, Luigi Parisi and Jay Hevelone. The CLARIN deposit was prepared by Greta Franzini and Luca Ducceschi of Eurac Research.

Provider:
hdl.handle.net