ZAEBUC-Spoken

arXiv · Updated 2024-03-27 · Indexed 2024-08-06

Download link:

http://arxiv.org/abs/2403.18182v1


Description:

We introduce ZAEBUC-Spoken, a multilingual, multidialectal Arabic-English speech corpus. The corpus comprises 12 hours of Zoom meeting recordings in which speakers role-play a workplace scenario: students brainstorm ideas on a given topic and then discuss them with an interlocutor. The meetings cover different topics and are divided into phases with different language settings. The corpus is a challenging set for automatic speech recognition (ASR) because it spans two languages, Arabic and English. Arabic is spoken in several varieties (Modern Standard Arabic, Gulf Arabic, and Egyptian Arabic), and English is spoken with a range of accents. The corpus also contains code-switching between these languages and dialects. As part of this work, we draw on established transcription guidelines to propose a set of guidelines that handle conversational speech, code-switching, and the orthography of both languages. We further enrich the corpus with two layers of annotation: (1) dialectness-level annotation for the portions of the corpus where different Arabic varieties are mixed, and (2) automatic morphological annotation, including tokenization, lemmatization, and part-of-speech tagging.

Created:

2024-03-27
