Gulf Arabic Conversational Telephone Speech, Transcripts

Name: Gulf Arabic Conversational Telephone Speech, Transcripts
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:19:04
License: No description available

DataCite Commons. Updated: 2021-07-01. Indexed: 2025-04-16.

Download link:

https://catalog.ldc.upenn.edu/LDC2006T15

Description:

<h3>Introduction</h3>
Gulf Arabic Conversational Telephone Speech, Transcripts was developed by Appen Pty Ltd., Sydney, Australia, and contains transcripts of roughly 2,800 minutes of spontaneous telephone conversations in Colloquial Gulf Arabic. It provides 976 conversation sides from 975 Gulf Arabic speakers (one speaker appears on two distinct calls). The average duration per side is about 5.7 minutes. Appen collected and transcribed the data in 2004. The corresponding speech files for these transcripts are available in <a href="../../../LDC2006S43">Gulf Arabic Conversational Telephone Speech (LDC2006S43)</a>.

<h3>Data</h3>
Each transcript file is a tab-delimited flat table in which each line holds the information and text for a single contiguous utterance, in the following fields:
<ol>
<li>Beginning time stamp in seconds, in square brackets ("[5.7189]")</li>
<li>Ending time stamp in seconds, in square brackets</li>
<li>Channel/speaker-ID ("A:" or "B:")</li>
<li>"Consonant skeleton" orthography for the utterance, in UTF-8</li>
<li>"Diacritized" orthography for the utterance, in ASCII</li>
</ol>
The ASCII field is the Buckwalter transliteration of the fully "vowelized" (pronunciation) form of the utterance. Within fields 4 and 5, word boundaries are marked by space characters in the normal way, following common Arabic orthographic convention (i.e. all definite articles and many conjunctions and prepositions are attached as prefixes to the following word).

Transcript tokens enclosed in single parentheses, e.g. "(DHk)", are annotation marks for non-speech events or conditions, such as laughter or noise. Multi-token strings within single parentheses contain words in some other language (typically English) or some other Arabic dialect. Double parentheses, with or without tokens enclosed within them, e.g. "(())", "((word))", or "((word1 word2))", mark regions where the transcriber could not determine with certainty what was said.

The "consonant skeleton" orthography is intended to reflect common orthographic practice in written Arabic (i.e. Modern Standard Arabic (MSA)), but without being bound strictly by the specific spellings of MSA words. That is, there may be novel (dialect-specific) words, and words that are cognate between MSA and Gulf Arabic may show changes of consonant quality and hence altered spelling.

The "vowelized" orthography is restricted to a character set that allows words to be rendered coherently in Arabic script (with all diacritics present as needed to represent short vowels, etc.), but is intended to reflect the perceived pronunciation of each token. As a result, a given word (type) that occurs several times in the text with identical "skeletal" spellings may have several distinct "vowelized" spellings. In some cases these different spellings simply reflect pronunciation variants; in others they represent distinct morphological forms (with distinct contextual meanings) whose semantic differences are conveyed solely by the short vowels (i.e. the diacritics).

<h3>Samples</h3>
Please view this <a href="desc/addenda/LDC2006T15.txt">transcript sample (TXT)</a>.

<h3>Updates</h3>
None at this time.

Portions © 2006 Trustees of the University of Pennsylvania

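The five-field line layout and the parenthesis conventions described above can be sketched in Python as follows. This is a minimal illustration built only from the description, not a tool shipped with the corpus: the names `Utterance`, `parse_line`, and `classify_tokens`, and the labels `word`, `event`, `foreign`, and `uncertain`, are hypothetical, and the sample line is synthetic rather than taken from the data. Real files may contain lines that deviate from this layout, so the parser raises an error instead of guessing.

```python
import re
from dataclasses import dataclass

# Bracketed time stamp in seconds, e.g. "[5.7189]".
_TIME = re.compile(r"^\[(\d+(?:\.\d+)?)\]$")
# A double-parenthesis span, a single-parenthesis span, or a bare token.
_SPAN = re.compile(r"\(\([^()]*\)\)|\([^()]*\)|[^\s()]+")


@dataclass
class Utterance:
    start: float     # field 1: beginning time stamp, seconds
    end: float       # field 2: ending time stamp, seconds
    channel: str     # field 3: "A" or "B" (trailing colon removed)
    skeleton: str    # field 4: consonant-skeleton orthography (UTF-8)
    vowelized: str   # field 5: Buckwalter transliteration (ASCII)


def parse_line(line: str) -> Utterance:
    """Split one tab-delimited transcript line into its five fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    start, end = _TIME.match(fields[0]), _TIME.match(fields[1])
    if start is None or end is None:
        raise ValueError("time stamps must look like [5.7189]")
    if fields[2] not in ("A:", "B:"):
        raise ValueError(f"unexpected channel field: {fields[2]!r}")
    return Utterance(float(start.group(1)), float(end.group(1)),
                     fields[2][:-1], fields[3], fields[4])


def classify_tokens(text: str) -> list[tuple[str, str]]:
    """Label each token or parenthesized span in field 4 or field 5.

    Assumption: a single-parenthesis span with one token is treated as a
    non-speech event and one with several tokens as other-language or
    other-dialect material, following the description above.
    """
    labelled = []
    for match in _SPAN.finditer(text):
        span = match.group(0)
        if span.startswith("(("):
            kind = "uncertain"
        elif span.startswith("("):
            kind = "foreign" if len(span[1:-1].split()) > 1 else "event"
        else:
            kind = "word"
        labelled.append((kind, span))
    return labelled


# Synthetic example line (not from the corpus), built to match the layout.
sample = "[5.7189]\t[6.25]\tA:\t(ضحك)\t(DHk)\n"
utterance = parse_line(sample)
labels = classify_tokens(utterance.vowelized)
```

Keeping fields 4 and 5 as separate strings preserves the property noted above that one skeletal spelling can map to several vowelized spellings; pairing tokens across the two fields would require a further assumption that both fields contain the same number of space-separated tokens.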

Provider:

Linguistic Data Consortium

Created:

2020-11-30
