Fisher Levantine Arabic Conversational Telephone Speech, Transcripts

Name: Fisher Levantine Arabic Conversational Telephone Speech, Transcripts
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:19:45
License: No description available

Source: DataCite Commons; updated 2021-07-01; indexed 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2007T04

Resource description:

<h3>Introduction</h3>
Levantine Arabic is spoken along the eastern Mediterranean coast from Anatolia to the Sinai Peninsula and encompasses the local dialects of Lebanon, Syria and Palestine. There are two distinct varieties: Northern, centered around Syria and Lebanon, and Southern, spoken in Jordan and Palestine. Northern Levantine Arabic has approximately 8.8 million speakers in Syria and 6 million in Lebanon. Southern Levantine Arabic has approximately 3.5 million speakers in Jordan, 1.6 million in Palestine and nearly one million in Israel.
Fisher Levantine Arabic Conversational Telephone Speech, Transcripts contains transcripts for 279 telephone conversations. The majority of the speakers are from Jordan, Lebanon and Palestine. The corresponding telephone speech is contained in <a href="http://catalog.ldc.upenn.edu/LDC2007S02" rel="nofollow">Fisher Levantine Arabic Conversational Telephone Speech</a>.
<table border="1"> <tbody>
<tr> <td colspan="2">Speaker Distribution by Region</td> </tr>
<tr> <td>Jordan</td> <td>60%</td> </tr>
<tr> <td>Palestine</td> <td>15%</td> </tr>
<tr> <td>Lebanon</td> <td>15%</td> </tr>
<tr> <td>Syria</td> <td>8%</td> </tr>
<tr> <td>Other</td> <td>2%</td> </tr>
</tbody> </table>
The Fisher telephone conversation collection protocol was created at LDC to address a critical need of developers trying to build robust automatic speech recognition (ASR) systems. Previous collection protocols, such as CALLFRIEND and Switchboard-II, and the resulting corpora have been adapted for ASR research but were in fact developed for language identification and speaker identification, respectively. Although the CALLHOME protocol and corpora were developed to support ASR technology, they feature small numbers of speakers making telephone calls of relatively long duration, with narrow vocabulary across the collection. CALLHOME conversations are challengingly natural and intimate.
Under the Fisher protocol, a very large number of participants each make a few short calls to other participants, whom they typically do not know, about assigned topics. This maximizes inter-speaker variation and vocabulary breadth, although it also increases formality. Previous protocols such as CALLHOME, CALLFRIEND and Switchboard relied upon participant activity to drive the collection. Fisher is unique in being platform driven rather than participant driven. Participants who wish to initiate a call may do so; however, the collection platform initiates the majority of calls. Participants need only answer their phones at the times they specified when registering for the study.
To encourage a broad range of vocabulary, Fisher participants are asked to speak on an assigned topic. The topic is selected at random from a list that changes every 24 hours, and it is assigned to all subjects paired on that day. Some topics are inherited or refined from previous Switchboard studies, while others were developed specifically for the Fisher protocol.
<h3>Data</h3>
The transcripts were created with green and yellow layers using LDC's Multi-Dialectal Transcription Tool (AMADAT). The green layer seeks to anchor dialectal forms to similar or related Modern Standard Arabic orthography-based forms. The yellow layer is a more careful and detailed transcription that adds functionally necessary vowels and marks important sociolinguistic variations and morphophonemic features.
The green-layer transcripts in this corpus are a subset of the transcripts contained in <a href="http://catalog.ldc.upenn.edu/LDC2006T07" rel="nofollow">Levantine Arabic QT Training Data Set 5, Transcripts, LDC2006T07</a>. The yellow-layer transcription was added in this release.
<h3>Samples</h3>
For an example of the text contained in this corpus, please view this <a href="desc/addenda/LDC2007T04.jpg" rel="nofollow">image of the transcriptions</a> (JPEG format).
Portions © 2003-2007 Trustees of the University of Pennsylvania

Provider:

Linguistic Data Consortium

Created:

2020-11-30
