
Mandarin-English Code-Switching in South-East Asia

Source: Mendeley Data. Updated 2024-01-31. Indexed 2024-06-28.
Download link:
https://catalog.ldc.upenn.edu/LDC2015S04
Resource description:
Introduction

Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia in Singapore and Malaysia, respectively. It comprises approximately 192 hours of Mandarin-English code-switching speech from 156 speakers, with associated transcripts.

Code-switching refers to the practice of shifting between languages or language varieties during conversation. This corpus focuses on the shift between Mandarin and English by Malaysian and Singaporean speakers. Speakers engaged in unscripted conversations and interviews. In the conversational speech segments, two speakers conversed freely with each other. The interviews consisted of questions from an interviewer and answers from an interviewee; only the interviewee's speech was recorded. Topics discussed include hobbies, friends, and daily activities.

Data

The speakers were gender-balanced (49.7% female, 50.3% male) and between 19 and 33 years of age. Over 60% of the speakers were Singaporean; the rest were Malaysian.

The speech recordings were conducted in a quiet room using several microphones and recording devices. Details about the recording conditions are contained in the documentation provided with this release.

The audio files in this corpus are 16 kHz, 16-bit recordings in FLAC-compressed WAV format, between 20 and 120 minutes in length. Selected segments of the audio recordings were transcribed. Most of those segments contain code-switching utterances. The transcription file for each audio file is stored as a UTF-8 tab-separated text file.

Development and training divisions are available as a separate download (SEAME_train_dev_division.zip) and on the provider's GitHub page.

Samples

Please view this audio sample and transcript sample.

Updates

As of 12/14/2015, an additional set of transcription files was added for all the audio. The new transcriptions are based on the original transcriptions, with the previously untranscribed utterances added. A language label was also added for each utterance in the transcription. File directories were changed to reflect the update; specifically, the change was made under /data/{recording_type}/transcript/{phase_number}/, where:

- {recording_type} is 'conversation' or 'interview'
- {phase_number} is 'phaseI' or 'phaseII'
  +) 'phaseI' contains all the existing transcriptions from the first release
  +) 'phaseII' contains the newly updated transcriptions, in which some typos and wrong boundary markers are corrected. Previously untranscribed segments, which are normally monolingual, and a language label for each segment are added.

The documentation for the corpus was also updated to include a detailed description of the new update in section 3) Transcription.

Portions © 2015 Nanyang Technological University, Universiti Sains Malaysia, Trustees of the University of Pennsylvania
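
The transcript directory layout and file format described above can be illustrated with a minimal Python sketch. It is not part of the corpus tooling. The function names `transcript_dir` and `read_transcript_rows` are hypothetical, and the meaning of each tab-separated column is not stated in this description, so the reader below returns raw fields without interpreting them; the corpus documentation (section 3, Transcription) should be consulted for the actual column layout.

```python
import csv
from pathlib import PurePosixPath

# Allowed values as listed in the corpus description.
RECORDING_TYPES = ("conversation", "interview")
PHASES = ("phaseI", "phaseII")


def transcript_dir(recording_type: str, phase_number: str) -> PurePosixPath:
    """Build /data/{recording_type}/transcript/{phase_number} (hypothetical helper)."""
    if recording_type not in RECORDING_TYPES:
        raise ValueError(f"unknown recording type: {recording_type!r}")
    if phase_number not in PHASES:
        raise ValueError(f"unknown phase: {phase_number!r}")
    return PurePosixPath("/data") / recording_type / "transcript" / phase_number


def read_transcript_rows(path):
    """Read a UTF-8 tab-separated transcript file into a list of field lists.

    Assumption: fields are returned uninterpreted, because the column
    layout is not given in the description above. Blank lines are skipped.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]
```

For example, `transcript_dir("interview", "phaseII")` yields the directory holding the updated interview transcripts. In practice the path would be taken relative to wherever the release is unpacked, which is an assumption outside this description.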

Created:
2024-01-31