CIEMPIESS Balance

Source: DataCite Commons. Updated 2021-07-01. Indexed 2025-04-16.
Download link:
https://catalog.ldc.upenn.edu/LDC2018S11
Description:
<h3>Introduction</h3><br> <p>CIEMPIESS (Corpus de Investigaci&oacute;n en Espa&ntilde;ol de M&eacute;xico del Posgrado de Ingenier&iacute;a El&eacute;ctrica y Servicio Social) Balance was developed by the Development of Speech Technologies program at the <a href="http://www.ingenieria.unam.mx">School of Engineering</a> at the <a href="http://www.unam.mx/">National Autonomous University of Mexico</a> (UNAM) and consists of approximately 18 hours of Mexican Spanish broadcast speech with associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition. For more information and documentation see the <a href="http://www.CIEMPIESS.org/">CIEMPIESS-UNAM Project website</a>.</p><br> <p>CIEMPIESS Balance is a companion corpus to CIEMPIESS Light, released by LDC as <a href="../../../LDC2017S23">LDC2017S23</a>. It was developed so that the data sets together constitute a gender-balanced corpus. The gender breakdown in CIEMPIESS Light is approximately 75% male and 25% female. In CIEMPIESS Balance the gender breakdown is approximately 25% male and 75% female.</p><br> <p>LDC has released the following data sets in the CIEMPIESS series:</p><br> <ul><br> <li>CIEMPIESS (<a href="../../../LDC2015S07">LDC2015S07</a>)</li><br> <li>CHM150 (<a href="../../../LDC2016S04">LDC2016S04</a>)</li><br> <li>CIEMPIESS Light (<a href="../../../LDC2017S23">LDC2017S23</a>)</li><br> <li>CIEMPIESS Experimentation (<a href="../../../LDC2019S07">LDC2019S07</a>)</li><br> </ul><br> <h3>Data</h3><br> <p>The majority of the speech recordings were collected from <a href="http://www.derecho.unam.mx/cultura-juridica/radio.php">Radio-IUS</a>, a UNAM radio station. Other recordings were taken from <a href="https://www.youtube.com/user/DEDUNAM/videos">IUS Canal Multimedia</a> and <a href="https://www.youtube.com/channel/UCTxkzdUd0tiXT5BN5o6Xo-A/videos">Centro Universitario de Estudios Jur&iacute;dicos</a> (CUEJ UNAM). 
These two channels feature videos with speech on legal issues and topics related to UNAM.</p><br> <p>The audio files in this release are 16 kHz, 16-bit PCM in FLAC format. Transcripts are UTF-8 encoded plain text.</p><br> <h3>Samples</h3><br> <p>Please see this <a href="desc/addenda/LDC2018S11.flac">speech sample</a> and <a href="desc/addenda/LDC2018S11.txt">transcript sample</a>.</p><br> <h3>Acknowledgements</h3><br> <p>The authors would like to thank Alejandro V. Mena, Elena Vera and Ang&eacute;lica Guti&eacute;rrez for their support of the social service program "Desarrollo de Tecnolog&iacute;as del Habla." They also thank the social service students for their hard work.</p><br> <h3>Updates</h3><br> <p>None at this time.</p><br> Portions © 2018 Carlos Daniel Hernández Mena, © 2018 Universidad Nacional Autónoma de México, © 2018 Trustees of the University of Pennsylvania
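
The Data section states that the audio is 16 kHz, 16-bit PCM in FLAC format. As a rough sketch of how a downloaded file could be checked against that description, the Python below reads the sample rate and bit depth from the STREAMINFO block at the start of a FLAC stream, using only the standard library. The function names (`parse_flac_streaminfo`, `matches_release_spec`, `make_test_header`) are hypothetical and not part of any LDC tooling. The synthetic header exists only so the sketch can be run without the corpus, and the channel count is not checked because the description does not state it.

```python
def parse_flac_streaminfo(data: bytes) -> dict:
    """Read basic audio parameters from the first bytes of a FLAC stream."""
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream (missing 'fLaC' marker)")
    block_type = data[4] & 0x7F              # low 7 bits: metadata block type
    length = int.from_bytes(data[5:8], "big")
    if block_type != 0 or length < 34:
        raise ValueError("first metadata block is not STREAMINFO")
    body = data[8:8 + 34]
    if len(body) < 34:
        raise ValueError("truncated STREAMINFO block")
    # Bytes 10..17 pack: 20 bits sample rate, 3 bits (channels - 1),
    # 5 bits (bits per sample - 1), 36 bits total samples.
    packed = int.from_bytes(body[10:18], "big")
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def matches_release_spec(info: dict) -> bool:
    """Check the two properties the description states: 16 kHz and 16-bit."""
    return info["sample_rate"] == 16000 and info["bits_per_sample"] == 16


def make_test_header(rate: int, channels: int, bits: int, total: int) -> bytes:
    """Build a synthetic FLAC header (illustration only, no audio frames)."""
    packed = (rate << 44) | ((channels - 1) << 41) | ((bits - 1) << 36) | total
    body = bytes(10) + packed.to_bytes(8, "big") + bytes(16)
    # 0x80 = "last metadata block" flag set, block type 0 (STREAMINFO).
    return b"fLaC" + bytes([0x80]) + (34).to_bytes(3, "big") + body
```

For a real file, passing the first 42 bytes is enough, for example `parse_flac_streaminfo(open(path, "rb").read(42))`, where `path` points to a local copy of one of the corpus recordings.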
Provider:
Linguistic Data Consortium
Created:
2020-11-30