CIEMPIESS

Name: CIEMPIESS
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:27:15
License: No description available

Source: DataCite Commons. Updated 2021-07-01; indexed 2025-04-16.

Download link:

https://catalog.ldc.upenn.edu/LDC2015S07

Resource description:

<h3>Introduction</h3><br> <p>CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the <a href="http://odin.fi-b.unam.mx/profesores/abelherrera/">Speech Processing Laboratory</a> of the Faculty of Engineering at the <a href="http://www.unam.mx/">National Autonomous University of Mexico</a> (UNAM) and consists of approximately 18 hours of Mexican Spanish radio speech, associated transcripts, pronouncing dictionaries and language models. The goal of this work was to create acoustic models for automatic speech recognition.</p><br> <p>For more information and documentation see the <a href="http://www.ciempiess.org/">CIEMPIESS-UNAM Project website</a>.</p><br> <p>LDC has released the following data sets in the CIEMPIESS series:</p><br> <ul><br> <li>CHM150 (<a href="../../../LDC2016S04">LDC2016S04</a>)</li><br> <li>CIEMPIESS Light (<a href="../../../LDC2017S23">LDC2017S23</a>)</li><br> <li>CIEMPIESS Balance (<a href="../../../LDC2018S11">LDC2018S11</a>)</li><br> <li>CIEMPIESS Experimentation (<a href="../../../LDC2019S07">LDC2019S07</a>)</li><br> </ul><br> <h3>Data</h3><br> <p>The speech recordings are from 43 one-hour FM radio programs broadcast by <a href="http://www.derecho.unam.mx/cultura-juridica/radio.php">Radio IUS</a>, a UNAM radio station. They consist of spontaneous conversations between a radio moderator and guests, principally about legal issues. Approximately 78% of the speakers were male and 22% were female.</p><br> <p>The audio was recorded in MP3 stereo format, using a 44.1 kHz sample rate and a bit-rate of 128 kbps or higher. Only "clean" utterances were selected from the raw data, meaning utterances made by a single speaker with no background noise, whispers, music, foreign accents, white noise or static. 
The audio files were converted to 16 kHz, 16-bit PCM WAV format for this release.</p><br> <p>The recordings were transcribed using <a href="http://www.fon.hum.uva.nl/praat/">PRAAT</a>, a tool designed for phonetics research. The transcripts are in Mexbet, a phonetic alphabet designed for Mexican Spanish based on Worldbet (Hieronymus, 1994). Plain text transcripts, TextGrid-format time labels and files useful for performing experiments with the <a href="http://www.cs.cmu.edu/~archan/sphinxInfo.html">SPHINX3</a> recognition software are also included.</p><br> <h3>Samples</h3><br> <p>Please view this <a href="desc/addenda/LDC2015S07.wav">audio sample</a> and <a href="desc/addenda/LDC2015S07.txt">transcript sample</a>.</p><br> <h3>Updates</h3><br> <p>None at this time.</p><br> Portions © 2015 Carlos Daniel Hernández Mena, © 2014 Universidad Nacional Autónoma de México, © 2015 Trustees of the University of Pennsylvania
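
The Data section states that the released audio is 16 kHz, 16-bit PCM WAV. The following is a minimal sketch, using only Python's standard `wave` module, of how a downloaded file could be checked against that description. The function names `describe_wav` and `matches_release_format` are illustrative assumptions and are not part of the corpus or any LDC tooling. The channel count is reported but not checked, because the description does not say whether the converted files are mono or stereo.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_width_bytes, sample_rate_hz, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample; 2 bytes means 16-bit
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    return channels, sample_width, sample_rate, duration


def matches_release_format(path):
    """True if the file is 16 kHz, 16-bit PCM, as the description states.

    The channel count is deliberately not checked (it is not stated in the source).
    """
    _, sample_width, sample_rate, _ = describe_wav(path)
    return sample_rate == 16000 and sample_width == 2
```

Summing the duration returned by `describe_wav` over all files in a local copy would also give a rough comparison against the "approximately 18 hours" figure quoted in the Introduction.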

Provider:

Linguistic Data Consortium

Created:

2020-11-30
