
WTIMIT 1.0

Mendeley Data · Updated 2024-01-31 · Indexed 2024-06-27
Download link:
https://catalog.ldc.upenn.edu/LDC2010S02
Resource description:
Introduction

WTIMIT 1.0 is a wideband mobile telephony derivative of the TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT, LDC93S1). TIMIT contains wideband speech recordings (i.e., sampled at 16 kHz) of 630 speakers of American English from eight major dialect regions, each reading ten phonetically rich sentences. The TIMIT speech corpus was completed in 1993 and was intended for acoustic-phonetic studies as well as for the development and evaluation of automatic speech recognition (ASR) systems.

In the meantime, five TIMIT derivatives have been developed: FFMTIMIT, NTIMIT, CTIMIT, HTIMIT, and STC-TIMIT. The FFMTIMIT (LDC96S32) corpus (Free-Field Microphone TIMIT) consists of the original TIMIT database as recorded by a free-field microphone. NTIMIT (LDC93S2) (Network TIMIT) serves as a telephone bandwidth adjunct to TIMIT, containing its speech files transmitted over a telephone handset and the NYNEX telephone network, subject to a large variety of channel conditions. For the cellular bandwidth speech corpus CTIMIT (LDC96S30), the original TIMIT recordings were passed through cellular telephone circuits. The HTIMIT (LDC98S67) corpus (Handset TIMIT) offers a TIMIT subset of 192 male and 192 female speakers played through different telephone handsets for the study of telephone transducer effects on speech. For the single-channel telephone corpus STC-TIMIT (LDC2008S03), the TIMIT recordings were sent through a real and, in contrast to NTIMIT, single telephone channel.

While some of these derivative TIMIT corpora consist of wideband speech, others are telephony corpora representing narrowband speech, i.e., sampled at 8 kHz and containing frequency components from about 300 Hz to 3.4 kHz. Until now, no real-world wideband telephony speech corpus has been publicly available.
Thanks to wideband speech codecs such as G.722, G.722.1, G.722.2 (i.e., Adaptive Multi-Rate Wideband, AMR-WB), and G.711.1, wideband telephony speech transmission is already feasible, even in an increasing number of mobile networks. Hence, a wideband telephone bandwidth adjunct to TIMIT is desirable for a wide range of scientific investigations, as well as for the development and evaluation of systems, e.g., Interactive Voice Response (IVR) systems.

WTIMIT 1.0 (Wideband Mobile TIMIT) contains the recordings of the original TIMIT speech files after transmission over a real 3G AMR-WB mobile network. WTIMIT 1.0 is organized according to the original TIMIT corpus. The training subset consists of 4620 speech files, while the test subset contains 1680 speech files. The speech format of the WTIMIT corpus is raw (i.e., no header information) and specified as follows:

- 16 kHz sampling rate
- 16 bit, 1-channel linear PCM sampling format
- little-endian byte order
- signed

Data

Data preparation was conducted by converting the original TIMIT speech files into raw data (i.e., dropping the first 1024 bytes of header information) and concatenating them into 11 signal chunks of at most 30 minutes duration each. In order to allow precise de-concatenation after transmission, and in order to be able to examine codec influence and channel distortion, each signal chunk is preceded by a 4 s calibration tone. It comprises 2 s of a 1 kHz sine wave followed by another 2 s of a linear sweep from 0 to 8 kHz. Once stored on a laptop PC, the prepared speech chunks were ready for transmission over T-Mobile's AMR-WB-capable 3G mobile network in The Hague, The Netherlands.

At the sending end, the speech chunks were played back by a laptop PC. Via an IEEE 1394 link (FireWire), the data was transmitted digitally to an external DAC (digital-to-analog converter) of type RME Fireface 400.
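The raw sample format and the calibration tone described in the data preparation above can be illustrated with a short Python sketch. This is not the authors' tooling: the function names, the 0.5 amplitude, and the zero starting phase are assumptions made for illustration. Only the sampling rate, the sample format, and the tone layout (2 s of 1 kHz sine, then 2 s of linear sweep from 0 to 8 kHz) are taken from the description.

```python
import math
import struct

FS = 16000  # sampling rate in Hz, as specified for WTIMIT


def calibration_tone(fs=FS, tone_hz=1000.0, sweep_end_hz=8000.0, seg_s=2.0, amp=0.5):
    """Return 2 s of sine followed by 2 s of linear sweep starting at 0 Hz.

    amp (0.5 of full scale) is an assumed value, not one given in the source.
    """
    n = int(fs * seg_s)
    sine = [amp * math.sin(2 * math.pi * tone_hz * i / fs) for i in range(n)]
    # Linear chirp: instantaneous frequency f(t) = k * t, so phase = pi * k * t^2.
    k = sweep_end_hz / seg_s
    sweep = [amp * math.sin(math.pi * k * (i / fs) ** 2) for i in range(n)]
    return sine + sweep


def to_raw_pcm16le(samples):
    """Encode floats in [-1, 1] as headerless signed 16-bit little-endian PCM."""
    ints = [max(-32768, min(32767, round(s * 32767))) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)


def from_raw_pcm16le(data):
    """Decode headerless signed 16-bit little-endian PCM into a list of ints."""
    n = len(data) // 2
    return list(struct.unpack("<%dh" % n, data[: 2 * n]))
```

Because the format has no header, the sampling rate and sample width must be supplied by the reader of the file; nothing in the byte stream records them.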
The analog signal from the DAC was then fed electrically into the microphone input of the transmitting Nokia 6220 mobile phone. For this purpose, an audio quality test cable for Nokia mobile phones was used. Prior to the actual transmission, the output attenuation of the DAC was adjusted so as to prevent analog saturation at the input circuit of the phone while ensuring optimal dynamic range. Furthermore, a call to the phone at the receiving end, a second mobile phone of type Nokia 6220, was established for each speech chunk separately. Using the field test monitoring software of the phones, we confirmed that they were situated in different network cells at all times during transmission; moreover, we verified that the proper speech codec was being employed: AMR-WB at a constant data rate of 12.65 kbit/s, which is by far the most widely used AMR-WB bitrate. Furthermore, the internal microphone equalization of the transmitting mobile phone was switched off.

At the receiving end, the analog headphone output of the receiving mobile phone was connected electrically to an ADC (analog-to-digital converter) of type RME Fireface 400. The analog input gain of the latter device was adjusted once initially to exploit the dynamic range of the ADC. Sampling was performed at a rate of 48 kHz, the native sampling rate of the ADC, and with 16 bit precision. The digital speech signals were transferred to a laptop PC, again via an IEEE 1394 link, and recorded onto a hard drive.

The transmitted speech chunks were decimated from 48 kHz to 16 kHz sampling rate using a high-quality lowpass filter. Finally, they were de-concatenated by maximizing the cross-correlation between them and the original speech files. We followed the de-concatenation methodology of STC-TIMIT, as described in "STC-TIMIT: Generation of a Single-channel Telephone Corpus", in order to assure a precise sample alignment to the TIMIT speech files.
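The two post-processing steps just described, decimation by a factor of 3 and alignment by maximizing a cross-correlation, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the STC-TIMIT procedure itself: the function names, the 121-tap Hamming-windowed sinc filter, its cutoff, and the bounded lag search are all hypothetical choices. A real pipeline handling 30-minute chunks would use an FFT-based correlation from a numerical library instead of these loops.

```python
import math


def lowpass_fir(num_taps=121, cutoff=0.31):
    """Hamming-windowed sinc lowpass with unity DC gain.

    cutoff is a fraction of the input Nyquist frequency; 0.31 of 24 kHz
    (about 7.4 kHz) is an assumed value just below the 8 kHz output Nyquist.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = cutoff if x == 0 else math.sin(math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    gain = sum(taps)
    return [t / gain for t in taps]


def decimate(x, factor=3, taps=None):
    """Lowpass filter x, then keep every factor-th sample (48 kHz -> 16 kHz)."""
    taps = taps or lowpass_fir()
    half = len(taps) // 2
    out = []
    for i in range(0, len(x), factor):
        acc = 0.0
        for j, t in enumerate(taps):
            k = i + half - j  # centered filter, so no net delay is introduced
            if 0 <= k < len(x):
                acc += t * x[k]
        out.append(acc)
    return out


def best_lag(received, reference, max_lag):
    """Return the lag in [-max_lag, max_lag] that maximizes
    sum(received[n + lag] * reference[n])."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(reference):
            m = n + lag
            if 0 <= m < len(received):
                score += received[m] * r
        if score > best_score:
            best, best_score = lag, score
    return best
```

The lag returned by `best_lag` is in samples; at a 16 kHz sampling rate one sample corresponds to 1/16000 s = 0.0625 ms.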
As a result of this alignment procedure, utterances in WTIMIT 1.0 can be considered time-aligned with those of TIMIT with an average precision of 0.0625 ms (one sample). In principle, TIMIT's original label files (*.TXT, *.WRD, *.PHN) are valid for WTIMIT as well. However, the channel was found to frequently produce misalignments of about 10 to 20 ms, mainly during speech pauses. Parts of the affected speech files are therefore slightly misaligned against the original label information. These channel effects may be related to the packet switching domain in the UMTS Core Network. Depending on the traffic load in the network, packets are buffered and queued, which results in a variable packet delay (jitter).

If you have any problems, questions or suggestions concerning WTIMIT, please send a brief email to Tim Fingscheidt (Technische Universität Braunschweig, Braunschweig, Germany): fingscheidt@ifn.ing.tu-bs.de.

Samples

Please examine the following samples for an example of the data in this corpus (raw audio has been converted to wav for purposes of demonstration): Audio File, Text, Words, Phonemes.

Acknowledgement

The authors would like to thank Mr. Dirk Kistowski-Cames, Deutsche Telekom AG, Bonn, Germany, for providing general project support and SIM cards, and Mr. Petri Lang, T-Mobile NL, The Hague, The Netherlands, for local support and SIM cards. Thanks also to Mr. Panu Nevala, Nokia, Oulu, Finland, for providing the prepared mobile phones, which are not available on the market in that form. This work was funded by the German Research Foundation (DFG) under grant no. FI 1494/2-1.

Portions © 2009, 2010 Tim Fingscheidt, © 1993, 2010 Trustees of the University of Pennsylvania

Created:
2024-01-31
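Background overview
WTIMIT 1.0 is a wideband mobile telephony speech dataset based on the TIMIT corpus, suitable for speech recognition and speaker recognition research. The speech was transmitted over a 3G AMR-WB network and is stored at a 16 kHz sampling rate in 16-bit, single-channel linear PCM format. It is time-aligned with the original TIMIT, although small timing offsets remain in places.
The summary above was collected and generated by 遇见数据集.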
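Since the corpus files are headerless, most audio tools need the format supplied explicitly. The sketch below wraps a raw file in a WAV container using Python's standard `wave` module. The function name and file paths are hypothetical. It relies on the fact that 16-bit WAV PCM is also signed and little-endian, so the bytes can be copied unchanged.

```python
import wave


def raw_to_wav(raw_path, wav_path, fs=16000, channels=1, sample_width=2):
    """Wrap headerless 16-bit little-endian PCM in a WAV container.

    Returns the number of audio frames written. The defaults follow the
    format stated for WTIMIT (16 kHz, 16 bit, 1 channel).
    """
    with open(raw_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)
        w.setframerate(fs)
        w.writeframes(pcm)
    return len(pcm) // (channels * sample_width)
```

A call such as `raw_to_wav("utterance.raw", "utterance.wav")` (placeholder file names) produces a file that standard players can open; the duration in seconds is the returned frame count divided by 16000.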