TIMIT
帕依提提 (Payititi) · Indexed 2024-03-04
Download link:
https://www.payititi.com/opendatasets/show-123.html
Description:
The TIMIT corpus of read speech was designed to provide the speech research community with a standardized corpus for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. The creation of any reasonably sized speech corpus is very labor intensive. With this in mind, TIMIT was designed to balance utility and manageability, containing small amounts of speech from a relatively diverse speaker population and a range of phonetic environments. This section provides more detailed information on the contents of TIMIT and on the division of the TIMIT speech material into subsets for training and testing purposes.
TIMIT contains a total of 6300 utterances: 10 sentences spoken by each of 630 speakers from 8 major dialect divisions of the United States. The 10 sentences represent roughly 30 seconds of speech material per speaker. In total, the corpus contains approximately 5 hours of speech. All speakers are native speakers of American English and were judged by a professional speech pathologist to have no clinical speech pathologies.
The speakers were primarily TI personnel, many of whom were new to TI and the Dallas area. They were selected to be representative of different geographical dialect regions of the U.S. A speaker's dialect region was defined as the geographical area of the U.S. where he or she lived during childhood (age 2 to 10). The geographical areas correspond with recognized dialect regions of the U.S. (Language Files, Ohio State University Linguistics Dept., 1982), with two exceptions: the Western dialect region (dr7), in which dialect boundaries are not known with any confidence, and "dialect region" 8, where the speakers moved around a lot during their childhood. The locale of each speaker's childhood is indicated by a color-coded marker on the map.
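The corpus-size figures quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the TIMIT distribution; the function name `corpus_totals` is hypothetical, and the 30-second figure is the approximate per-speaker duration given in the description, so the resulting hour count is only an estimate.

```python
# Figures taken from the description above.
NUM_SPEAKERS = 630
SENTENCES_PER_SPEAKER = 10
SECONDS_PER_SPEAKER = 30  # "roughly 30 seconds", so treat as approximate


def corpus_totals(num_speakers, sentences_per_speaker, seconds_per_speaker):
    """Return (utterance count, total hours) implied by per-speaker figures."""
    utterances = num_speakers * sentences_per_speaker
    hours = num_speakers * seconds_per_speaker / 3600
    return utterances, hours


utterances, hours = corpus_totals(
    NUM_SPEAKERS, SENTENCES_PER_SPEAKER, SECONDS_PER_SPEAKER
)
print(f"{utterances} utterances, about {hours:.2f} hours")
# prints "6300 utterances, about 5.25 hours"
```

The estimate of about 5.25 hours is consistent with the "approximately 5 hours" stated in the description, given that the per-speaker duration is itself a rounded figure.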
Recordings were made in a noise-isolated recording booth at TI, using a semi-automatic computer system (STEROIDS) to control the presentation of prompts to the speaker and the recording. Two-channel recordings were made using a Sennheiser HMD 414 headset-mounted microphone and a Brüel & Kjær 1/2" far-field pressure microphone (#4165). The speech was directly digitized at a sample rate of 20 kHz using a Digital Sound Corporation DSC 200 with the anti-aliasing filter at 10 kHz. The speech was then digitally filtered, debiased, and downsampled to 16 kHz.
Subjects were seated in the recording booth and prompts were presented on a monitor. The subjects wore earphones through which a low level (approximately 53 dB SPL) of background noise was played to eliminate the unusual voice quality produced by the "dead room" effect. TI attempted to keep both the recording gain and the level of noise in the subject's earphones constant during the collection. At the beginning of each recording day, a standard calibration tone was recorded from each microphone, and the voltage at the subject's earphones was checked and adjusted as necessary.
The speakers were given minimal instructions and asked to read the prompts in a "natural" voice. The recordings were monitored, and any suspected mispronunciations were flagged for verification. Verification consisted of both the monitor and the speaker listening to the utterance. When a pronunciation error was detected, the sentence was re-recorded. Variant pronunciations were not counted as mistakes.
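The sample-rate figures in the recording description (digitized at 20 kHz with a 10 kHz anti-aliasing filter, then downsampled to 16 kHz) can be related with a short calculation. This is an illustrative sketch only: the source does not say how the digital filtering or downsampling was implemented, and the helper names `resampling_plan` and `resampled_length` are hypothetical. It simply derives the rational rate ratio and the Nyquist frequency at each rate.

```python
from fractions import Fraction

# Sample rates taken from the recording description.
ORIGINAL_RATE_HZ = 20_000
TARGET_RATE_HZ = 16_000


def resampling_plan(original_rate_hz, target_rate_hz):
    """Reduce the rate change to an up/down integer ratio and report Nyquist limits."""
    ratio = Fraction(target_rate_hz, original_rate_hz)
    return {
        "up": ratio.numerator,
        "down": ratio.denominator,
        "original_nyquist_hz": original_rate_hz / 2,
        "target_nyquist_hz": target_rate_hz / 2,
    }


def resampled_length(num_samples, up, down):
    """Number of output samples for a rational rate change (rounded up)."""
    return (num_samples * up + down - 1) // down


plan = resampling_plan(ORIGINAL_RATE_HZ, TARGET_RATE_HZ)
print(plan)

# One second of audio at the original rate.
print(resampled_length(ORIGINAL_RATE_HZ, plan["up"], plan["down"]))
```

The 10 kHz anti-aliasing filter matches the Nyquist frequency of the 20 kHz capture rate. Since the Nyquist frequency at 16 kHz is 8 kHz, content between 8 and 10 kHz would need to be attenuated before downsampling, which is presumably one purpose of the digital filtering step mentioned in the description.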
Provider:
帕依提提 (Payititi)
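Collected summary
Dataset introduction

Background and challenges
Background overview
TIMIT is a standardized speech corpus designed for the speech research community, intended for acquiring acoustic-phonetic knowledge and for developing and evaluating automatic speech recognition systems. It contains 6300 utterances recorded by 630 speakers from 8 major dialect regions of the United States, each contributing 10 sentences, for a total of approximately 5 hours of speech. All speakers are native speakers of American English with no clinical speech pathologies. Recordings were made in a noise-isolated booth with calibrated, high-quality equipment, and the range of speakers and dialect regions provides the corpus's diversity.
The above content was collected and summarized by 遇见数据集.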



