
cjweaver/ARU_speech_corpus

Hugging Face · Updated 2026-04-30 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/cjweaver/ARU_speech_corpus
Resource description:
---
license: cc-by-4.0
language:
- en
task_categories:
- automatic-speech-recognition
- audio-classification
pretty_name: ARU Speech Corpus
size_categories:
- 1K<n<10K
tags:
- speech
- british-english
- ieee-sentences
- anechoic
- audio
---

# Dataset Card for ARU Speech Corpus

The ARU Speech Corpus is a high-quality collection of IEEE (Harvard) sentences recorded in anechoic conditions by twelve native British English speakers. This dataset was created at the University of Liverpool's Acoustics Research Unit for speech intelligibility research.

## Dataset Details

### Dataset Description

The ARU speech corpus comprises single-channel recordings of 720 IEEE sentences spoken by twelve adult native British English speakers (6 male, 6 female) in controlled anechoic conditions. All recordings were made in October and November 2017 using professional-grade audio equipment in the Acoustics Research Unit's anechoic chamber.

The corpus features a high sampling rate (65,536 Hz), 24-bit depth, and careful signal processing to ensure consistent speech levels across all recordings. Speakers were selected for near Received Pronunciation accents and underwent audiometric screening to ensure normal hearing ability.

- **Curated by:** Dr. Simone Graetzer, Dr. Gary Seiffert, and Professor Carl Hopkins (Acoustics Research Unit, University of Liverpool)
- **Funded by:** HM Government
- **Shared by:** University of Liverpool, Acoustics Research Unit
- **Language(s) (NLP):** English (en-GB, British English)
- **License:** CC-BY-4.0 (verify based on original repository terms)

### Dataset Sources

- **Repository:** https://datacat.liverpool.ac.uk/681/
- **Paper:** Hopkins, C., Graetzer, S., Seiffert, G. (2019).
ARU adult British English speaker corpus of IEEE sentences (ARU speech corpus) version 1.0

## Uses

### Direct Use

This dataset is suitable for:

- **Automatic Speech Recognition (ASR)** training and evaluation, particularly for British English
- **Speech intelligibility research** in noise and reverberant conditions
- **Speaker recognition and verification** systems
- **Accent classification** and dialect studies
- **Speech quality assessment** benchmarking
- **Audio signal processing** algorithm development
- **Text-to-speech (TTS)** evaluation using reference speech
- **Acoustic model training** for British English variants

The high sampling rate (65,536 Hz) makes it particularly valuable for wideband and super-wideband speech processing research.

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

This dataset should not be used for:

- **Speaker identification for surveillance purposes** - violates participant consent terms
- **Biometric authentication systems** - participants did not consent to such use
- **Training models for strongly regional British accents** - speakers were specifically selected for near Received Pronunciation
- **Emotional speech recognition** - recordings were made with neutral, conversational delivery
- **Spontaneous speech modeling** - content is read speech from standardized sentence lists
- **Multi-speaker or overlapping speech scenarios** - all recordings are single-speaker
- **Noisy or reverberant speech modeling** - recordings made in anechoic conditions

## Dataset Structure

<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc.
-->

### Data Instances

Each instance contains:

- **audio**: Audio file at 65,536 Hz sampling rate, 24-bit depth
- **speaker_id**: Two-digit identifier (01-12)
- **sex**: Speaker gender (M/F)
- **age**: Speaker age in years (21-47)
- **accent**: Geographic origin (county where primary/secondary education completed)
- **list_number**: IEEE sentence list number (1-72)
- **sentence_number**: Sentence number within list (1-10)
- **text**: IEEE sentence transcription (if available)

### Data Splits

Due to the limited number of speakers (12 total), splits are made at the speaker level and designed to maintain a 50/50 gender balance:

| Split | Speakers | Percentage | Files |
|-------|----------|------------|-------|
| Train | 8 (4M, 4F) | 66.7% | ~5,760 |
| Test | 2 (1M, 1F) | 16.7% | ~1,440 |
| Validation | 2 (1M, 1F) | 16.7% | ~1,440 |

**Total**: 8,640 utterances (12 speakers × 720 sentences)

### File Naming Convention

Files follow the pattern:

`ID{speaker}_ARU_Fs=65536Hz_Standard speech - List {list_num} - Sentence {sent_num} - Version 1_0.wav`

Example: `ID01_ARU_Fs=65536Hz_Standard speech - List 1 - Sentence 1 - Version 1_0.wav`

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

This corpus was created as part of a larger research project investigating speech intelligibility in noise. The goal was to obtain high-quality reference recordings of standardized speech materials (IEEE sentences) from native British English speakers in controlled acoustic conditions. The anechoic recording environment eliminates room reflections, making the recordings suitable for adding controlled acoustic conditions in post-processing for intelligibility studies.

### Source Data

<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...).
-->

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

**Recording Setup:**

- **Environment**: ARU anechoic chamber (internal dimensions 5 m × 4 m × 2.6 m)
- **Microphone**: Brüel & Kjær Type 4190 free-field half-inch microphone
- **Preamplifier**: Brüel & Kjær Type 2669 (No. 3004348)
- **Conditioning amplifier**: Brüel & Kjær Nexus (Serial 2301697)
- **Generator module**: Brüel & Kjær LAN-XI Type 3160-A 4/2
- **Recording software**: Brüel & Kjær Pulse Time Data Recorder v20
- **Microphone distance**: 1 m on-axis from speaker
- **Sampling rate**: 65,536 Hz
- **Bit depth**: 24 bits per sample

**Signal Processing:**

1. High-pass filtering to remove energy below 60 Hz (Finite Impulse Response filter with Kaiser window method)
2. Low-pass filtering to attenuate energy above 9 kHz (removes electrical background noise)
3. Normalization using the activlev function from VOICEBOX (Brookes, 2014-2016) to achieve consistent active speech levels according to ITU-T P.56 (2011)

**Recording Procedure:**

- Speakers seated comfortably in anechoic chamber
- Instructed to speak with "normal vocal effort, as you would in everyday conversation"
- Sentences presented in randomized order
- Video monitoring to ensure speakers faced microphone
- Repetition allowed for hesitations or errors

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available.
-->

**Speaker Demographics:**

| ID | Gender | Age | Geographic Origin (County, Country) |
|----|--------|-----|-------------------------------------|
| 01 | M | 47 | Avon, England |
| 02 | M | 21 | Ceredigion, Wales |
| 03 | F | 23 | Berkshire, England |
| 04 | F | 35 | Surrey and Middlesex, England |
| 05 | M | 35 | Denbighshire and Conwy, Wales |
| 06 | M | 47 | Kent, England |
| 07 | F | 24 | Norfolk, England |
| 08 | F | 32 | Merseyside, England |
| 09 | F | 44 | Wirral, England |
| 10 | M | 29 | Cheshire, England |
| 11 | F | 45 | East Sussex, England |
| 12 | M | 32 | Leicestershire, England |

**Selection Criteria:**

- Native British English speakers (first language)
- Age range: 20-60 years (actual: 21-47)
- Completed all primary and secondary schooling in the UK
- Accents not strongly regional (preference for near Received Pronunciation)
- Non-smokers with no recent smoking history
- No history of speech disorders or treatment by speech pathologist
- No medical conditions affecting vocal apparatus (vocal folds, larynx, trachea, pharynx, esophagus, respiratory system)
- Not taking medications affecting speech-related anatomy
- Self-reported normal hearing ability
- Passed audiometric screening: thresholds of 20 dB HL or better (age-adjusted) at frequencies from 125 Hz to 8 kHz (per BS EN ISO 8253-1:2010)

### Annotations

<!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. -->

#### Annotation process

<!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. -->

The dataset uses IEEE (Harvard) sentences, which are standardized phonetically-balanced sentences commonly used in speech research. Transcriptions are available from the IEEE standard (IEEE, 1969).
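The file naming convention given under Dataset Structure encodes the speaker, list, and sentence numbers, which is enough to pair a recording with its IEEE transcription. A minimal parsing sketch follows, assuming every file matches the documented pattern exactly; the `Utterance` container and `parse_filename` helper are hypothetical names introduced here for illustration and are not part of the dataset.

```python
import re
from typing import NamedTuple

# Regular expression built from the pattern documented in the card:
# ID{speaker}_ARU_Fs=65536Hz_Standard speech - List {list_num} - Sentence {sent_num} - Version 1_0.wav
FILENAME_RE = re.compile(
    r"^ID(?P<speaker>\d{2})_ARU_Fs=(?P<fs>\d+)Hz_Standard speech"
    r" - List (?P<list_num>\d+) - Sentence (?P<sent_num>\d+)"
    r" - Version 1_0\.wav$"
)


class Utterance(NamedTuple):
    """Hypothetical container for the fields encoded in a filename."""
    speaker_id: str
    sampling_rate: int
    list_number: int
    sentence_number: int


def parse_filename(name: str) -> Utterance:
    """Extract speaker, sampling rate, list and sentence numbers from a filename."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    return Utterance(
        speaker_id=match["speaker"],
        sampling_rate=int(match["fs"]),
        list_number=int(match["list_num"]),
        sentence_number=int(match["sent_num"]),
    )


example = "ID01_ARU_Fs=65536Hz_Standard speech - List 1 - Sentence 1 - Version 1_0.wav"
print(parse_filename(example))
# Utterance(speaker_id='01', sampling_rate=65536, list_number=1, sentence_number=1)
```

The speaker ID is kept as a zero-padded string so that it matches the two-digit identifiers used in the demographics table.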
Speaker metadata (ID, gender, age, accent) was collected during participant screening and is encoded in filenames and metadata fields.

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

Not applicable - the dataset uses pre-existing IEEE sentence materials. Metadata was collected by the research team during participant recruitment and screening.

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

**Privacy Protections:**

- Participants are identified only by anonymous ID numbers (01-12)
- No names, contact details, or uniquely identifiable information is included
- Only limited demographic information is shared: gender, age (in years), and county of education
- All participants provided informed consent specifically for public distribution of recordings
- Participants were explicitly informed that age and educational county would be associated with their recordings

**Consent Process:**

- Two-part screening procedure with full informed consent
- Participants received written information sheets explaining data usage
- Explicit consent obtained for public distribution via the ARU website
- Participants paid £15 per recording session (Tesco vouchers)
- Right to withdraw data until November 30, 2018 (before public release)

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations.
-->

**Demographic Limitations:**

- **Limited speaker diversity**: Only 12 speakers total
- **Age range**: 21-47 years (excludes children, older adults)
- **Geographic bias**: Primarily England (10 speakers), limited Welsh representation (2 speakers)
- **Accent bias**: Selected for near Received Pronunciation, not representative of regional British English varieties
- **Health bias**: Excludes speakers with any hearing loss, speech disorders, or smoking history
- **Socioeconomic bias**: Likely skewed toward university-affiliated individuals

**Technical Limitations:**

- **Anechoic conditions**: Not representative of real-world acoustic environments
- **Read speech only**: Does not capture spontaneous speech characteristics
- **Limited phonetic content**: Restricted to IEEE sentence set
- **Single-channel**: No multi-microphone or spatial audio data
- **High sampling rate**: 65,536 Hz may require downsampling for many applications

**Ethical Considerations:**

- Voice biometrics could potentially identify speakers despite anonymization
- Dataset should not be used for surveillance or unauthorized speaker identification
- Limited diversity may lead to biased model performance on underrepresented demographics

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations.
-->

Users should:

- **Acknowledge limitations** when reporting results, particularly regarding accent and demographic diversity
- **Downsample appropriately** for applications not requiring super-wideband audio (most ASR systems use 16 kHz)
- **Combine with other datasets** for more demographically diverse training data
- **Respect participant consent** by not using recordings for biometric identification or surveillance
- **Consider acoustic mismatch** when applying models trained on anechoic speech to real-world conditions
- **Evaluate fairness** across age, gender, and accent when using for model development
- **Cite properly** using the provided citation information
- **Verify license compliance** for commercial applications

## Citation

<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```bibtex
@misc{hopkins2019aru,
  author = {Hopkins, Carl and Graetzer, Simone and Seiffert, Gary},
  title = {ARU adult British English speaker corpus of IEEE sentences (ARU speech corpus) version 1.0},
  year = {2019},
  publisher = {University of Liverpool},
  howpublished = {Acoustics Research Unit, School of Architecture, University of Liverpool},
  doi = {10.17638/datacat.liverpool.ac.uk/681},
  url = {https://datacat.liverpool.ac.uk/681/}
}
```

**APA:**

Hopkins, C., Graetzer, S., & Seiffert, G. (2019). ARU adult British English speaker corpus of IEEE sentences (ARU speech corpus) version 1.0 [Data set]. Acoustics Research Unit, School of Architecture, University of Liverpool.
https://doi.org/10.17638/datacat.liverpool.ac.uk/681

## Glossary

- **IEEE sentences**: Phonetically-balanced sentences from IEEE Recommended Practice for Speech Quality Measurements (1969), also known as Harvard sentences
- **Anechoic chamber**: Room designed to absorb sound reflections, creating a reflection-free environment
- **Received Pronunciation (RP)**: Accent traditionally considered standard British English, historically associated with educated speakers in southern England
- **Active speech level**: Speech level measurement excluding pauses, per ITU-T P.56
- **dB HL**: Decibels Hearing Level, audiometric measurement relative to normal hearing thresholds
- **Sampling rate 65,536 Hz**: Super-wideband sampling rate (2^16 Hz) with a Nyquist frequency of 32,768 Hz, so frequencies up to ~32 kHz can be represented; note that the recordings themselves were low-pass filtered to attenuate energy above 9 kHz

## More Information

**Related References:**

- IEEE (1969). Recommended practice for speech quality measurements. IEEE Transactions on Audio and Electroacoustics, 17(3), 227-246.
- ITU-T P.56 (2011). Objective measurement of active speech level. International Telecommunication Union.
- Brookes, M. (2014-2016). VOICEBOX: Speech Processing Toolbox for MATLAB. http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
- BS EN ISO 8253-1:2010. Acoustics: Audiometric test methods part 1: Basic pure tone air and bone conduction threshold audiometry.

**Contact Information:**

Acoustics Research Unit, School of Architecture
University of Liverpool
Abercromby Square, Liverpool L69 7ZN, United Kingdom

### Dataset Card Authors

Chris Weaver, Logitech Inc.

### Dataset Card Contact

For questions about this dataset card or the Hugging Face repository, contact cweaver@logitech.com

For questions about the original dataset, contact the Acoustics Research Unit at the University of Liverpool or email: carl.hopkins@liv.ac.uk
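The speaker-level, gender-balanced splits described under Data Splits could be reproduced along the following lines. This is only a sketch: the card does not say which speakers belong to which split, so the seeded random assignment and the `split_speakers` helper are assumptions made for illustration. The speaker IDs and genders are copied from the demographics table.

```python
import random

# Speaker IDs and gender copied from the card's demographics table.
SPEAKER_SEX = {
    "01": "M", "02": "M", "03": "F", "04": "F", "05": "M", "06": "M",
    "07": "F", "08": "F", "09": "F", "10": "M", "11": "F", "12": "M",
}
SENTENCES_PER_SPEAKER = 720  # 72 lists x 10 sentences


def split_speakers(seed: int = 0) -> dict[str, list[str]]:
    """Assign whole speakers to splits so each split is half male, half female.

    The assignment is hypothetical: it is NOT the split shipped with the dataset.
    """
    rng = random.Random(seed)
    males = sorted(s for s, sex in SPEAKER_SEX.items() if sex == "M")
    females = sorted(s for s, sex in SPEAKER_SEX.items() if sex == "F")
    rng.shuffle(males)
    rng.shuffle(females)
    return {
        "train": sorted(males[:4] + females[:4]),            # 4M + 4F
        "test": sorted(males[4:5] + females[4:5]),           # 1M + 1F
        "validation": sorted(males[5:6] + females[5:6]),     # 1M + 1F
    }


splits = split_speakers()
for name, speakers in splits.items():
    # Utterance count per split: speakers x 720 sentences each.
    print(name, speakers, len(speakers) * SENTENCES_PER_SPEAKER)
```

Splitting by speaker rather than by utterance keeps every voice in exactly one split, so test results are not inflated by a model having already heard the same speaker during training.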

Provided by:
cjweaver
Collected summary
Dataset introduction
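Construction method
This dataset was constructed by the Acoustics Research Unit at the University of Liverpool to provide high-quality reference recordings for speech intelligibility research. Twelve adult native speakers of British English (6 male, 6 female) recorded 720 standardized IEEE (Harvard) sentences in an anechoic chamber using professional-grade equipment. The recordings use a 65,536 Hz sampling rate and 24-bit depth and were made with Brüel & Kjær equipment. In post-processing, finite impulse response filters removed energy below 60 Hz and electrical noise above 9 kHz, and all speech was normalized to a consistent active speech level according to ITU-T P.56 using the VOICEBOX toolbox, so that levels are balanced across recordings.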
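The two filtering steps summarized above can be illustrated with windowed-sinc FIR filters shaped by a Kaiser window. This is a sketch under stated assumptions, not the authors' processing chain: the 60 dB stopband attenuation and the 40 Hz and 1,000 Hz transition widths are values chosen here for illustration, since the card gives only the 60 Hz and 9 kHz band edges. The helper names are hypothetical, and NumPy is assumed to be installed.

```python
import numpy as np

FS = 65_536  # sampling rate stated in the card (Hz)


def kaiser_params(atten_db: float, width_hz: float, fs: float = FS) -> tuple[int, float]:
    """Kaiser's empirical estimate of FIR length and window shape (beta)."""
    delta_omega = 2 * np.pi * width_hz / fs  # transition width in rad/sample
    numtaps = int(np.ceil((atten_db - 7.95) / (2.285 * delta_omega))) + 1
    numtaps |= 1  # force odd length so the high-pass (type I) design is valid
    if atten_db > 50:
        beta = 0.1102 * (atten_db - 8.7)
    elif atten_db >= 21:
        beta = 0.5842 * (atten_db - 21) ** 0.4 + 0.07886 * (atten_db - 21)
    else:
        beta = 0.0
    return numtaps, beta


def lowpass_taps(cutoff_hz: float, atten_db: float, width_hz: float, fs: float = FS) -> np.ndarray:
    """Windowed-sinc low-pass filter with unity gain at DC."""
    numtaps, beta = kaiser_params(atten_db, width_hz, fs)
    n = np.arange(numtaps) - (numtaps - 1) / 2  # sample index centred on zero
    fc = cutoff_hz / fs                         # cutoff in cycles per sample
    h = 2 * fc * np.sinc(2 * fc * n) * np.kaiser(numtaps, beta)
    return h / h.sum()


def highpass_taps(cutoff_hz: float, atten_db: float, width_hz: float, fs: float = FS) -> np.ndarray:
    """High-pass filter obtained by spectral inversion (unit impulse minus low-pass)."""
    h = -lowpass_taps(cutoff_hz, atten_db, width_hz, fs)
    h[(len(h) - 1) // 2] += 1.0
    return h


# Assumed design values: 60 dB attenuation, 40 Hz and 1 kHz transition widths.
hp = highpass_taps(60.0, atten_db=60.0, width_hz=40.0)
lp = lowpass_taps(9_000.0, atten_db=60.0, width_hz=1_000.0)
print(len(hp), len(lp))


def band_limit(x: np.ndarray) -> np.ndarray:
    """Apply the high-pass and then the low-pass filter to a mono signal."""
    return np.convolve(np.convolve(x, hp, mode="same"), lp, mode="same")
```

The high-pass filter comes out far longer than the low-pass one because a 60 Hz band edge needs a very narrow transition band relative to the 65,536 Hz sampling rate. The level normalization step (active speech level per ITU-T P.56) is not reproduced here.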
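Usage
This dataset is broadly suited to training and evaluating automatic speech recognition (ASR) systems, particularly for British English. Its high sampling rate makes it valuable for wideband and super-wideband speech processing research, and it can also be used for speaker recognition, speech intelligibility enhancement, text-to-speech evaluation, and the development of audio signal processing algorithms. Note that the anechoic environment and read-speech style differ acoustically from real application scenarios. Users should downsample as appropriate for common 16 kHz systems, combine the corpus with other datasets to improve diversity, and avoid uses that fall outside participant consent, such as biometric identification or surveillance.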
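Because 65,536 Hz is not an integer multiple of 16,000 Hz, downsampling for a typical ASR front end needs a rational resampling ratio. The sketch below derives that ratio with the standard library and shows one way to apply it, assuming SciPy is installed and using its polyphase resampler `scipy.signal.resample_poly`. The helper names are hypothetical, and the card itself does not prescribe a resampling method.

```python
from fractions import Fraction


def resample_factors(src_hz: int, dst_hz: int) -> tuple[int, int]:
    """Return the (up, down) integer factors for rational resampling."""
    ratio = Fraction(dst_hz, src_hz)  # reduced to lowest terms automatically
    return ratio.numerator, ratio.denominator


def downsample(samples, src_hz: int = 65_536, dst_hz: int = 16_000):
    """Resample a 1-D array of audio samples from src_hz to dst_hz.

    SciPy is imported lazily and is assumed to be installed.
    """
    from scipy.signal import resample_poly

    up, down = resample_factors(src_hz, dst_hz)
    # resample_poly upsamples by `up`, applies an anti-aliasing FIR filter,
    # then keeps every `down`-th sample.
    return resample_poly(samples, up, down)


print(resample_factors(65_536, 16_000))  # (125, 512)
```

At 16 kHz the Nyquist frequency is 8 kHz, slightly below the 9 kHz low-pass edge mentioned in the card, so a small amount of high-frequency content is discarded by this conversion.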
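Background and challenges
Background overview
The ARU Speech Corpus is a high-quality speech dataset created in 2017 by the Acoustics Research Unit at the University of Liverpool to provide standardized reference material for speech intelligibility research. Its construction was led by Dr. Simone Graetzer, Dr. Gary Seiffert, and Professor Carl Hopkins, it was funded by the UK government, and it was formally released in 2019. The core research problem is the collection of standardized British English speech material in a controlled acoustic environment, in support of algorithm development and evaluation in automatic speech recognition, speaker recognition, and audio signal processing. The dataset contains single-channel audio of 720 IEEE sentences recorded in an anechoic chamber by 12 native British English speakers (half male, half female), at a sampling rate of 65,536 Hz and a bit depth of 24 bits, and it is an influential resource in speech intelligibility research and wideband speech processing.
Current challenges
The domain challenges this dataset addresses include: 1) speech recognition systems lack high-quality reference material for standard British English recorded under strict acoustic control, which makes it hard to evaluate accurately how models generalize to controlled speech conditions; 2) clean anechoic recordings provide a baseline for adding controlled noise and reverberation in post-processing, but the acoustic complexity of speech in real applications is difficult to simulate fully. The challenges during construction were: 1) the strict speaker selection criteria, including a near Received Pronunciation accent and no hearing impairment or speech disorder, yielded only 12 qualifying speakers, which limits the dataset's diversity and size; 2) recording in an anechoic chamber ensured clean signals but increased collection cost and time, and cannot capture the variation in speaking rate and prosody found in natural conversation.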
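Common scenarios
Classic use cases
In automatic speech recognition and language identification, the ARU Speech Corpus, with its high sampling rate and standardized IEEE sentences, serves as a foundation for training British English acoustic models. Because it was recorded under anechoic conditions, it provides clean speech signals and is widely used to establish performance baselines for speech recognition systems in ideal acoustic environments. Its 12 speakers, balanced by gender and with near-standard pronunciation, also provide high-quality control data for speaker recognition and verification tasks, and are especially valuable in research on accent generalization.
Academic problems addressed
The dataset responds to a core challenge in speech science: how to build high-fidelity reference material under controlled acoustic conditions. It addresses the lack of a standardized baseline for evaluating how sensitive acoustic models are to background noise and reverberation, and its anechoic recordings give speech intelligibility modeling a reproducible experimental basis. It also fills a gap in standard-pronunciation British English material for super-wideband research and supports validation of the ITU-T P.56 active speech level algorithm on the dataset, which improves the objective evaluation of speech processing algorithms under noise interference.
Practical applications
In practice, the ARU Speech Corpus mainly serves the development of high-end hearing aid algorithms and the optimization of audio quality in speech communication systems. Its very high sampling rate lets audio engineers test noise reduction and echo cancellation algorithms across the super-wideband range. The dataset is also used to help build educational speech-learning software in which learners compare their pronunciation against a reference. In public safety, its standardized sentences have been applied to prototype testing of voice command recognition systems in defense projects, to confirm that the systems respond stably in controlled environments.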
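Recent research on the dataset
Latest research directions
Construction and frontier applications of a high-fidelity benchmark dataset for British English speech perception and recognition in anechoic environments
The content above was collected and summarized by 遇见数据集.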