ai-coustics/dawn_chorus_en

Name: ai-coustics/dawn_chorus_en
Creator: ai-coustics
Published: 2026-03-27 14:58:37
License: CC BY-NC 4.0

Source: Hugging Face. Updated 2026-03-27; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/ai-coustics/dawn_chorus_en


Description:

---
license: cc-by-nc-4.0
task_categories:
- audio-to-audio
language:
- en
tags:
- speech
- foreground-background-speech
- speech-to-text
pretty_name: dawn_chorus_en
size_categories:
- n<1K
configs:
- config_name: default
  data_files:
  - split: eval
    path: "eval.parquet"
dataset_info:
  features:
  - name: mix
    dtype:
      audio:
        sampling_rate: 16000
  - name: speech
    dtype:
      audio:
        sampling_rate: 16000
  - name: transcript
    dtype: string
  - name: id
    dtype: string
  - name: language
    dtype: string
  - name: speaker_id
    dtype: string
  - name: conversation_type
    dtype: string
  - name: speech_source
    dtype: string
  - name: index
    dtype: int64
---

# dawn_chorus_en

An open-source evaluation dataset for accurate foreground speaker transcription. The dataset targets mixture conditions where foreground speech remains generally transcribable by speech-to-text systems, while background speech is distinctly perceived as background. It provides around 90 minutes of foreground–background speech mixtures composed of recorded and synthesized foreground speech, along with ground truth foreground speech and corresponding transcripts.

Inspired by [DAPS](https://ccrma.stanford.edu/~gautham/Site/daps.html), which frames speech enhancement as a direct transformation from real-world device recordings to professionally produced studio speech via aligned input–output pairs, we design this dataset around an equally application-driven mapping: from realistic foreground–background speech mixtures to isolated primary-speaker speech that remains robustly transcribable by downstream STT systems. Like DAPS, our approach emphasizes time-aligned references and real recording / transmission conditions rather than purely synthetic degradations, enabling evaluation of suppression strength versus foreground speech distortion.
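The metadata header above declares a single `eval` split and nine features. As a minimal sketch of how that schema might be used, the snippet below checks that a row carries the declared fields with the declared scalar types. The field names and dtypes come from the header; the helper names `validate_row` and `load_eval_split` are hypothetical, and the loader assumes that the Hugging Face `datasets` package is installed and that the repository is reachable.

```python
# Field names and dtypes are copied from the card's metadata header.
AUDIO_FIELDS = ("mix", "speech")
STRING_FIELDS = ("transcript", "id", "language", "speaker_id",
                 "conversation_type", "speech_source")
INT_FIELDS = ("index",)


def validate_row(row):
    """Return a list of problems; an empty list means the row matches the schema."""
    problems = []
    # Every declared field must be present.
    for name in AUDIO_FIELDS + STRING_FIELDS + INT_FIELDS:
        if name not in row:
            problems.append(f"missing field: {name}")
    # Scalar fields must have the declared Python-level type.
    for name in STRING_FIELDS:
        if name in row and not isinstance(row[name], str):
            problems.append(f"{name} should be a string")
    for name in INT_FIELDS:
        if name in row and not isinstance(row[name], int):
            problems.append(f"{name} should be an integer")
    return problems


def load_eval_split(repo_id="ai-coustics/dawn_chorus_en"):
    """Load the `eval` split (assumes `datasets` is installed and the Hub is reachable)."""
    from datasets import load_dataset  # lazy import: validate_row works without it
    return load_dataset(repo_id, split="eval")
```

Under those assumptions, `all(not validate_row(r) for r in load_eval_split())` would be a quick sanity check before running an evaluation. Note that `speaker_id` is declared as a string even though it holds a numeric identifier.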
## Dataset Description

### Direct Use

This dataset is intended for evaluation of models that suppress background speech while preserving a primary/foreground speaker in conditions relevant to downstream speech-to-text (STT) systems. Recommended uses include:

- Benchmarking background speech suppression performance on realistic multi-speaker mixtures
- Measuring STT robustness by computing word error rate (WER) on processed mixtures and comparing against reference transcripts
- Evaluating primary-speaker isolation / target-speaker extraction systems
- Comparing speech enhancement model trade-offs between suppression strength and foreground speech distortion

### Technical Details

- 450 `mix` and `speech` pairs of equal length with a total duration of 01:31:19 [hh:mm:ss]
- Minimum duration: 5.43 s
- Maximum duration: 17.77 s
- Mean duration: 12.18 s
- 16 kHz sampling rate, 16-bit, mono
- Foreground speech source distribution: 65 % recorded speech (19 speakers), 35 % synthesized speech (7 speakers)
- Voice gender distribution (self-identified): 44 % female-sounding voices, 56 % male-sounding voices
- Transmission channel distribution: 67 % GSM, 16.5 % WhatsApp, 16.5 % Telegram

### Dataset Structure

Each row in the dataset contains the following fields:

- **`mix`**: 16 kHz WAV audio of foreground speech mixed with background speech (mixtures)
- **`speech`**: 16 kHz WAV audio of foreground speech
- **`transcript`**: Ground-truth transcription corresponding to `speech` audio
- **`id`**: Unique sample identifier following the scheme: `language_speakerID_conversationType_speechSource_index`
- **`language`**: Language code of the utterance
- **`speaker_id`**: Numeric identifier of the speaker
- **`conversation_type`**: Type of speech interaction:
  - `interactive`: dialog-style or conversational speech
  - `narrative`: monologic or storytelling speech
- **`speech_source`**: Origin of the foreground speech:
  - `human`: human speech
  - `machine`: machine-generated speech
- **`index`**: Integer index distinguishing multiple samples from the same speaker

### Dataset Sources

**Foreground speech**

- Either
  - produced by the ai-coustics recording campaign
    - Single-speaker recordings made with a Schoeps MK4 condenser microphone inside an acoustically treated, nearly anechoic recording booth
    - ~10–15 cm distance between speaker and microphone
    - Recordings were denoised and cleaned of mouth noises, clicks, plosives, and rustling sounds. No further EQing or compression was applied, although some proximity effect is present
    - Conversational and narrative styles
- or
  - synthesized via text-to-speech models
    - Included for augmentation and prosodic diversity
    - Reflecting real-world production scenarios in which synthetic voices are increasingly used in conversational contexts

**Background speech**

- Public-domain, non-anechoic, degraded speech recordings in the target language, including informational, conversational, and narrative styles as well as background music and noise
- Selected to represent realistic competing-speaker characteristics (prosody, speaking rate, articulation variability)

**Transcriptions**

- The foreground speech recordings were transcribed by professional linguists through a specialized audio transcription service. All transcripts are fully human-produced and quality-checked to ensure high accuracy and linguistic reliability.

### Dataset Production

- Foreground speech was played through an [artificial mouth](https://www.grasacoustics.com/products/mouth-simulators/product/280-44aa) in proximity to one of the following recording devices
  1. Samsung S22 in hands-free mode
     - transmitting audio via either
       - GSM network
       - WhatsApp call
     - to
       - Google Pixel 6A
  2. MacBook Pro M4
     - transmitting audio via
       - Telegram call
     - to
       - Google Pixel 6A
- Background speech was played back simultaneously in an immersive loudspeaker setup and was recorded within the previously mentioned recording setups

![Behind the scenes](./dawn_chorus_bts.jpeg)

*Behind the scenes*

## Dataset Details

- **Curated by:** Leonardo Nerini, Butch Warns, Joschka Wohlgemuth, Luis Küffner, Théo Fuhrmann
- **Funded by:** ai-coustics GmbH
- **Language:** English
- **License:** CC BY-NC 4.0
- **Contact:**
  - Email: info@ai-coustics.com
  - Web: https://ai-coustics.com

### Citation

```bibtex
@dataset{dawn_chorus_en,
  title     = {dawn_chorus_en: An evaluation dataset for accurate foreground speaker transcription},
  author    = {Leonardo Nerini and Butch Warns and Joschka Wohlgemuth and Luis Küffner and Théo Fuhrmann},
  year      = {2026},
  publisher = {ai-coustics GmbH},
  license   = {CC BY-NC 4.0},
  url       = {https://ai-coustics.com}
}
```
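The WER-based evaluation recommended under Direct Use can be sketched as follows. This is a minimal, standard-library illustration and not an official scoring script: the text normalization (lowercasing, dropping punctuation) is an assumption the card does not prescribe, `hypotheses` is a hypothetical mapping from each row's `id` to the STT output obtained on the processed `mix`, and the function names are illustrative. Grouping by `speech_source` or `conversation_type` uses the fields described under Dataset Structure.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and drop punctuation (an assumed normalization, not prescribed by the card)."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance using a rolling dynamic-programming row."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(rows, hypotheses, key="speech_source"):
    """Corpus-level WER per value of `key`: total word edits / total reference words."""
    edits = defaultdict(int)
    words = defaultdict(int)
    for row in rows:
        ref = normalize(row["transcript"])
        hyp = normalize(hypotheses[row["id"]])
        edits[row[key]] += edit_distance(ref, hyp)
        words[row[key]] += len(ref)
    return {group: edits[group] / words[group] for group in words if words[group] > 0}
```

Running the same function once on transcripts of the unprocessed `mix` and once on transcripts of a model's output gives a before/after comparison per group. Passing `key="conversation_type"` splits the score by `interactive` versus `narrative` speech instead.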


Provider:

ai-coustics
