burak-ozenc/dawn-chorus-codec-labels

Name: burak-ozenc/dawn-chorus-codec-labels
Creator: burak-ozenc
Published: 2026-04-11 10:53:01
License: no description available

Hugging Face · Updated 2026-04-11 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/burak-ozenc/dawn-chorus-codec-labels

Resource overview:

---
language:
- en
tags:
- audio
- speech
- codec
- gsm
- whatsapp
- telegram
- dawn-chorus
pretty_name: Dawn Chorus EN — Codec Labels
---

# Dawn Chorus EN - Codec Labels

Sidecar dataset for [ai-coustics/dawn_chorus_en](https://huggingface.co/datasets/ai-coustics/dawn_chorus_en). It adds a `codec_guess` column that classifies the audio source type (GSM, WhatsApp, or Telegram) of each sample, which the original dataset does not provide. The classification is based on spectral analysis of the `speech` channel of the original dataset.

Because the original dataset documents only the overall distribution of source types (67% GSM, 16.5% WhatsApp, 16.5% Telegram), this classification is unsupervised, guided by that known prior distribution.

## Motivation

The `dawn_chorus_en` dataset covers three transmission channels (GSM, WhatsApp, Telegram) without specifying which one each sample came from. Enhancement models (DeepFilterNet, NoiseReduce) are reported with aggregated metrics (SI-SDR, PESQ, WER) that obscure codec-specific performance differences. So I decided to test some hypotheses on the actual dataset.

## Method

All audio in `dawn_chorus_en` is resampled to 16 kHz, which destroys the native sample rate. This classification uses two spectral features to recover what is lost during that preprocessing:

**1. bw_99** - the frequency below which 99% of the signal's power falls, computed via Welch PSD. GSM has a hard codec ceiling around 3.4 kHz that remains detectable even after upsampling.

**2. spectral_slope** - the slope of a log-log linear fit of the PSD above 1 kHz. Steeper (more negative) = sharper rolloff = heavier codec compression. Used to separate WhatsApp from Telegram within the wide-bandwidth group.

### Decision rule

```python
def classify_codec(row):
    # Narrow bandwidth: 99% of the power sits below the calibrated cutoff (Hz).
    if row['bw_99'] < 4472:
        return 'GSM'
    # Wide bandwidth: a steeper (more negative) slope is treated as WhatsApp.
    if row['spectral_slope'] < -0.45:
        return 'WhatsApp'
    return 'Telegram'
```

GSM is the clear-cut case: it has a hard frequency ceiling around 3.4 kHz.
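The two features from the Method section could be computed roughly as follows. This is a minimal sketch, not the extraction code used to build the dataset: the function names, the `nperseg=1024` Welch segment length, and the use of `numpy.polyfit` for the log-log fit are all assumptions made for illustration.

```python
import numpy as np
from scipy.signal import welch


def bandwidth_percentile(audio, sr=16000, pct=0.99, nperseg=1024):
    """Frequency (Hz) below which `pct` of the Welch PSD power falls.

    With pct=0.99 this corresponds to the bw_99 feature described above.
    nperseg is an assumed setting, not taken from the dataset card.
    """
    freqs, psd = welch(audio, fs=sr, nperseg=nperseg)
    cumulative = np.cumsum(psd) / np.sum(psd)
    idx = np.searchsorted(cumulative, pct)
    return float(freqs[min(idx, len(freqs) - 1)])


def spectral_slope(audio, sr=16000, fmin=1000.0, nperseg=1024):
    """Slope of a linear fit of log10(PSD) against log10(frequency) above fmin."""
    freqs, psd = welch(audio, fs=sr, nperseg=nperseg)
    # Skip bins below fmin and any exactly-zero bins (log10 undefined).
    mask = (freqs >= fmin) & (psd > 0)
    slope, _intercept = np.polyfit(np.log10(freqs[mask]), np.log10(psd[mask]), 1)
    return float(slope)
```

Applied to a 16 kHz mono array, `bandwidth_percentile(audio)` and `spectral_slope(audio)` give values that can be fed to `classify_codec` as `bw_99` and `spectral_slope`. Exact numbers depend on the assumed Welch settings, so thresholds such as 4472 Hz and -0.45 would need re-checking against whatever settings are actually used.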
I ran the cutoff search on a 200-sample subset (to be able to handle the error rate better), using `scipy.optimize.minimize_scalar` to set the cutoff value so that the GSM count matches the expected 67% rate (134 of 200):

```
4000 Hz - 110 GSM - too few, distance = 24
5000 Hz - 160 GSM - too many, distance = 26
4472 Hz - 134 GSM - closest to 134, distance = 0
```

## Results

| codec_guess | count | proportion | expected |
|-------------|-------|------------|----------|
| GSM         | 299   | 66.4%      | 67.0%    |
| WhatsApp    | 88    | 19.6%      | 16.5%    |
| Telegram    | 63    | 14.0%      | 16.5%    |

GSM classification is highly reliable (0.6 percentage points off the expected share). WhatsApp and Telegram are slightly misclassified into each other (around 10-15 samples) due to spectral slope overlap in the ambiguous zone after resampling.

## Key findings from cross-tab analysis

- **WhatsApp is 64% machine-generated speech** - the highest synthetic voice ratio across all three codecs. GSM and Telegram are 70-83% human.
- **Telegram has 35% narrative content** vs around 7% for GSM/WhatsApp - reflecting how each platform was used during collection (monologue sharing vs phone calls).
- **Codec is speaker-specific** - certain speakers appear almost exclusively in one codec. Speaker identity and codec are correlated, meaning aggregated benchmark metrics cannot isolate codec effect from speaker characteristics.
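Cross-tabulations like the ones behind these findings can be reproduced from the label table with pandas. The sketch below assumes only the schema columns `codec_guess`, `speech_source`, and `conversation_type`; the helper name `codec_crosstab` is hypothetical.

```python
import pandas as pd


def codec_crosstab(df, column):
    """Row-normalized cross-tab: share of each `column` value within each codec_guess."""
    return pd.crosstab(df["codec_guess"], df[column], normalize="index")
```

For example, `codec_crosstab(df, "speech_source")` returns one row per codec with the human/machine shares summing to 1, and `codec_crosstab(df, "speaker_id")` shows how concentrated each speaker is in a single codec.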
## Schema

| column            | type   | description                                   |
|-------------------|--------|-----------------------------------------------|
| id                | string | original sample id from dawn_chorus_en        |
| speaker_id        | string | speaker identifier                            |
| language          | string | always `en`                                   |
| conversation_type | string | `interactive` or `narrative`                  |
| speech_source     | string | `human` or `machine`                          |
| index             | int64  | original sample index                         |
| bw_95             | float  | frequency below which 95% of power falls (Hz) |
| bw_99             | float  | frequency below which 99% of power falls (Hz) |
| spectral_slope    | float  | log-log PSD slope above 1 kHz                 |
| ceiling_ratio     | float  | energy ratio 4k-8k band / 3k-4k band          |
| vad_rate          | float  | silence transitions per second                |
| spectral_flatness | float  | mean spectral flatness                        |
| mfcc_hi_std       | float  | std of MFCC coefficients 9-13                 |
| r_0_3k            | float  | normalized energy share 0-3 kHz               |
| r_3k_4k           | float  | normalized energy share 3-4 kHz               |
| r_4k_8k           | float  | normalized energy share 4-8 kHz               |
| r_8k_p            | float  | normalized energy share 8 kHz+                |
| codec_guess       | string | `GSM`, `WhatsApp`, or `Telegram`              |

## Usage

```python
from datasets import load_dataset

labels = load_dataset("burak-ozenc/dawn-chorus-codec-labels", split="train")
df = labels.to_pandas()

# join with original dataset by id
# df.merge(your_results_df, on='id')

# filter by codec
gsm_only = df[df['codec_guess'] == 'GSM']
```

## Limitations

- Classification is unsupervised and guided only by the prior distribution - no ground truth labels exist in the original dataset
- WhatsApp/Telegram boundary is soft due to spectral slope overlap after 16 kHz resampling
- All features extracted from the `speech` channel (clean), not the `mix` channel
- Eval split only (450 samples)

## Citation

```
@dataset{dawn_chorus_en,
  title     = {dawn_chorus_en: An evaluation dataset for accurate foreground speaker transcription},
  author    = {Leonardo Nerini and Butch Warns and Joschka Wohlgemuth and Luis Küffner and Théo Fuhrmann},
  year      = {2026},
  publisher = {ai-coustics GmbH},
  license   = {CC BY-NC 4.0},
  url       = {https://ai-coustics.com}
}
```

Provider:

burak-ozenc

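Collected summary

Dataset introduction

Construction method

In speech enhancement and audio processing, accurately identifying the transmission codec is essential for evaluating model performance. This dataset is a sidecar annotation set for the original `dawn_chorus_en` dataset, built with an unsupervised classification method. Its core idea is to use the known prior distribution of audio source types (67% GSM, 16.5% each for WhatsApp and Telegram) to infer the transmission type that is not labeled in the original data. The method is based on spectral analysis and extracts two key features. The first is the frequency below which 99% of the signal power falls, which captures the hard 3.4 kHz ceiling inherent to the GSM codec. The second is the spectral slope, obtained from a log-log linear fit of the power spectral density above 1 kHz, which separates WhatsApp from Telegram within the wide-bandwidth group. The decision rule sets thresholds on these features. It classifies GSM with high reliability, while WhatsApp and Telegram are slightly confused with each other because their spectra overlap after resampling.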
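The prior-guided threshold can be illustrated with a small sketch: choose the `bw_99` cutoff so that the share of samples below it matches the 67% GSM prior (134 of 200 in the dataset card's calibration run). The dataset card names `scipy.optimize.minimize_scalar` for this search; the version below swaps in a direct lookup on the sorted values, which reaches the same zero-distance target when there are no ties. All function names are hypothetical.

```python
import numpy as np


def gsm_count(bw_99_values, cutoff_hz):
    """Number of samples that the rule `bw_99 < cutoff_hz` would label GSM."""
    return int(np.sum(np.asarray(bw_99_values, dtype=float) < cutoff_hz))


def target_distance(bw_99_values, cutoff_hz, gsm_prior=0.67):
    """Absolute gap between the GSM count at this cutoff and the prior-implied count."""
    target = round(gsm_prior * len(bw_99_values))
    return abs(gsm_count(bw_99_values, cutoff_hz) - target)


def calibrate_cutoff(bw_99_values, gsm_prior=0.67):
    """Cutoff halfway between the two sorted values that straddle the target count."""
    values = np.sort(np.asarray(bw_99_values, dtype=float))
    target = round(gsm_prior * len(values))
    if not 0 < target < len(values):
        raise ValueError("prior must leave at least one sample on each side")
    # Any cutoff strictly between values[target-1] and values[target] puts
    # exactly `target` samples below it (assuming those two values differ).
    return float((values[target - 1] + values[target]) / 2.0)
```

Because the count is a step function of the cutoff, a whole interval of cutoffs gives distance 0. The midpoint is one arbitrary choice within it, so a value such as 4472 Hz should be read as one point in that interval rather than a sharp optimum.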

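Features

The dataset has a clear structure. Its `codec_guess` column provides a hypothesized audio source type (GSM, WhatsApp, or Telegram) for each sample, filling in the transmission-channel information that the original dataset lacks. It also carries a range of acoustic features, including bandwidth measures, spectral slope, spectral flatness, and the standard deviation of high-order MFCC coefficients. All of them are extracted from the clean speech channel and give several angles from which to analyze codec characteristics. Notably, the cross-tab analysis shows a link between platform usage patterns and speech content: Telegram has a markedly higher share of narrative content, while WhatsApp contains more machine-generated speech. Beyond codec classification, the dataset can therefore support cross-cutting studies of speech synthesis, speaker identification, and platform behavior differences.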

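Usage

The dataset is straightforward to integrate into speech enhancement evaluation and audio analysis workflows. It can be loaded directly with the Hugging Face `datasets` library and converted to a pandas DataFrame. A typical use is to merge its labels with the original `dawn_chorus_en` dataset by sample ID so that every audio sample receives a hypothesized codec label. Researchers can then filter on the `codec_guess` column to select samples for a specific codec, for example GSM only, and examine how different transmission compression schemes affect speech quality metrics. This lets previously aggregated evaluation metrics be broken down by codec, which helps expose differences in model performance across transmission conditions.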
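A per-codec breakdown of this kind could look like the sketch below. It assumes a hypothetical `results_df` holding one row per sample with an `id` column and a numeric metric column (for instance `si_sdr`); neither that table nor the helper name `per_codec_summary` is part of the dataset.

```python
import pandas as pd


def per_codec_summary(labels_df, results_df, metric):
    """Join per-sample results with codec labels and summarize `metric` per codec."""
    merged = results_df.merge(
        labels_df[["id", "codec_guess"]],
        on="id",
        how="inner",
        validate="one_to_one",  # fail loudly if ids are duplicated on either side
    )
    return merged.groupby("codec_guess")[metric].agg(["count", "mean", "std"])
```

With the label table loaded as `df`, `per_codec_summary(df, results_df, "si_sdr")` yields a small table with one row per codec. Since codec and speaker are correlated in this dataset, differences between rows should not be attributed to the codec alone.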

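Background and challenges

Background overview

In speech enhancement and audio processing, it is important to evaluate model performance accurately under different transmission conditions. The Dawn Chorus EN - Codec Labels dataset was published by burak-ozenc in 2026 as a sidecar annotation set for `dawn_chorus_en`, the original dataset from ai-coustics. It addresses the missing audio source type labels in the original data by using unsupervised classification to infer the transmission codec of each speech sample: GSM, WhatsApp, or Telegram. Its central question is how different codecs shape the spectral characteristics of speech, in support of finer-grained evaluation of speech enhancement models in real communication scenarios.

Current challenges

The dataset faces two main challenges. At the domain level, speech enhancement models are usually evaluated with aggregated metrics, which hide codec-induced performance differences and make it hard to quantify the effect of compression on speech quality. At the construction level, the original audio was resampled to 16 kHz, so the native sample rate is lost and the codec must be inferred from spectral features. The spectral slopes of WhatsApp and Telegram overlap, which blurs the boundary between them and limits classification accuracy. In addition, there are no ground-truth labels, so the classification can only be guided by the prior distribution, which adds further uncertainty to the results.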

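Common scenarios

Typical use case

In audio processing and communications, the Dawn Chorus Codec Labels dataset serves as a sidecar annotation resource whose typical use is research on unsupervised classification of speech codec types. By analyzing spectral features of the audio signal, such as bandwidth and spectral slope, it distinguishes the coding characteristics of the GSM, WhatsApp, and Telegram transmission channels. This deepens the understanding of how the speech spectrum changes during compression and transmission, and it gives downstream speech enhancement models fine-grained codec labels for more precise evaluation and optimization.

Academic problems addressed

The dataset mainly addresses the ambiguity in performance evaluation that arises when the codec type is unknown. Conventional evaluations of speech enhancement models aggregate metrics across codecs and hide codec-specific differences. Dawn Chorus Codec Labels provides spectrally derived codec guesses that help researchers start to separate codec effects from speaker characteristics. This helps reveal how different codec compression affects speech quality, supports the development of codec-aware speech processing, and offers an empirical case of unsupervised classification applied to audio.

Derived related work

Work derived from this dataset centers on the development and evaluation of codec-specific speech enhancement models. Using the codec labels it provides, researchers have trained deep learning models that distinguish GSM, WhatsApp, and Telegram, such as spectral-feature classifiers or end-to-end enhancement networks. These efforts further explore the correlation of codec with speaker identity and content type, promote cross-platform speech processing benchmarks, and open new directions for fine-grained analysis of communication audio, such as codec-invariant representation learning and adaptive post-processing.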

The content above was collected and summarized by 遇见数据集.
