manassehzw/sna-dataset-annotated
Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/manassehzw/sna-dataset-annotated
---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: transcription
    dtype: string
  - name: source_id
    dtype: string
  - name: speaker_id
    dtype: int32
  - name: speaker_clip_count
    dtype: int32
  - name: language
    dtype: string
  - name: gender
    dtype: string
  - name: has_punctuation
    dtype: bool
  - name: snr_db
    dtype: float32
  - name: speech_ratio
    dtype: float32
  - name: quality_score
    dtype: float32
  - name: duration
    dtype: float32
  - name: speaker_assignment_confidence
    dtype: float32
  splits:
  - name: train
    num_examples: 12170
  - name: validation
    num_examples: 1504
  - name: test
    num_examples: 1565
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
- text-to-speech
language:
- sna
tags:
- audio
- speech
- shona
- asr
- tts
- african-language
source_datasets:
- google/WaxalNLP
pretty_name: Shona Speech Dataset (SNA) - Annotated
size_categories:
- 10K<n<100K
---
# manassehzw/sna-dataset-annotated
An annotated, speaker-relabelled, and loudness-normalised Shona (`sna`) speech dataset prepared through a reproducible Modal-based data engineering pipeline.
This release addresses speaker label contamination in the original source labels by replacing identity columns with acoustically-derived speaker assignments.
### Dataset Description
- **Curated by:** [Manasseh Changachirere (Harare Institute of Technology)](https://www.manasseh.dev/)
- **Derived from:** [google/WaxalNLP](https://huggingface.co/datasets/google/WaxalNLP/) and [manassehzw/sna-dataset](https://huggingface.co/datasets/manassehzw/sna-dataset)
- **Language:** Shona 🇿🇼
- **Repository:** `manassehzw/sna-dataset-annotated`
- **Total Clips:** 15239
- **Total Speech Hours:** 78.501
- **Unique Speakers:** 46
## Why this annotated release exists
The original source speaker labels are contaminated (multiple voices assigned to the same identity). This release replaces source identity labels with programmatically derived speaker clusters and rederived gender labels.
### Pre/Post relabelling snapshot
- **Pre-classification clips:** 16980
- **Pre-classification unique source speakers:** 133
- **Post-classification speaker clusters:** 46
- **Noise clips dropped after rescue:** 1741
- **Noise clips rescued by centroid similarity:** 794
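The headline figures can be cross-checked against each other. The short sketch below uses only numbers quoted in this card; the variable names are illustrative and not part of the pipeline.

```python
# Figures copied from this card.
pre_clips = 16980
dropped_noise = 1741
total_clips = 15239
split_sizes = {"train": 12170, "validation": 1504, "test": 1565}
total_hours = 78.501

# Clips remaining once the unrescued noise clips are dropped.
kept_after_drop = pre_clips - dropped_noise

# Mean clip length implied by the total hours and the clip count.
mean_clip_seconds = total_hours * 3600 / total_clips

print(kept_after_drop == total_clips)
print(sum(split_sizes.values()) == total_clips)
print(f"mean clip duration: {mean_clip_seconds:.1f} s")
```

Both checks should hold, and the implied mean clip length is on the order of 18 to 19 seconds.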
## Processing Summary
1. **Speaker relabelling** using ECAPA embeddings + HDBSCAN + noise rescue.
2. **Gender relabelling** using a Shona-calibrated Logistic Regression classifier trained on ECAPA embeddings.
3. **Noise drop** (`cluster_id == -1`) and schema rebuild.
4. **Loudness normalisation** to -23 LUFS (EBU R128) with clipping protection.
5. **Speaker-stratified split** into train/validation/test.
## Relabelling Method
- **Speaker embeddings:** `speechbrain/spkrec-ecapa-voxceleb` (192-d)
- **Clustering:** HDBSCAN (`min_cluster_size=50`, `min_samples=10`, metric `euclidean`, method `eom`)
- **Noise rescue:** cosine similarity threshold `0.75`
- **Gender model:** Logistic Regression on L2-normalised ECAPA embeddings
  - Training clips (female): 160
  - Training clips (male): 152
  - Train accuracy: 1.0
  - CV accuracy: 1.0
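The noise rescue step can be illustrated with a minimal, dependency-free sketch. It assumes that cluster centroids are the mean of L2-normalised member embeddings and that a noise clip (label `-1`) is reassigned to its most similar centroid when the cosine similarity reaches the `0.75` threshold; the card does not spell out these details, and the function name `rescue_noise` is hypothetical.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalise(a), l2_normalise(b)
    return sum(x * y for x, y in zip(a, b))


def rescue_noise(embeddings, labels, threshold=0.75):
    """Reassign noise points (label -1) to the nearest centroid if similar enough.

    Assumption: centroids are the mean of L2-normalised member embeddings.
    """
    members = {}
    for emb, lab in zip(embeddings, labels):
        if lab != -1:
            members.setdefault(lab, []).append(l2_normalise(emb))
    centroids = {
        lab: [sum(col) / len(vecs) for col in zip(*vecs)]
        for lab, vecs in members.items()
    }

    rescued = list(labels)
    if not centroids:
        return rescued
    for i, (emb, lab) in enumerate(zip(embeddings, labels)):
        if lab != -1:
            continue
        best_lab, best_sim = max(
            ((c, cosine(emb, cen)) for c, cen in centroids.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            rescued[i] = best_lab  # rescued; otherwise the clip stays noise
    return rescued


# Toy 3-d embeddings standing in for 192-d ECAPA vectors.
toy_embeddings = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],   # cluster 0
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],   # cluster 1
    [0.95, 0.05, 0.0],                  # noise, close to cluster 0
    [0.0, 0.0, 1.0],                    # noise, orthogonal to both clusters
]
toy_labels = [0, 0, 1, 1, -1, -1]
print(rescue_noise(toy_embeddings, toy_labels))  # [0, 0, 1, 1, 0, -1]
```

Clips that remain at `-1` after this step correspond to the ones removed in the noise drop.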
## Loudness Normalisation
- **Target:** -23 LUFS
- **Skip tolerance:** +/-1 LU
- **Post-gain protection:** hard clip to [-1.0, 1.0]
- **Input LUFS mean/std:** -22.659 / 5.301
- **Output LUFS mean/std:** -22.999 / 0.243
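As a rough sketch of the gain logic these settings imply: given an integrated loudness that has already been measured (for example with an ITU-R BS.1770 meter, which is not shown here), the clip is left untouched inside the tolerance band, otherwise a linear gain is applied and the result is hard clipped. The function name, the inclusive tolerance check, and the list-of-floats representation are assumptions made for illustration.

```python
def normalise_loudness(samples, measured_lufs, target_lufs=-23.0, tolerance_lu=1.0):
    """Apply gain towards the target loudness, then hard clip to [-1.0, 1.0].

    `measured_lufs` is assumed to come from an external loudness meter.
    """
    gain_db = target_lufs - measured_lufs
    if abs(gain_db) <= tolerance_lu:
        # Already within +/-1 LU of the target: skip processing.
        return list(samples)
    gain = 10.0 ** (gain_db / 20.0)  # dB to linear amplitude factor
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


print(normalise_loudness([0.5, -0.5], measured_lufs=-22.5))  # within tolerance
print(normalise_loudness([0.5, -0.5], measured_lufs=-17.0))  # attenuated by 6 dB
print(normalise_loudness([0.5, -0.5], measured_lufs=-43.0))  # boosted, then clipped
```

Hard clipping is the simplest form of protection; it guarantees the sample range but can distort clips that need a large positive gain.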
## Splits
- Train: 12170
- Validation: 1504
- Test: 1565
The split is stratified by speaker: each speaker's clips are divided across train, validation, and test by clip proportion, which preserves the speaker distribution in every split.
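A speaker-stratified split of this kind can be sketched as follows. The 80/10/10 ratios are an assumption inferred from the split sizes above (roughly 79.9% / 9.9% / 10.3%), and the function name, seed, and rounding rule are hypothetical rather than taken from the pipeline.

```python
import random
from collections import defaultdict


def speaker_stratified_split(rows, ratios=(0.8, 0.1, 0.1), seed=0):
    """Divide each speaker's clips across the splits in roughly `ratios`."""
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row["speaker_id"]].append(row)

    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for speaker_id in sorted(by_speaker):
        clips = by_speaker[speaker_id]
        rng.shuffle(clips)
        n_val = round(len(clips) * ratios[1])
        n_test = round(len(clips) * ratios[2])
        n_train = len(clips) - n_val - n_test  # remainder goes to train
        splits["train"].extend(clips[:n_train])
        splits["validation"].extend(clips[n_train:n_train + n_val])
        splits["test"].extend(clips[n_train + n_val:])
    return splits


# Toy data: 3 speakers with 50 clips each.
toy_rows = [
    {"speaker_id": spk, "source_id": f"clip_{spk}_{i}"}
    for spk in range(3)
    for i in range(50)
]
toy_splits = speaker_stratified_split(toy_rows)
print({name: len(part) for name, part in toy_splits.items()})
```

Because every speaker contributes to every split, this design suits speaker-dependent evaluation; it does not measure generalisation to unseen speakers.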
## Data Fields
- **`audio`**: 24kHz mono float audio (loudness-normalised)
- **`transcription`**: normalised Shona transcription
- **`source_id`**: original clip identifier from source dataset
- **`speaker_id`**: acoustically-derived speaker class id
- **`speaker_clip_count`**: clip count for the assigned speaker_id
- **`language`**: language code (`sna`)
- **`gender`**: cluster-level resolved label (`Female` / `Male` / `Unknown`)
- **`has_punctuation`**: punctuation indicator from normalised transcript
- **`snr_db`**: signal-to-noise proxy metric
- **`speech_ratio`**: fraction of VAD frames classified as speech
- **`quality_score`**: composite quality metric
- **`duration`**: clip duration in seconds
- **`speaker_assignment_confidence`**: confidence for speaker assignment
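The `gender` field is described as a cluster-level resolved label with an `Unknown` option. One plausible way to obtain such a label is a majority vote over per-clip classifier predictions within each speaker cluster, falling back to `Unknown` when the vote is not decisive. The card does not state the actual rule, so the sketch below, including the `min_share` threshold of 0.8, is purely an assumption.

```python
from collections import Counter


def resolve_cluster_gender(clip_predictions, min_share=0.8):
    """Return the majority label for a cluster, or "Unknown" if it is not decisive.

    `min_share` is a hypothetical threshold on the winning label's vote share.
    """
    counts = Counter(clip_predictions)
    if not counts:
        return "Unknown"
    label, votes = counts.most_common(1)[0]
    return label if votes / len(clip_predictions) >= min_share else "Unknown"


print(resolve_cluster_gender(["Female"] * 9 + ["Male"]))      # Female
print(resolve_cluster_gender(["Female"] * 5 + ["Male"] * 5))  # Unknown
```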
## Uses
### Direct use
- Shona ASR model training and adaptation
- TTS subset construction by filtering on speaker and quality metadata
- Speech data quality analysis and dataset curation workflows
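Building a TTS subset from the metadata might look like the predicate below. The field names come from this card, but every threshold is a hypothetical placeholder (the card does not document the value ranges of `snr_db` or `speaker_assignment_confidence`), so they should be tuned against the actual distributions.

```python
def keep_for_tts(
    row,
    min_snr_db=20.0,          # hypothetical threshold
    min_confidence=0.8,       # hypothetical; assumes a 0-1 confidence scale
    min_duration=2.0,         # seconds, hypothetical
    max_duration=20.0,        # seconds, hypothetical
    min_speaker_clips=500,    # hypothetical: keep only well-represented speakers
):
    """Return True if a clip looks suitable for single-speaker TTS training."""
    return (
        row["snr_db"] >= min_snr_db
        and row["speaker_assignment_confidence"] >= min_confidence
        and min_duration <= row["duration"] <= max_duration
        and row["speaker_clip_count"] >= min_speaker_clips
        and row["has_punctuation"]
    )


# Synthetic rows for illustration only; these are not real dataset values.
sample_rows = [
    {"snr_db": 32.0, "speaker_assignment_confidence": 0.95, "duration": 8.0,
     "speaker_clip_count": 900, "has_punctuation": True},
    {"snr_db": 12.0, "speaker_assignment_confidence": 0.95, "duration": 8.0,
     "speaker_clip_count": 900, "has_punctuation": True},
    {"snr_db": 32.0, "speaker_assignment_confidence": 0.40, "duration": 8.0,
     "speaker_clip_count": 900, "has_punctuation": True},
]
print([keep_for_tts(row) for row in sample_rows])  # [True, False, False]
```

With the Hugging Face `datasets` library, the same predicate can be passed to `Dataset.filter` after loading the repository with `load_dataset("manassehzw/sna-dataset-annotated", split="train")`.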
### Out-of-scope
- Identity verification / forensic use without additional validation
- Demographic representativeness claims without dedicated study
## Bias, Risks, and Limitations
- Inherits source demographic/dialect distribution.
- Relabelled speaker IDs are acoustic clusters, not identity-verified persons.
- Confidence values are useful for filtering, not absolute truth scores.
- Some residual label uncertainty can remain in ambiguous/noisy clips.
## Citation
If you use this dataset, cite both this release and the source dataset:
```bibtex
@inproceedings{niang2024waxalnlp,
  title={WaxalNLP: A Large Scale High Quality Speech Dataset for African Languages},
  author={Niang, El Hadj Mamadou and Dieng, Moustapha and Ba, Thierno Ibrahima and Ndiaye, Mamadou Boumedine and others},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  year={2024}
}
```
Provided by: manassehzw



